pyguides

lzma module

Overview

lzma is the Python standard library module for LZMA compression, the algorithm behind .xz files. It supports file-based compression through LZMAFile and lzma.open(), and one-shot in-memory compression of bytes via compress() and decompress(). LZMA typically produces smaller output than gzip or bzip2 at the cost of slower compression, which makes it a good fit for archival and other data that is written once and needs maximum compression.

Basic Compression and Decompression

import lzma

# Compress bytes
original = b"This is a test message that will be compressed."
compressed = lzma.compress(original)
decompressed = lzma.decompress(compressed)

print(len(original))       # 47 bytes
print(len(compressed))    # more than 47: container overhead outweighs savings on tiny input
print(decompressed == original)  # True

For small inputs, compressed output can exceed original size. LZMA shines on larger data.
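
To see the crossover, a rough sketch such as the following compares a tiny input with a larger, repetitive one. The sample data is made up for illustration, and exact sizes vary with the liblzma version, so only the direction of the comparison is meaningful.

```python
import lzma

# Hypothetical sample inputs, chosen only to illustrate the size crossover
tiny = b"short message"
large = b"2024-01-01 12:00:00 INFO request handled in 12 ms\n" * 2000

tiny_xz = lzma.compress(tiny)
large_xz = lzma.compress(large)

# The .xz container overhead dominates the tiny input
print(len(tiny), len(tiny_xz))
# The repetitive larger input shrinks to a small fraction of its size
print(len(large), len(large_xz))
```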

Compressing Files with LZMAFile

Use LZMAFile as a context manager, similar to gzip:

import lzma

# Write compressed
with lzma.LZMAFile("data.txt.xz", "w") as f:
    f.write(b"Hello, world!\n" * 1000)

# Read compressed
with lzma.LZMAFile("data.txt.xz", "r") as f:
    content = f.read()

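# Read as text (text modes need lzma.open(); LZMAFile is binary-only)
with lzma.open("data.txt.xz", "rt") as f:
    text = f.read()

LZMAFile accepts only binary modes: "r", "w", "x", and "a", each optionally followed by "b". For text modes ("rt", "wt", "xt", "at"), use lzma.open(), which wraps the LZMAFile in a text layer.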
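
Text modes are handled by lzma.open(), which layers text encoding and decoding over the binary stream. A minimal sketch of a text round trip follows; the file name and the temporary directory are assumptions made for the example.

```python
import lzma
import os
import tempfile

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "notes.txt.xz")

# "wt" encodes str to bytes before compressing
with lzma.open(path, "wt", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# "rt" decompresses and decodes back to str, so line iteration works
with lzma.open(path, "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```

Passing encoding explicitly avoids depending on the platform's default text encoding.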

Reading Existing .xz Files

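When reading, lzma.open() and LZMAFile detect the container format from the leading bytes of the stream, so existing .xz files (and legacy .lzma files) open without extra arguments:

import lzma

# The format is detected from the file's magic bytes, not from its name
with lzma.open("archive.xz", "r") as f:
    data = f.read()

Writing works the same way, but the file name plays no part: output is written in the .xz format unless you pass a different format argument.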
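
As an illustration of detection by content rather than by name, the sketch below compresses the same made-up payload into both container formats and decompresses each with the default settings. XZ_MAGIC is a name introduced here for the example, not a constant of the lzma module.

```python
import lzma

payload = b"format detection example " * 100

xz_bytes = lzma.compress(payload)                               # .xz container (default)
alone_bytes = lzma.compress(payload, format=lzma.FORMAT_ALONE)  # legacy .lzma

# An .xz stream begins with this six-byte magic number
XZ_MAGIC = b"\xfd7zXZ\x00"
print(xz_bytes.startswith(XZ_MAGIC))
print(alone_bytes.startswith(XZ_MAGIC))

# FORMAT_AUTO, the default when decompressing, handles both containers
from_xz = lzma.decompress(xz_bytes)
from_alone = lzma.decompress(alone_bytes)
```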

Preset Compression Levels

lzma.compress() and lzma.LZMAFile() accept a preset parameter from 0 (fastest, least compression) to 9 (slowest, most compression):

import lzma

data = b"A" * 100000

fast   = lzma.compress(data, preset=0)
medium = lzma.compress(data, preset=6)
best   = lzma.compress(data, preset=9)

# All three outputs are tiny for input this repetitive; exact sizes depend on the liblzma version
print(len(fast))
print(len(medium))
print(len(best))

Default is preset=6. For most production use, preset 6 or 7 strikes a reasonable balance.
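
One way to choose a preset is to measure on data that resembles your own. The following is a rough sketch under that assumption: the log-like sample is invented, timings depend on the machine, and the loop stops at preset 6 only to keep the compressor's memory use modest.

```python
import lzma
import time

# Hypothetical sample: moderately repetitive log-like text, roughly 1 MB
data = b"".join(b"user=%d action=login status=ok\n" % (i % 500) for i in range(30000))

results = {}
for preset in (0, 3, 6):
    start = time.perf_counter()
    compressed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    results[preset] = (len(compressed), elapsed)
    print(f"preset={preset}: {len(compressed)} bytes in {elapsed:.3f} s")
```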

Check and Extreme Presets

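Two further knobs sit alongside the numeric preset. OR-ing a preset with lzma.PRESET_EXTREME trades additional compression time for a slightly better ratio, and check selects the integrity check stored in the .xz container (CHECK_CRC64 by default):

import lzma

data = b"A" * 100000

# Preset 9 plus the extreme flag: slower, marginally smaller output
extreme = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)

# CHECK_NONE omits the integrity check; it saves a few bytes but corruption goes undetected
unchecked = lzma.compress(data, format=lzma.FORMAT_XZ, preset=9, check=lzma.CHECK_NONE)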
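
The integrity check chosen at compression time is recorded in the .xz stream and can be read back through the check attribute of LZMADecompressor. A small sketch of this, where stream_check is a hypothetical helper written for the example:

```python
import lzma

data = b"integrity check example " * 200

default_check = lzma.compress(data)                       # CHECK_CRC64 for .xz
with_crc32 = lzma.compress(data, check=lzma.CHECK_CRC32)
without_check = lzma.compress(data, check=lzma.CHECK_NONE)

def stream_check(blob):
    """Return the check ID recorded in an .xz stream (hypothetical helper)."""
    decompressor = lzma.LZMADecompressor()
    decompressor.decompress(blob)
    return decompressor.check

print(stream_check(default_check) == lzma.CHECK_CRC64)
print(stream_check(with_crc32) == lzma.CHECK_CRC32)
print(stream_check(without_check) == lzma.CHECK_NONE)
```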

LZMA Format Variants

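lzma defines three concrete container formats, plus FORMAT_AUTO, which is the default when decompressing and detects .xz or .lzma input:

Constant        Typical use          Description
FORMAT_XZ       Default for files    Standard .xz format
FORMAT_ALONE    .lzma (legacy)       Old LZMA Utils format
FORMAT_RAW      Raw stream           Custom codec use, no container

import lzma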

# Write in legacy .lzma format
with lzma.LZMAFile("data.txt.lzma", "w", format=lzma.FORMAT_ALONE) as f:
    f.write(b"some data")

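# Write and read a raw LZMA2 stream (no container header).
# FORMAT_RAW requires an explicit filter chain, identical on both sides.
raw_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
raw = lzma.compress(b"raw bytes", format=lzma.FORMAT_RAW, filters=raw_filters)
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=raw_filters)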

Tuning with Filters

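Filters give fine-grained control over compression behavior. They are passed to compress() or LZMAFile as a list of dicts, one per filter, applied in order. A filter chain replaces the top-level preset argument, so the two cannot be combined; set preset inside the LZMA2 filter dict instead: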

import lzma

# LZMA2 filter (default for .xz)
filtered = lzma.compress(
    b"A" * 100000,
    filters=[{"id": lzma.FILTER_LZMA2, "preset": 9}]
)

# Delta filter (useful for images/audio: stores byte-wise differences)
with open("raw.img", "rb") as img:
    image_data = img.read()
delta_compressed = lzma.compress(
    image_data,
    filters=[
        {"id": lzma.FILTER_DELTA, "dist": 4},
        {"id": lzma.FILTER_LZMA2, "preset": 6},
    ]
)

The delta filter replaces each byte with its difference from the byte dist positions earlier. With dist=4, that compares matching channels of adjacent 4-byte pixels, which compresses well for images with smooth gradients.
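
The effect is easiest to see on synthetic data. The sketch below assumes a stand-in for smooth samples (32-bit values rising by a constant step) rather than a real image, and compares LZMA2 alone against a delta-plus-LZMA2 chain.

```python
import lzma
import struct

# Hypothetical stand-in for smooth sample data: 32-bit little-endian
# values that rise by a constant step, so bytes 4 apart are closely related
samples = b"".join(struct.pack("<I", i * 3) for i in range(50000))

plain = lzma.compress(samples, filters=[{"id": lzma.FILTER_LZMA2, "preset": 6}])
with_delta = lzma.compress(
    samples,
    filters=[
        {"id": lzma.FILTER_DELTA, "dist": 4},    # subtract the byte 4 positions back
        {"id": lzma.FILTER_LZMA2, "preset": 6},
    ],
)

print(len(samples), len(plain), len(with_delta))

# The filter chain is stored in the .xz header, so decompression needs no extra arguments
restored = lzma.decompress(with_delta)
```

Real image or audio data will show a smaller gap than this idealized input, so it is worth measuring before adopting the filter.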

Memory Usage

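Decompression memory is determined by the dictionary size chosen at compression time, and higher presets use larger dictionaries. The preset argument is only valid when writing; passing it when opening a file for reading raises ValueError. For readers with limited RAM, compress with a lower preset:

import lzma

# Compress with a low preset so that decompression needs little memory
with lzma.LZMAFile("data.xz", "w", preset=0) as f:
    f.write(b"payload" * 1000)

# Reading takes no preset; memory use follows the settings stored in the file
with lzma.LZMAFile("data.xz", "r") as f:
    data = f.read()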
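
On the reading side, memory can be capped with the memlimit argument of lzma.decompress() (also accepted by LZMADecompressor), which raises LZMAError when the stream needs more than the limit. The sketch below assumes the default preset needs more than 1 MiB to decode; decompress_capped is a hypothetical helper, and the limits are arbitrary example values.

```python
import lzma

payload = b"memory limit example " * 1000
compressed = lzma.compress(payload)  # default preset 6

def decompress_capped(blob, limit_bytes):
    """Return the decompressed bytes, or None if the cap is too low (hypothetical helper)."""
    try:
        return lzma.decompress(blob, memlimit=limit_bytes)
    except lzma.LZMAError:
        return None

small_cap = decompress_capped(compressed, 1 * 1024 * 1024)    # assumed too low for preset 6
large_cap = decompress_capped(compressed, 64 * 1024 * 1024)   # comfortably enough
print(small_cap is None, large_cap == payload)
```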

Combining with tarfile

For multi-file archives, combine lzma with tarfile:

import tarfile  # tarfile uses lzma internally for the :xz modes

# Create a .tar.xz archive
with tarfile.open("logs.tar.xz", "w:xz") as tar:
    tar.add("error.log")
    tar.add("access.log")

# Extract
with tarfile.open("logs.tar.xz", "r:xz") as tar:
    tar.extractall(path="./extracted")
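
For a version that runs without existing log files, the sketch below creates two placeholder files in a temporary directory, archives them, and lists the members without extracting. The file contents are invented, and the preset keyword is forwarded by tarfile to the LZMA compressor.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()

# Hypothetical log files created just for this example
for name in ("error.log", "access.log"):
    with open(os.path.join(workdir, name), "w") as f:
        f.write(f"sample contents of {name}\n" * 100)

archive_path = os.path.join(workdir, "logs.tar.xz")

with tarfile.open(archive_path, "w:xz", preset=6) as tar:
    for name in ("error.log", "access.log"):
        # arcname stores a relative name instead of the temporary path
        tar.add(os.path.join(workdir, name), arcname=name)

# List members without extracting anything
with tarfile.open(archive_path, "r:xz") as tar:
    names = tar.getnames()

print(names)
```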

Error Handling

import lzma

try:
    with lzma.LZMAFile("corrupt.xz", "r") as f:
        data = f.read()
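except lzma.LZMAError as e:
    print(f"Corrupt or invalid LZMA data: {e}")
except EOFError:
    print("File is truncated: it ends before the end-of-stream marker")
except FileNotFoundError:
    print("File not found")

Common exceptions: LZMAError for corrupt or unrecognized data, EOFError for a truncated file read through LZMAFile, and FileNotFoundError for missing files.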
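
The same handling applies to in-memory data. In the sketch below, safe_decompress is a hypothetical helper; with lzma.decompress(), both unrecognized input and a truncated stream surface as LZMAError.

```python
import lzma

def safe_decompress(blob):
    """Return decompressed bytes, or None if the input is not a complete, valid stream.

    Hypothetical helper for illustration only.
    """
    try:
        return lzma.decompress(blob)
    except lzma.LZMAError as e:
        print(f"decompression failed: {e}")
        return None

valid = lzma.compress(b"hello " * 100)

ok = safe_decompress(valid)
garbage = safe_decompress(b"not an xz stream")
truncated = safe_decompress(valid[:-10])   # stream cut off before its footer
```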

Gotchas

Tiny inputs get bigger. LZMA headers and block structure overhead exceed savings on small inputs. For data under a few hundred bytes, compression may increase size.

Slow compression, fast decompression. LZMA decompression is relatively fast. If you’re compressing once and decompressing many times (like distribution archives), the upfront cost is worth it.

Decompression memory scales with preset. The preset is chosen when compressing, and higher presets use larger dictionaries that the decompressor must also allocate. If the data will be read on memory-constrained systems (embedded, small containers), compress it with preset 0-3.

.xz and .lzma are different formats. Python’s lzma module writes .xz by default. To write the legacy .lzma format, explicitly pass format=lzma.FORMAT_ALONE.
