Compressing Data with zlib
The zlib module is Python’s interface to the zlib compression library, which implements the DEFLATE algorithm. It’s the foundation for many other compression formats you’ll encounter in Python, including gzip and zip files. Understanding zlib gives you low-level control over compression for data storage, network transmission, or memory optimization.
Why Use zlib?
The zlib library is everywhere. It’s fast, portable, and produces good compression ratios for text data. When you need to compress data in memory—before saving to a file, sending over a network, or storing in a database—zlib is often the right tool.
Common use cases include:
- Compressing data before saving to disk or database
- Reducing network payload sizes for API requests
- Storing cached data more efficiently
- Working with formats that use zlib internally (gzip, zip, PNG, PDF)
Basic Compression and Decompression
The zlib module provides the core compress() and decompress() functions for in-memory operations:
import zlib
# Compress some data
original_data = b"This is a string of text that we want to compress."
compressed = zlib.compress(original_data)
print(f"Original size: {len(original_data)} bytes")
print(f"Compressed size: {len(compressed)} bytes")
print(f"Compression ratio: {len(compressed) / len(original_data):.2%}")
# Original size: 50 bytes
# Compressed size: about the same as the original (sometimes larger)
# Inputs this short rarely shrink: zlib adds a 2-byte header and a
# 4-byte Adler-32 trailer, and there is little redundancy to exploit
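One thing to keep in mind: compress() accepts bytes, not str, so text must be encoded first. A minimal sketch:
import zlib
text = "Strings must be encoded to bytes before compression: café"
compressed = zlib.compress(text.encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")
print(restored == text)  # True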
Text with repetition compresses better than random data:
import zlib
# Repetitive text compresses well
repetitive = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD" * 10
compressed = zlib.compress(repetitive)
print(f"Original: {len(repetitive)} bytes")
print(f"Compressed: {len(compressed)} bytes")
# Original: 320 bytes
# Compressed: 28 bytes
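For contrast, truly random bytes give DEFLATE nothing to exploit. A quick sketch, using os.urandom for incompressible input:
import os
import zlib
random_data = os.urandom(320)
compressed = zlib.compress(random_data)
print(f"Original: {len(random_data)} bytes")
print(f"Compressed: {len(compressed)} bytes")  # slightly more than 320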
Decompressing is straightforward:
import zlib
data = b"Some data to decompress"
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
print(decompressed) # b"Some data to decompress"
print(decompressed == data) # True
Compression Levels
The level parameter controls the trade-off between speed and compression ratio. Valid values are 0 (no compression) through 9 (maximum compression); -1 selects Z_DEFAULT_COMPRESSION, which is currently equivalent to level 6:
import zlib
data = b"The quick brown fox jumps over the lazy dog. " * 50
for level in [0, 1, 5, 9]:
    compressed = zlib.compress(data, level)
    print(f"Level {level}: {len(compressed)} bytes ({len(compressed)/len(data):.1%})")
Representative output (exact byte counts vary slightly with the zlib build):
Level 0: 2261 bytes (100.5%)
Level 1: 164 bytes (7.3%)
Level 5: 152 bytes (6.8%)
Level 9: 150 bytes (6.7%)
Level 0 performs no compression at all, so header overhead actually makes the output slightly larger than the input. Level 1 is fast and already compresses well; level 9 is slower but only marginally smaller. For most applications, level 6 (the default) or level 1 offers the best balance.
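When choosing a level for your own data, it helps to measure both size and speed. A minimal benchmark sketch (the payload here is illustrative):
import time
import zlib
data = b"The quick brown fox jumps over the lazy dog. " * 50_000
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"Level {level}: {len(compressed):>7} bytes in {elapsed * 1000:.1f} ms")
Note that the level matters only on the compression side: decompress() handles streams produced at any level.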
Working with Streams
For large data or streaming scenarios, use the incremental compression and decompression objects returned by compressobj() and decompressobj():
import zlib
# Compress data in chunks
compressor = zlib.compressobj(level=6)
chunk1 = b"First part of the data..."
chunk2 = b"Second part with more content."
chunk3 = b"Third and final part."
compressed_chunks = []
compressed_chunks.append(compressor.compress(chunk1))
compressed_chunks.append(compressor.compress(chunk2))
compressed_chunks.append(compressor.compress(chunk3))
compressed_chunks.append(compressor.flush())
compressed_data = b"".join(compressed_chunks)
print(f"Compressed: {len(compressed_data)} bytes")
Decompressing streams works similarly:
import zlib
decompressor = zlib.decompressobj()
# Decompress in chunks
result_chunks = []
result_chunks.append(decompressor.decompress(compressed_data))
result_chunks.append(decompressor.flush())
decompressed = b"".join(result_chunks)
print(decompressed)
This pattern is essential when working with network streams or files that don’t fit in memory.
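As a sketch of that pattern, here is a chunk-by-chunk file compressor; compress_file and the 64 KiB chunk size are illustrative choices, not a fixed API:
import zlib
def compress_file(src_path, dst_path, chunk_size=64 * 1024):
    """Compress src_path into dst_path without loading either into memory."""
    compressor = zlib.compressobj(level=6)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(compressor.compress(chunk))
        # flush() emits any data still buffered inside the compressor
        dst.write(compressor.flush())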
Practical Examples
Caching Compressed Data
import zlib
import pickle
def cache_compressed(cache, key, value, level=6):
    """Store compressed data in cache."""
    serialized = pickle.dumps(value)
    compressed = zlib.compress(serialized, level)
    cache[key] = compressed

def load_compressed(cache, key):
    """Retrieve and decompress data from cache."""
    compressed = cache.get(key)
    if compressed is None:
        return None
    serialized = zlib.decompress(compressed)
    return pickle.loads(serialized)
# Example usage (with a dict as cache)
my_cache = {}
cache_compressed(my_cache, "results", {"data": [1, 2, 3, 4, 5]})
result = load_compressed(my_cache, "results")
print(result) # {'data': [1, 2, 3, 4, 5]}
Compressing Before Network Transmission
import zlib
import json
def compress_payload(data):
    """Compress JSON data for network transmission."""
    json_data = json.dumps(data).encode("utf-8")
    compressed = zlib.compress(json_data, level=6)
    return compressed

def decompress_payload(compressed_data):
    """Decompress received data."""
    json_data = zlib.decompress(compressed_data)
    return json.loads(json_data)
# Example
payload = {"messages": ["hello", "world"] * 100}
compressed = compress_payload(payload)
print(f"Sent {len(compressed)} bytes instead of {len(json.dumps(payload))} bytes")
received = decompress_payload(compressed)
print(received == payload) # True
Creating zlib-Wrapped Protocol Messages
import zlib
import struct
def create_message(payload):
    """Create a zlib-compressed message with length prefix."""
    compressed = zlib.compress(payload, level=6)
    length = struct.pack(">I", len(compressed))
    return length + compressed

def read_message(data):
    """Read a length-prefixed zlib-compressed message."""
    length = struct.unpack(">I", data[:4])[0]
    compressed = data[4:4+length]
    return zlib.decompress(compressed)
# Example
message = create_message(b"Important data that needs compression")
print(f"Message length: {len(message)}")
payload = read_message(message)
print(payload) # b"Important data that needs compression"
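Note that read_message() assumes the entire message is already in one buffer. On a real socket, bytes arrive in arbitrary pieces, so you would first read exactly four length bytes and then exactly that many payload bytes. A sketch of that idea (recv_exact is a hypothetical helper, not part of the socket API):
import struct
import zlib
def recv_exact(sock, n):
    """Read exactly n bytes from a connected socket."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed mid-message")
        buf += part
    return buf

def read_message_from_socket(sock):
    """Read one length-prefixed, zlib-compressed message from a socket."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return zlib.decompress(recv_exact(sock, length))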
Handling Errors
Decompression can fail if the data is corrupted or was never compressed:
import zlib
# Try to decompress invalid data
try:
    zlib.decompress(b"not compressed data")
except zlib.error as e:
    print(f"Decompression failed: {e}")
# Wrap decompression in a helper when bad input is an expected case
def safe_decompress(data):
    """Decompress, returning None instead of raising on bad input."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None
result = safe_decompress(b"invalid data")
print(result) # None
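If you want a cheap way to check whether data even looks like a zlib stream before attempting decompression, the two-byte header defined by RFC 1950 can be inspected. A minimal sketch (a heuristic only, not a guarantee the stream is valid):
import zlib
def looks_like_zlib(data):
    """Heuristic check of the RFC 1950 zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble of CMF must be 8 (the deflate method), and the
    # 16-bit header value must be a multiple of 31
    return cmf & 0x0F == 8 and (cmf * 256 + flg) % 31 == 0
print(looks_like_zlib(zlib.compress(b"hello")))  # True
print(looks_like_zlib(b"not compressed data"))   # False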
You can also use decompressobj() to handle partial or streaming data more gracefully.
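For example, a decompressobj() can be fed a stream piecewise as it arrives:
import zlib
compressed = zlib.compress(b"Example payload " * 100)
decompressor = zlib.decompressobj()
chunks = []
# Feed the stream in small slices, as a socket might deliver it
for i in range(0, len(compressed), 16):
    chunks.append(decompressor.decompress(compressed[i:i + 16]))
chunks.append(decompressor.flush())
result = b"".join(chunks)
print(len(result))       # 1600
print(decompressor.eof)  # True once the end of the stream was reached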
Computing Checksums
zlib provides CRC-32 and Adler-32 checksum functions for data integrity:
import zlib
data = b"Hello, World!"
# Calculate CRC-32 checksum
crc = zlib.crc32(data)
print(f"CRC-32: {crc}")
# adler32 is faster to compute but weaker at detecting errors
adler = zlib.adler32(data)
print(f"Adler-32: {adler}")
The Adler-32 checksum is faster to compute but weaker at detecting errors, especially in short messages. Use CRC-32 when detection strength matters more than speed.
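Both functions also accept a running value as a second argument, so a checksum can be computed incrementally over chunks:
import zlib
chunks = [b"Hello, ", b"World", b"!"]
# Pass the previous result back in to continue the checksum
crc = 0
for chunk in chunks:
    crc = zlib.crc32(chunk, crc)
print(crc == zlib.crc32(b"Hello, World!"))  # True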
See Also
- gzip-module — File compression built on zlib
- zipfile-module — Reading and writing zip archives
- python-gzip-guide — Working with gzip files