statistics
The statistics module provides functions for calculating mathematical statistics of numeric data. It covers measures of central location (mean, median, mode) and measures of spread (variance, standard deviation). The module works with int, float, Decimal, and Fraction types.
This module is not a competitor to NumPy or SciPy. It is aimed at the level of graphing and scientific calculators, which makes it useful for everyday statistical calculations without adding heavy dependencies.
Syntax
import statistics
# Basic usage
statistics.mean(data)
statistics.median(data)
statistics.mode(data)
Averages and Central Location
mean()
The arithmetic mean (commonly called “average”) is the sum of data points divided by the count. It’s sensitive to outliers.
statistics.mean(data)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Numeric data to calculate mean from |
Returns: The arithmetic mean. Decimal and Fraction data keep their type; int data yields a float when the mean is not a whole number
Raises: StatisticsError if data is empty
import statistics
data = [1, 2, 3, 4, 4]
result = statistics.mean(data)
print(result) # 2.8
# Works with Decimals
from decimal import Decimal
data = [Decimal("0.5"), Decimal("0.75"), Decimal("0.625")]
print(statistics.mean(data)) # 0.625
The mean gives an unbiased estimate of the population mean when working with samples. However, it’s strongly affected by outliers. For a more robust measure, consider median().
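The outlier sensitivity described above can be seen in a minimal sketch. The salary figures here are invented purely for illustration:

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an extreme outlier.
salaries = [40, 42, 45, 48, 50, 400]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # stays near the bulk of the data

print(mean_salary > median_salary)  # True
print(median_salary)                # 46.5
```

A single extreme value moves the mean far from the typical values, while the median barely changes.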
median()
The median is the middle value when data is sorted. It’s robust against outliers and gives a better “typical” value when data contains extreme values.
statistics.median(data)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Numeric data to find median from |
Returns: The median value. If the number of data points is even, returns the average of the two middle values.
Raises: StatisticsError if data is empty
import statistics
# Odd number of points - returns middle value
print(statistics.median([1, 3, 5])) # 3
# Even number of points - returns average of middle two
print(statistics.median([1, 3, 5, 7])) # 4.0
When you have discrete data and want the median to be an actual data point rather than interpolated, use median_low() or median_high().
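As a short sketch of that difference, using the same even-length data as the median() example:

```python
import statistics

data = [1, 3, 5, 7]  # even number of points, so median() would interpolate

low = statistics.median_low(data)    # smaller of the two middle values
high = statistics.median_high(data)  # larger of the two middle values

print(low, high)  # 3 5
```

Both results are members of the original data, unlike the interpolated value returned by median().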
mode()
The mode is the most frequently occurring value. Along with multimode(), it is the only function in this module that also applies to nominal (non-numeric) data.
statistics.mode(data)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Discrete or nominal data |
Returns: The most common value
Raises: StatisticsError if data is empty
import statistics
# Most common number
print(statistics.mode([1, 1, 2, 3, 3, 3, 3, 4])) # 3
# Works with strings (nominal data)
print(statistics.mode(["red", "blue", "blue", "red", "green", "red"])) # red
If there are multiple modes with the same frequency, mode() returns the first one encountered. Use multimode() to get all modes.
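A minimal sketch of the tie behavior, assuming Python 3.8 or later (where multimode() is available); the color list is made up for illustration:

```python
import statistics

# "red" and "blue" both appear twice, so there are two modes.
colors = ["red", "blue", "blue", "red", "green"]

first_mode = statistics.mode(colors)      # first of the tied values encountered
all_modes = statistics.multimode(colors)  # every tied value, in first-seen order

print(first_mode)  # red
print(all_modes)   # ['red', 'blue']
```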
geometric_mean()
The geometric mean uses the product of values rather than their sum. It’s appropriate for data that represents rates or ratios.
statistics.geometric_mean(data)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Numeric data (must be positive) |
Returns: float - the geometric mean
Raises: StatisticsError if data is empty or contains a zero or a negative value
import statistics
# Growth rates example
rates = [1.05, 1.10, 1.08] # 5%, 10%, 8% growth
print(statistics.geometric_mean(rates)) # 1.0765... (approximately 7.65% average growth)
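One way to check why the geometric mean suits growth rates is to compound it over the same number of periods and compare with the actual total growth. This is only a sketch, and it assumes Python 3.8+ for math.prod():

```python
import math
import statistics

rates = [1.05, 1.10, 1.08]  # growth factors for three periods

g = statistics.geometric_mean(rates)
total_growth = math.prod(rates)  # actual compounded growth over three periods

# Applying g for three periods reproduces the compounded growth...
print(math.isclose(g ** len(rates), total_growth))          # True
# ...whereas the arithmetic mean overstates it.
print(statistics.mean(rates) ** len(rates) > total_growth)  # True
```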
harmonic_mean()
The harmonic mean is the reciprocal of the arithmetic mean of reciprocals. It’s appropriate for averaging rates or speeds.
statistics.harmonic_mean(data)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Real-valued numeric data |
| weights | sequence or None | None | Optional weights for each value |
Returns: float - the harmonic mean
Raises: StatisticsError if data is empty or contains negative values
import statistics
# Average speed example - car travels 10 km at 40 km/h, then 10 km at 60 km/h
speeds = [40, 60]
print(statistics.harmonic_mean(speeds)) # 48.0
# With weights - car travels 5 km at 40 km/h, then 30 km at 60 km/h
print(statistics.harmonic_mean([40, 60], weights=[5, 30])) # 56.0
Measures of Spread
variance() and stdev()
Variance measures how far data points spread from the mean. Standard deviation is the square root of variance, returning the measure to the original units.
statistics.variance(data, xbar=None)
statistics.stdev(data, xbar=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Numeric data |
| xbar | float or None | None | Known mean (optional, avoids recalculation) |
Returns: sample variance or standard deviation. The result is usually a float; variance() preserves Decimal and Fraction inputs
Raises: StatisticsError if data has fewer than 2 values
import statistics
data = [2, 4, 4, 4, 5, 5, 7, 9]
# Sample variance and standard deviation
print(statistics.variance(data)) # 4.571428571428571
print(statistics.stdev(data)) # 2.138...
# If you already know the mean, pass it to avoid recalculation
mean = statistics.mean(data)
print(statistics.variance(data, mean)) # Same result, but faster for large datasets
Use variance() and stdev() when working with a sample from a larger population. Use pvariance() and pstdev() when you have the entire population.
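To illustrate the difference, here is a small sketch comparing the sample and population versions on the same data. The sample functions divide by n - 1, the population functions by n:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # sum of squared deviations from the mean is 32

sample_var = statistics.variance(data)       # divides by n - 1 = 7
population_var = statistics.pvariance(data)  # divides by n = 8, giving 4
population_sd = statistics.pstdev(data)      # square root of 4, i.e. 2

print(sample_var > population_var)  # True
print(population_var == 4)          # True
print(population_sd == 2)           # True
```

The sample variance is always the larger of the two, and the gap shrinks as the dataset grows.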
quantiles()
Divides data into n continuous intervals with equal probability and returns the n - 1 cut points between them. Useful for percentile calculations.
statistics.quantiles(data, n=4, method='exclusive')
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | sequence or iterable | required | Numeric data |
| n | int | 4 | Number of equal-probability intervals (the result has n - 1 cut points) |
| method | str | 'exclusive' | 'exclusive' or 'inclusive' |
Returns: list of floats - the quantile boundaries
import statistics
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Quartiles: n=4 gives 3 cut points separating 4 groups
print(statistics.quantiles(data, n=4))
# [2.75, 5.5, 8.25]
# Deciles: n=10 gives 9 cut points separating 10 groups
print(statistics.quantiles(data, n=10))
# [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
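The method parameter changes the cut points. As a brief sketch on the same data: 'exclusive' assumes the population may contain values more extreme than the sample, while 'inclusive' treats the sample minimum and maximum as the population extremes.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Default method: cut points may sit closer to the ends of the data.
exclusive = statistics.quantiles(data, n=4, method="exclusive")
# Inclusive method: the data's min and max are taken as the 0th and 100th percentiles.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(exclusive)  # [2.75, 5.5, 8.25]
print(inclusive)  # [3.25, 5.5, 7.75]
```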
Common Patterns
Handling missing data (NaN values)
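Functions that sort their input, such as median() and quantiles(), can give unexpected results when NaN values are present, because NaN does not compare consistently with other numbers. Strip NaN values before processing: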
import statistics
from math import isnan
from itertools import filterfalse
data = [20.7, float('nan'), 19.2, 18.3, float('nan'), 14.4]
# Clean the data
clean_data = list(filterfalse(isnan, data))
print(statistics.median(clean_data)) # 18.75
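If the same cleanup is needed in several places, it can be wrapped in a small helper. The function name `without_nans` below is hypothetical, not part of the statistics module, and the sketch assumes float data:

```python
import statistics
from math import isnan

def without_nans(values):
    """Return a list of the values that are not NaN (hypothetical helper)."""
    return [x for x in values if not isnan(x)]

data = [20.7, float("nan"), 19.2, 18.3, float("nan"), 14.4]
clean = without_nans(data)

print(len(data) - len(clean))    # 2 (number of NaN values dropped)
print(statistics.median(clean))  # 18.75
```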
Using with different numeric types
The module preserves numeric types for most functions:
from decimal import Decimal
from fractions import Fraction
import statistics
# Decimals
data = [Decimal("1.1"), Decimal("2.2"), Decimal("3.3")]
print(statistics.mean(data)) # 2.2
# Fractions
data = [Fraction(1, 2), Fraction(3, 2), Fraction(5, 2)]
print(statistics.mean(data)) # 3/2
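The type handling can be inspected directly. The following sketch prints the result type of mean() for a few input types; the sample values are arbitrary and chosen only for illustration:

```python
from decimal import Decimal
from fractions import Fraction
import statistics

# Arbitrary sample values, one list per numeric type.
samples = {
    "int": [1, 2],
    "Decimal": [Decimal("1.5"), Decimal("2.5")],
    "Fraction": [Fraction(1, 3), Fraction(2, 3)],
}

results = {label: statistics.mean(values) for label, values in samples.items()}

for label, result in results.items():
    # Show which type came back for each kind of input.
    print(label, "->", type(result).__name__)
```

Decimal and Fraction inputs come back as the same type, while int data with a non-integer mean comes back as a float.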