scikit-learn Intro: Your First ML Model
scikit-learn is the go-to library for machine learning in Python. Its API is refreshingly consistent: every algorithm follows the same fit(X, y) / predict(X) pattern regardless of how complex the math underneath is. That consistency makes it easy to swap algorithms, compare models, and build pipelines without rewriting your code.
This tutorial walks through a complete classification example from raw data to predictions. No ML background required.
Installing scikit-learn
pip install scikit-learn
The Estimator Pattern
Every model in scikit-learn is an “estimator”. They all share the same interface:
model = SomeClassifier() # instantiate
model.fit(X_train, y_train) # learn from data
predictions = model.predict(X_test) # predict on new data
That’s it. The same three steps work for classification and regression. Clustering and dimensionality reduction reuse the fit step, but unsupervised estimators take no y, and transformers such as PCA expose transform(X) instead of predict(X).
X is your feature matrix — rows are samples, columns are features. y is your target vector — the labels you want to predict.
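To make the shapes concrete, here is a minimal sketch on a tiny made-up dataset. The numbers, and the choice of KNeighborsClassifier and DecisionTreeClassifier, are illustrative assumptions rather than part of this tutorial's example. It shows that swapping the algorithm leaves the fit and predict calls unchanged.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 6 samples (rows), 2 features (columns)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # one label per sample

print(X.shape, y.shape)  # (6, 2) (6,)

# Two unseen samples, one near each cluster
X_new = np.array([[1.0, 1.0], [5.0, 5.0]])

results = {}
for model in (KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)                       # identical call for both algorithms
    name = type(model).__name__
    results[name] = model.predict(X_new)  # identical call for both algorithms
    print(name, results[name])
```

Both classifiers should assign the first new sample to class 0 and the second to class 1, since the two clusters are far apart.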
A Complete Example: Iris Classification
The classic starter dataset. 150 flowers with 4 measurements, classified into 3 species.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
# Accuracy: 100.00%
train_test_split randomly partitions your data. test_size=0.2 reserves 20% for evaluation. random_state=42 makes it reproducible — without it, you’d get different splits every run.
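A quick way to check both claims is to run the split twice with the same seed and compare the results. This is a small sketch; the variable names split_a, split_b, and same_split are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed twice: the two splits should be identical
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Compare the test features of both runs
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)  # True
```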
Train-Test Splitting
Good ML practice separates data into train and test sets. You train on one portion and evaluate on another that the model has never seen. This tells you whether your model generalises or has merely memorised the training data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25, # 25% for testing
    train_size=0.75, # explicitly set train size (optional)
    random_state=0, # reproducibility seed
    stratify=y # keep class proportions equal in both sets
)
stratify=y is important when your classes are imbalanced. Without it, your test set might accidentally contain zero examples of a rare class.
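The effect is easiest to see on a deliberately imbalanced dataset. The sketch below builds a hypothetical one (95 samples of class 0 and 5 of class 1, invented for illustration) and counts the classes that land in each test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 95 samples of class 0, 5 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)

# Plain random split: the number of rare-class samples is left to chance
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: the 95/5 ratio is preserved in the test set
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

plain_counts = np.bincount(y_test_plain, minlength=2)
strat_counts = np.bincount(y_test_strat, minlength=2)
print("plain:     ", plain_counts)
print("stratified:", strat_counts)  # 19 of class 0, 1 of class 1
```

With stratification the 20 test samples always contain exactly one rare-class example. Without it the count depends on the seed.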
Choosing a Model
Start simple. For most tabular classification problems:
- LogisticRegression: fast, interpretable, works well as a baseline
- DecisionTreeClassifier: easy to visualise, prone to overfitting
- RandomForestClassifier: ensemble of trees, usually better accuracy
- SVC (Support Vector Machine): strong on medium-sized datasets
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
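Because all four share the estimator interface, comparing them takes only a loop. The following is a rough sketch on Iris, not a benchmark. The candidates dictionary and the hyperparameter values are illustrative choices, and a single small test set is too noisy to rank models reliably.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative settings; none of these are tuned
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    "SVC": SVC(random_state=42),
}

test_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                     # same call for every model
    test_scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {test_scores[name]:.2%}")
```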
Checking Feature Names
When you load from a dataset utility, you can inspect what each column means:
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']
This matters when you write preprocessing code or interpret model coefficients.
Inspecting a Trained Model
Once fitted, models store what they learned in attributes ending with _:
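# Refit the linear model from the Iris example so that coef_ is available
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(model.n_features_in_) # how many features the model expects
# 4
print(model.classes_) # class labels the model predicts
# [0 1 2]
# For linear models: coefficient weights
print(model.coef_) # shape: (n_classes, n_features), here (3, 4)
# feature_names_in_ is set only when the model was fitted on a pandas
# DataFrame with string column names; after fitting on NumPy arrays
# the attribute does not exist, so fall back to None here
print(getattr(model, "feature_names_in_", None))
# None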
Cross-Validation
A single train-test split might be lucky or unlucky. Cross-validation runs multiple splits and averages the scores:
from sklearn.model_selection import cross_val_score
model = LogisticRegression(max_iter=200, random_state=42)
scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")
# [1. 0.96666667 0.93333333 0.96666667 1. ]
print(f"Mean: {scores.mean():.2%}, Std: {scores.std():.2%}")
# Mean: 97.33%, Std: 2.49%
cv=5 runs 5-fold cross-validation. Each fold takes a different 20% as the test set. The mean and standard deviation tell you how stable the performance is.
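To see what happens under the hood, the loop below reproduces 5-fold cross-validation by hand. It is a sketch that relies on one documented behaviour: for a classifier, an integer cv makes cross_val_score use unshuffled StratifiedKFold. The names manual_scores and library_scores are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other 4 folds
    manual_scores.append(
        fold_model.score(X[test_idx], y[test_idx])  # accuracy on the held-out fold
    )
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=5)
print(manual_scores)
print(np.allclose(manual_scores, library_scores))  # True
```

Note that cross_val_score works on clones, so the model object you pass in stays unfitted afterwards.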
Evaluation Metrics
accuracy_score is the simplest metric, but on imbalanced data it can look good even when the model fails on a rare class. Per-class metrics show more:
from sklearn.metrics import classification_report, confusion_matrix
predictions = model.predict(X_test) # model must already be fitted on the training set
print(confusion_matrix(y_test, predictions))
# [[10 0 0]
# [ 0 9 0]
# [ 0 1 10]]
print(classification_report(y_test, predictions))
# precision recall f1-score support
#
# 0 1.00 1.00 1.00 10
# 1 0.90 1.00 0.95 9
# 2 1.00 0.91 0.95 11
#
# accuracy 0.97 30
# macro avg 0.97 0.97 0.97 30
# weighted avg 0.97 0.97 0.97 30
classification_report gives precision, recall, and F1-score per class. confusion_matrix shows which classes get confused with which.
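The report's numbers can be derived directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them with NumPy from the matrix printed above.

```python
import numpy as np

# Confusion matrix from the example: rows = true class, columns = predicted
cm = np.array([[10, 0, 0],
               [0, 9, 0],
               [0, 1, 10]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label in range(3):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
# class 0: precision=1.00 recall=1.00 f1=1.00
# class 1: precision=0.90 recall=1.00 f1=0.95
# class 2: precision=1.00 recall=0.91 f1=0.95

print(f"accuracy={accuracy:.2f}")
# accuracy=0.97
```

The single off-diagonal entry (row 2, column 1) is one virginica sample predicted as versicolor. It lowers precision for class 1 and recall for class 2.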
A Realistic Pipeline
Most real workflows include feature scaling and other preprocessing steps. Pipeline chains them with the model into a single estimator, so during cross-validation the preprocessing is refit on each training fold and never sees the held-out data:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
    ("scaler", StandardScaler()), # normalise features to mean=0, std=1
    ("classifier", SVC(random_state=42))
])
pipeline.fit(X_train, y_train)
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2%}")
StandardScaler subtracts the mean and divides by standard deviation for each feature. Many algorithms (SVM, logistic regression, neural networks) perform better when features are on similar scales.
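As a sanity check, the sketch below reproduces the scaler's arithmetic with plain NumPy, then passes the whole pipeline to cross_val_score so that scaling is refit inside every fold. The variable names scaled and manual are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# StandardScaler versus the formula (x - mean) / std, computed per column
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)
print(np.allclose(scaled, manual))  # True

# The pipeline behaves like one estimator, so it drops into cross_val_score
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(random_state=42)),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2%}")
```

Scaling the full dataset before cross-validation would let statistics from the held-out folds leak into training. Keeping the scaler inside the pipeline avoids that.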
Hyperparameter Tuning
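Models have settings, called hyperparameters, that are chosen before training and affect performance. GridSearchCV tries every combination in a grid and scores each one with cross-validation: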
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}
model = RandomForestClassifier(random_state=42)
search = GridSearchCV(
    model, param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
search.fit(X_train, y_train)
print(search.best_params_)
# {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}
print(f"Best CV score: {search.best_score_:.2%}")
# Best CV score: 97.50%
print(f"Test score: {search.score(X_test, y_test):.2%}")
# Test score: 100.00%
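n_jobs=-1 uses all CPU cores. scoring="accuracy" selects the combination with the highest cross-validated accuracy. Swap it for "f1_macro" when classes are imbalanced, or for "roc_auc" on binary problems (multiclass targets such as Iris need "roc_auc_ovr" or "roc_auc_ovo").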
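Grid search gets expensive quickly, so it helps to count the work before launching it. This sketch uses ParameterGrid from sklearn.model_selection to enumerate the grid above. The variable names combinations and cv_fits are made up for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

# Every combination GridSearchCV would try: 3 * 4 * 3
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 36

# With cv=5, each combination is trained and scored five times
cv_fits = len(combinations) * 5
print(cv_fits)  # 180

# Each entry is a plain dict of keyword arguments for the estimator
print(combinations[0])
```

With the default refit=True, GridSearchCV then trains the best combination once more on the full training set, which is why search.score and search.predict work afterwards.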
See Also
- /tutorials/scientific-python/ds-pandas-intro/ — DataFrames are what you feed into scikit-learn
- /tutorials/scientific-python/scipy-intro/ — scientific computing foundations that scikit-learn builds on