scikit-learn Intro: Your First ML Model

April 27, 2026 · 5 min read · Updated May 20, 2026 · beginner

scikit-learn · machine-learning · python · classification · model-fitting

scikit-learn is the go-to library for machine learning in Python. Its API is refreshingly consistent: every algorithm follows the same fit(X, y) / predict(X) pattern regardless of how complex the math underneath is. That consistency makes it easy to swap algorithms, compare models, and build pipelines without rewriting your code.

This tutorial walks through a complete classification example from raw data to predictions. No ML background required.

Installing scikit-learn

pip install scikit-learn
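
To confirm the install worked, a quick check along these lines should print the installed version. Note that the package is installed as scikit-learn but imported as sklearn.

```python
import sklearn

# The import name (sklearn) differs from the pip package name (scikit-learn)
print(sklearn.__version__)
```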

The Estimator Pattern

Every model in scikit-learn is an “estimator”. They all share the same interface:

model = SomeClassifier()       # instantiate
model.fit(X_train, y_train)    # learn from data
predictions = model.predict(X_test)  # predict on new data

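That’s it. The same instantiate, fit, predict steps work for every classifier and regressor. Unsupervised estimators follow the same shape with small variations: clustering models are fitted with fit(X) alone because there are no labels, and dimensionality-reduction models expose transform(X) instead of predict(X).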

X is your feature matrix — rows are samples, columns are features. y is your target vector — the labels you want to predict.
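
As a rough sketch of how the interface carries over to other estimator types, the snippet below loads the Iris data (introduced properly in the next section) and prints the shapes of X and y. KMeans and PCA are chosen here purely for illustration and are not part of the original walkthrough. They show the usual variations for unsupervised estimators: fit takes X only, and dimensionality reduction uses transform rather than predict.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 samples (rows), 4 features (columns)
print(y.shape)  # (150,): one label per sample

# Clustering: there are no labels, so fit takes X only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
cluster_ids = kmeans.predict(X)  # one cluster index per sample

# Dimensionality reduction: transform takes the place of predict
pca = PCA(n_components=2)
pca.fit(X)
X_2d = pca.transform(X)
print(X_2d.shape)  # (150, 2)
```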

A Complete Example: Iris Classification

The classic starter dataset. 150 flowers with 4 measurements, classified into 3 species.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
# Accuracy: 100.00%

train_test_split randomly partitions your data. test_size=0.2 reserves 20% for evaluation. random_state=42 makes it reproducible — without it, you’d get different splits every run.
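
To make the split concrete, here is a small sketch that checks the resulting sizes and the effect of the seed. It reuses the same arguments as the example above; the variable X_test_again is only an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Same seed, same split: repeating the call selects identical test rows
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print((X_test == X_test_again).all())  # True
```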

Train-Test Splitting

Good ML practice separates data into train and test sets. You train on one portion and evaluate on another that the model has never seen. This tells you whether your model generalises or has just memorised the training data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,       # 25% for testing
    train_size=0.75,      # explicitly set train size (optional)
    random_state=0,       # reproducibility seed
    stratify=y            # keep class proportions equal in both sets
)

stratify=y is important when your classes are imbalanced. Without it, your test set might accidentally contain zero examples of a rare class.
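
One way to see what stratify does is to count the samples per class in the test set with and without it. This is only a sketch: Iris is perfectly balanced (50 samples per class), so the effect is mild here and would be more pronounced on a dataset with a rare class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split: class counts in the test set can drift
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.25, random_state=0)

# Stratified split: class counts follow the proportions found in y
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(np.bincount(y_test_plain))  # samples per class, unstratified
print(np.bincount(y_test_strat))  # samples per class, stratified
```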

Choosing a Model

Start simple. For most tabular classification problems:

LogisticRegression — fast, interpretable, works well as a baseline
DecisionTreeClassifier — easy to visualise, prone to overfitting
RandomForestClassifier — ensemble of trees, usually better accuracy
SVC (Support Vector Machine) — strong on medium-sized datasets

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
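
Because every classifier shares the same interface, trying all four candidates is a short loop. The sketch below assumes the Iris split from the earlier example; the dictionary name candidates and the default SVC settings are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": SVC(),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    # score() returns mean accuracy on the given data for classifiers
    results[name] = candidate.score(X_test, y_test)
    print(f"{name}: {results[name]:.2%}")
```

On a dataset as small and clean as Iris the scores will be close together, so treat this as a template for comparison rather than evidence that one model is better.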

Checking Feature Names

When you load from a dataset utility, you can inspect what each column means:

print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']

This matters when you write preprocessing code or interpret model coefficients.
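
For instance, predict returns integer class labels, and target_names turns them back into species names. A minimal sketch, assuming a logistic regression fitted on the full Iris data purely for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
model = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# predict returns integer class labels...
predicted_ids = model.predict(iris.data[:3])

# ...which index into target_names to recover the species names
predicted_species = iris.target_names[predicted_ids]
print(predicted_ids, predicted_species)
```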

Inspecting a Trained Model

Once fitted, models store what they learned in attributes ending with _:

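# These examples assume model is a fitted LogisticRegression,
# such as the one trained in the Iris example.
print(model.n_features_in_)      # how many features the model expects
# 4

# feature_names_in_ is only set when the model was fitted on data with
# string column names (such as a pandas DataFrame), not on a NumPy array
print(getattr(model, "feature_names_in_", None))
# None

print(model.classes_)             # class labels the model predicts
# [0 1 2]

# For linear models only: coefficient weights (tree ensembles have no coef_)
print(model.coef_)                # shape: (n_classes, n_features)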
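
To read the coefficients, it helps to pair each weight with its feature name. The following sketch assumes a logistic regression fitted on the full Iris data, and contrasts it with a random forest, which exposes feature_importances_ rather than coef_.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

iris = load_iris()
linear_model = LogisticRegression(max_iter=200).fit(iris.data, iris.target)

# One row of weights per class, one weight per feature
for class_id, weights in zip(linear_model.classes_, linear_model.coef_):
    print(iris.target_names[class_id])
    for feature, weight in zip(iris.feature_names, weights):
        print(f"  {feature}: {weight:+.3f}")

# Tree ensembles have no coef_; they report feature importances instead
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)
print(hasattr(forest, "coef_"))  # False
for feature, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{feature}: {importance:.3f}")
```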

Cross-Validation

A single train-test split can be lucky or unlucky. Cross-validation runs multiple splits and averages the scores:

from sklearn.model_selection import cross_val_score

model = LogisticRegression(max_iter=200, random_state=42)

scores = cross_val_score(model, X, y, cv=5)
print(f"CV scores: {scores}")
# CV scores: [1.         0.96666667 0.93333333 0.96666667 1.        ]

print(f"Mean: {scores.mean():.2%}, Std: {scores.std():.2%}")
# Mean: 97.33%, Std: 2.49%

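cv=5 runs 5-fold cross-validation: the data is split into five parts, and each part serves once as the 20% test set while the model trains on the other four. For classifiers, scikit-learn stratifies the folds by default so each one keeps the class proportions. The mean and standard deviation tell you how stable the performance is.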
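
To see what happens under the hood, the loop below is a hand-written sketch of 5-fold cross-validation. It assumes StratifiedKFold without shuffling, which is what an integer cv resolves to for classifiers, and compares the result against cross_val_score.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

# The hand-written loop should reproduce the library's scores
print(np.allclose(manual_scores, library_scores))
```

Note that cross_val_score works on clones too, so the model you pass in is still unfitted afterwards.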

Evaluation Metrics

accuracy_score is the simplest metric, but it can be misleading when classes are imbalanced, because a model that always predicts the majority class still scores well:

from sklearn.metrics import classification_report, confusion_matrix

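# cross_val_score only fits clones, so fit the model itself before predicting
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Outputs below are illustrative; exact counts depend on the model and split
print(confusion_matrix(y_test, predictions))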
# [[10  0  0]
#  [ 0  9  0]
#  [ 0  1 10]]

print(classification_report(y_test, predictions))
#               precision    recall  f1-score   support
# 
#            0       1.00      1.00      1.00        10
#            1       0.90      1.00      0.95         9
#            2       1.00      0.91      0.95        11
# 
#     accuracy                           0.97        30
#    macro avg       0.97      0.97      0.97        30
# weighted avg       0.97      0.97      0.97        30

classification_report gives precision, recall, and F1-score per class. confusion_matrix shows which classes get confused with which: rows are true classes and columns are predicted classes.
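
The numbers in the report follow directly from the confusion matrix. As a worked sketch, the snippet below hard-codes the matrix shown above and recomputes the metrics with NumPy, using column sums for precision and row sums for recall.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [10, 0, 0],
    [0, 9, 0],
    [0, 1, 10],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(precision.round(2))  # approx. 1.00, 0.90, 1.00
print(recall.round(2))     # approx. 1.00, 1.00, 0.91
print(f1.round(2))         # approx. 1.00, 0.95, 0.95
print(round(float(accuracy), 2))  # 0.97
```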

A Realistic Pipeline

Most real workflows include feature scaling and other preprocessing steps. Pipeline chains them together with the model so that, during cross-validation, preprocessing is fitted on each training fold only and no information from the held-out fold leaks into it:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # normalise features to mean=0, std=1
    ("classifier", SVC(random_state=42))
])

pipeline.fit(X_train, y_train)
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2%}")

StandardScaler subtracts the mean and divides by standard deviation for each feature. Many algorithms (SVM, logistic regression, neural networks) perform better when features are on similar scales.
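
The sketch below checks that description by comparing StandardScaler with the same arithmetic done by hand, then passes a whole pipeline to cross_val_score so the scaler is refitted inside every fold. It assumes the Iris data and the same two pipeline steps as above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# StandardScaler on its own: each feature ends up with mean 0 and std 1
X_scaled = StandardScaler().fit_transform(X)

# The same result by hand: subtract the mean, divide by the standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))  # True

# The pipeline is itself an estimator, so it can be cross-validated directly;
# the scaler is then fitted on the training part of each fold only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(random_state=42)),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2%}")
```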

Hyperparameter Tuning

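Models have settings (hyperparameters) that are chosen before training and affect performance. GridSearchCV tries every combination in a grid and scores each one with cross-validation: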

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10]
}

model = RandomForestClassifier(random_state=42)

search = GridSearchCV(
    model, param_grid, cv=5, scoring="accuracy", n_jobs=-1
)
search.fit(X_train, y_train)

print(search.best_params_)
# {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}

print(f"Best CV score: {search.best_score_:.2%}")
# Best CV score: 97.50%

print(f"Test score: {search.score(X_test, y_test):.2%}")
# Test score: 100.00%

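n_jobs=-1 uses all CPU cores. scoring="accuracy" selects the combination with the highest share of correct predictions. Depending on your problem, swap it for "f1_macro", or for "roc_auc" on binary problems ("roc_auc_ovr" on multiclass ones such as Iris).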
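
Grid search cost grows quickly, so it can help to count the candidates before running it. The sketch below counts the combinations in the grid above with ParameterGrid, then runs a deliberately smaller grid (small_grid is an illustrative choice, not a tuned one) with scoring="f1_macro" to show the metric swap.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

# 3 * 4 * 3 = 36 candidates; with cv=5 that is 180 model fits
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * 5)  # 36 180

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller illustrative grid, scored with macro-averaged F1 instead of accuracy
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best CV f1_macro: {search.best_score_:.2%}")
```

After fitting, search behaves like the best model refitted on the whole training set, which is why search.score and search.predict work directly.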