Machine Learning Basics

Supervised and unsupervised learning, training, and inference for developers

Overview

Machine learning (ML) teaches computers to improve at a task through experience (data) instead of hand-written rules. You provide examples; a training algorithm adjusts the model's internal parameters to minimize error on those examples, with the goal of generalizing to unseen inputs.
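To make "adjusts internal parameters to minimize error" concrete, here is a minimal sketch using plain gradient descent on a one-feature linear model. The toy data, the learning rate, and the iteration count are all assumptions chosen for illustration; real libraries use more refined optimizers.

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # internal parameters, both start at zero
learning_rate = 0.05     # step size (a hyperparameter, chosen by hand here)


def mean_squared_error(w, b):
    """Average squared gap between predictions w*x + b and the labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_error = mean_squared_error(w, b)

for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step each parameter a little in the direction that lowers the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(w, b)
prediction = w * 10.0 + b   # apply the fitted model to an unseen input, x = 10

print(f"w={w:.3f} b={b:.3f} error {initial_error:.3f} -> {final_error:.6f}")
```

The loop never sees x = 10, yet the fitted parameters give a sensible prediction for it, which is the generalization the paragraph describes.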

Two dominant paradigms:

  • Supervised learning: each example has a label (spam/not spam, price, category). The model learns input → label mappings.
  • Unsupervised learning: no labels. The model finds structure—clusters, anomalies, or compressed representations.

Training fits the model; inference applies it to new data in production.
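One common way to keep training and inference separate is to save the fitted model to disk and load it where predictions are served. The sketch below assumes scikit-learn and the standard-library pickle module; the synthetic data and the temporary file path are placeholders, not a recommended storage layout.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# --- Training time (offline) ---
# Synthetic stand-in data; a real project would load its own dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
trained = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical location for the saved model.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
model_path.write_bytes(pickle.dumps(trained))

# --- Inference time (for example, at service startup) ---
# Only unpickle files you trust: pickle can execute arbitrary code on load.
loaded = pickle.loads(model_path.read_bytes())
new_input = X[:1]  # stands in for a fresh, unseen example
print(loaded.predict(new_input))
```

The loaded object makes the same predictions as the original, so the serving process never needs the training data or the training code path.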

Syntax / Usage

Typical supervised workflow:

1. Collect & clean data
2. Split: train / validation / test
3. Choose model & loss function
4. Train (optimize weights on train set)
5. Tune hyperparameters using validation set
6. Evaluate once on held-out test set
7. Deploy for inference
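Steps 2 through 6 can be sketched with scikit-learn as follows. The synthetic dataset, the 60/20/20 split, and the candidate values for the regularization strength C are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1 stand-in: synthetic data instead of a real, cleaned dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 2: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Steps 3-5: fit one model per candidate C and keep the best on validation.
best_model, best_val_accuracy = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_accuracy = candidate.score(X_val, y_val)
    if val_accuracy > best_val_accuracy:
        best_model, best_val_accuracy = candidate, val_accuracy

# Step 6: touch the test set exactly once, after tuning is finished.
test_accuracy = best_model.score(X_test, y_test)
print(f"validation={best_val_accuracy:.3f} test={test_accuracy:.3f}")
```

Keeping the test set out of the tuning loop is what makes the final number an honest estimate of performance on unseen data.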

Common supervised tasks:

Task            Output             Example
Classification  Discrete label     Fraud yes/no
Regression      Continuous number  House price
Ranking         Ordered list       Search results

Unsupervised examples: customer segmentation (clustering), dimensionality reduction (visualization), anomaly detection (unusual server metrics).
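As a rough illustration of clustering, the sketch below groups unlabeled 2-D points with scikit-learn's KMeans. The synthetic blobs stand in for real customer features, and the choice of three clusters is an assumption; in practice the number of segments usually has to be explored.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points around three centers; the true labels are discarded,
# so the algorithm only ever sees the raw coordinates.
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segment_ids = kmeans.fit_predict(points)  # one cluster index per point

print(sorted(set(segment_ids.tolist())))   # the distinct cluster indices
print(kmeans.cluster_centers_.shape)       # one 2-D center per cluster
```

No labels are passed to fit_predict; the segment indices come purely from the structure of the points.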

Libraries: scikit-learn (tabular ML), PyTorch / TensorFlow (deep learning), Hugging Face (pretrained models).

Examples

Training a simple classifier with scikit-learn:

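from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): swap in your own feature matrix and labels.
features, labels = load_breast_cancer(return_X_y=True)

# Hold out 20% for testing; stratify and fix the seed for a reproducible split.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
model = LogisticRegression(max_iter=5000)  # raised to help the solver converge
model.fit(X_train, y_train)

# Score on the held-out test set, never on the training data.
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))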

Inference in an API route:

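def predict_churn(user_features: list[float]) -> float:
    # `model` is assumed to be an already-fitted binary classifier on churn data.
    # predict_proba expects a 2-D input, so the single row is wrapped in a list.
    probability = model.predict_proba([user_features])[0][1]  # P(class 1)
    return float(probability)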

Common Mistakes

  • Evaluating on the training data and reporting the resulting high accuracy, which hides overfitting
  • Leaking test-set information into feature engineering or tuning
  • Using accuracy alone on imbalanced datasets (if 99% of cases are non-fraud, a model that always predicts non-fraud scores 99% and catches nothing)
  • Deploying without monitoring for data drift (inputs change over time)
  • Retraining on production logs without privacy review or label quality checks
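The imbalanced-accuracy pitfall above is easy to reproduce with a few lines of arithmetic. The counts below are made up for illustration, and the "model" is a deliberately trivial one that always predicts the majority class.

```python
# 1,000 transactions, 10 of them fraudulent (label 1). Numbers are made up.
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts not-fraud (label 0).
y_pred = [0] * 1000

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```

Reporting recall (and precision) alongside accuracy exposes the problem immediately: the model is right 99% of the time and still misses every fraud case.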

See Also

ai-fundamentals, neural-networks, embeddings