Machine Learning Basics
Supervised and unsupervised learning, training, and inference for developers
Overview
Machine learning (ML) teaches computers to improve at a task through experience (data) instead of hand-written rules. You provide examples; an algorithm adjusts internal parameters to minimize error on those examples. The goal is a model that generalizes to unseen inputs rather than one that only memorizes the examples it was shown.
Two dominant paradigms:
- Supervised learning: each example has a label (spam/not spam, price, category). The model learns input → label mappings.
- Unsupervised learning: no labels. The model finds structure—clusters, anomalies, or compressed representations.
Training fits the model; inference applies it to new data in production.
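To make "adjusts internal parameters to minimize error" concrete, here is a minimal sketch that fits a one-variable linear model by gradient descent and then uses it for inference. The four data points, the learning rate, the step count, and the `predict` helper are illustrative assumptions, not a recommended setup.

```python
# Training examples (input, label); generated from y = 2x + 1 for illustration.
examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

w, b = 0.0, 0.0        # internal parameters, arbitrary starting values
learning_rate = 0.05   # assumed value; too large a rate would diverge

# Training: repeatedly nudge w and b downhill on the mean squared error.
for _ in range(5000):
    n = len(examples)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b


def predict(x):
    """Inference: apply the learned parameters to a new input."""
    return w * x + b


print(round(w, 3), round(b, 3))
print(round(predict(10.0), 2))
```

After training, `w` and `b` should sit very close to 2 and 1, the values used to generate the data, and `predict` works on inputs that never appeared in the examples.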
Syntax / Usage
Typical supervised workflow:
1. Collect & clean data
2. Split: train / validation / test
3. Choose model & loss function
4. Train (optimize weights on train set)
5. Tune hyperparameters using validation set
6. Evaluate once on held-out test set
7. Deploy for inference
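Steps 2 through 6 above can be sketched without any library. The example below is an assumption-laden illustration: the synthetic dataset, the 60/20/20 split, the 1-D k-nearest-neighbors model, and the helper names `make_dataset`, `knn_predict`, and `accuracy` are all hypothetical. In practice a library such as scikit-learn would handle most of this.

```python
import random


def make_dataset(n, seed=0):
    """Synthetic 1-D data: label is 1 when x > 0.5, with about 10% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y  # label noise
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd, so no ties)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return int(votes * 2 > k)


def accuracy(train, eval_set, k):
    """Fraction of eval_set points whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)


data = make_dataset(200)

# Step 2: 60% train, 20% validation, 20% test (data is generated in random order).
train, val, test = data[:120], data[120:160], data[160:]

# Steps 3-5: the model is k-NN; choose the hyperparameter k on the validation set.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# Step 6: use the test set exactly once, with the chosen k.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

The point of the three-way split is that `best_k` is chosen without ever looking at `test`, so the final number is an honest estimate of performance on unseen data.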
Common supervised tasks:
| Task | Output | Example |
|---|---|---|
| Classification | Discrete label | Fraud yes/no |
| Regression | Continuous number | House price |
| Ranking | Ordered list | Search results |
Unsupervised examples: customer segmentation (clustering), dimensionality reduction (visualization), anomaly detection (unusual server metrics).
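As a rough illustration of the anomaly-detection case, the sketch below learns what "normal" looks like from unlabeled server readings and flags new readings that fall far outside it. The readings, the z-score approach, the threshold of 3.0, and the `is_anomalous` helper are assumptions chosen for brevity; real systems often use more robust methods.

```python
from statistics import mean, stdev

# Unlabeled baseline: CPU utilization (0.0 to 1.0) from a period assumed to be normal.
baseline = [0.31, 0.29, 0.33, 0.30, 0.32, 0.28, 0.31, 0.30]

# "Training": summarize the baseline with its mean and sample standard deviation.
center = mean(baseline)
spread = stdev(baseline)


def is_anomalous(reading, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the baseline mean."""
    return abs(reading - center) / spread > threshold


# "Inference": score readings that were not part of the baseline.
new_readings = [0.30, 0.95, 0.33]
flags = [is_anomalous(r) for r in new_readings]
print(flags)
```

No reading was ever labeled as normal or anomalous; the structure comes entirely from the distribution of the data.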
Libraries: scikit-learn (tabular ML), PyTorch / TensorFlow (deep learning), Hugging Face (pretrained models).
Examples
Training a simple classifier with scikit-learn:
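from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# `features` (2-D array-like) and `labels` (1-D array-like) are assumed to be loaded already.
# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)  # training: fit weights on the training rows only
predictions = model.predict(X_test)  # inference on rows the model has not seen
print(accuracy_score(y_test, predictions))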
Inference in an API route:
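def predict_churn(user_features: list[float]) -> float:
    # `model` is the fitted classifier from the training example above.
    # predict_proba expects a 2-D input, so the single row is wrapped in a list.
    # Column 1 is the second class in model.classes_ (assumed here to be "churned").
    probability = model.predict_proba([user_features])[0][1]
    return float(probability)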
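An inference route usually receives untrusted input, so it can help to validate the payload before calling the model. The sketch below is one possible shape, not a prescribed pattern: `EXPECTED_FEATURES`, `StubModel`, and `predict_churn_checked` are hypothetical names, and the stub stands in for a real fitted classifier so the example runs on its own.

```python
import math

EXPECTED_FEATURES = 3  # hypothetical: number of features the model was trained on


class StubModel:
    """Stand-in for a fitted classifier; returns a fixed probability pair per row."""

    def predict_proba(self, rows):
        return [[0.8, 0.2] for _ in rows]


model = StubModel()


def predict_churn_checked(user_features: list[float]) -> float:
    """Validate the request payload, then run inference."""
    if len(user_features) != EXPECTED_FEATURES:
        raise ValueError(
            f"expected {EXPECTED_FEATURES} features, got {len(user_features)}"
        )
    if not all(math.isfinite(value) for value in user_features):
        raise ValueError("features must be finite numbers")
    return float(model.predict_proba([user_features])[0][1])


print(predict_churn_checked([0.4, 12.0, 3.0]))
```

Rejecting malformed input early turns a confusing model error, or a silently wrong prediction, into a clear client-facing failure.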
Common Mistakes
- Reporting accuracy measured on the training data (this hides overfitting and overstates real-world performance)
- Leaking test-set information into feature engineering or tuning
- Using accuracy alone on imbalanced datasets (with 99% non-fraud, a model that always predicts "not fraud" scores 99% accuracy and catches no fraud)
- Deploying without monitoring for data drift (inputs change over time)
- Retraining on production logs without privacy review or label quality checks
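The imbalanced-dataset mistake is easy to reproduce with a few lines of arithmetic. This sketch assumes 1,000 transactions with a 1% fraud rate and a degenerate "model" that always predicts non-fraud; the numbers are made up for illustration.

```python
# 1,000 transactions, 10 of them fraudulent (label 1), the rest legitimate (label 0).
labels = [1] * 10 + [0] * 990

# A degenerate "model" that always predicts non-fraud.
predictions = [0] * len(labels)

# Accuracy: fraction of all predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: fraction of actual fraud cases the model caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints "accuracy = 0.99, recall = 0.00"
```

Accuracy looks excellent while recall shows the model never detects fraud, which is why metrics such as precision and recall are typically reported alongside accuracy on imbalanced data.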
See Also
ai-fundamentals neural-networks embeddings