🤖 Machine Learning · scikit-learn · Python

Classification & Regression
with scikit-learn

Two classic sklearn toy datasets, two different ML paradigms. Built a K-Nearest Neighbors classifier on the Iris dataset to predict flower species from four sepal and petal measurements, then trained a linear regression model on the Diabetes dataset to forecast disease progression from BMI. Real data, real models: explored, evaluated, and visualized from scratch.

150 Iris Samples
442 Diabetes Samples
96% KNN Accuracy
2 ML Models
R² 0.34 Regression Score

Iris Flower Classifier

150 samples across three iris species (Setosa, Versicolor, Virginica), each described by four measurements. Used K-Nearest Neighbors (k=3) to classify species. Toggle axes below to explore how different feature combinations separate the clusters — petal dimensions are the clearest signal.

[Interactive scatter plot: selectable X and Y axes. Legend: Setosa (50), Versicolor (50), Virginica (50)]

KNN Confusion Matrix (rows: actual, columns: predicted)

            Setosa   Versic.   Virgin.
Setosa        50        0         0
Versic.        0       47         3
Virgin.        0        3        47
✓ 96% Accuracy (144/150 correct), scored on the same 150 samples used to fit the model

Only Versicolor ↔ Virginica get confused — those two species overlap in petal space. Setosa is perfectly separable.
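One way to check the separability claim above is to compare per-species petal-length ranges. The following is a minimal sketch, assuming column 2 of `iris.data` is petal length (as listed in `iris.feature_names`); the variable names are illustrative and not taken from the project source.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column 2 is petal length (cm), per iris.feature_names
petal_length = X[:, 2]

# Minimum and maximum petal length for each species
ranges = {}
for label, name in enumerate(iris.target_names):
    values = petal_length[y == label]
    ranges[str(name)] = (values.min(), values.max())
    print(f"{name:<12} {values.min():.1f} to {values.max():.1f} cm")

# Setosa's longest petal is shorter than versicolor's shortest,
# while the versicolor and virginica ranges overlap
setosa_separable = ranges["setosa"][1] < ranges["versicolor"][0]
species_overlap = ranges["versicolor"][1] > ranges["virginica"][0]
print("Setosa separable on petal length:", setosa_separable)
print("Versicolor/virginica ranges overlap:", species_overlap)
```

A single feature is a simplification: the classifier uses all four measurements, so this only illustrates why the errors concentrate between the two overlapping species.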


Diabetes Disease Progression

442 patient records with 10 baseline features (age, sex, BMI, blood pressure, and 6 serum measurements). Fitted a linear regression model using BMI as the predictor, the single feature with the strongest correlation to one-year disease progression. The model explains about 34% of the variance in the data it was fitted on (R² = 0.34), which is solid for a single-feature linear fit on noisy medical data.
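The choice of BMI can be checked by ranking features on their correlation with the target. This is a minimal sketch, assuming Pearson correlation is the measure meant by "strongest correlation"; it also shows that for a one-feature least-squares fit with an intercept, R² equals the squared correlation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Pearson correlation of each feature with the target
correlations = {
    name: np.corrcoef(X[:, i], y)[0, 1]
    for i, name in enumerate(diabetes.feature_names)
}

# Rank features by absolute correlation, strongest first
ranked = sorted(correlations, key=lambda n: abs(correlations[n]), reverse=True)
strongest = ranked[0]
print("Strongest single feature:", strongest)

# Fit a one-feature regression on that column and compare R² with r²
idx = diabetes.feature_names.index(strongest)
feature = X[:, idx].reshape(-1, 1)
r_squared = LinearRegression().fit(feature, y).score(feature, y)
print(f"r^2 = {correlations[strongest] ** 2:.4f}, R^2 = {r_squared:.4f}")
```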

R² Score
0.344
Model explains 34.4% of variance
Coefficient
949.4
+1 scaled BMI unit → +949 progression units (≈ +45 per 1 std of BMI)
Intercept
152.1
Baseline at mean BMI
Feature: s6
Blood Sugar
Serum measurement — glucose level proxy
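Reading the coefficient depends on how the bundled features are scaled. As a hedged sketch: scikit-learn documents the diabetes features as mean-centered and scaled so that each column's sum of squares is 1, which means one unit of the BMI column is much larger than one standard deviation. The conversion below uses illustrative variable names.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
bmi = diabetes.data[:, 2].reshape(-1, 1)
y = diabetes.target

# Each column is scaled so its sum of squares is about 1,
# so its standard deviation is far below 1
sum_of_squares = float(np.sum(bmi ** 2))
bmi_std = float(bmi.std())
print(f"Sum of squares of BMI column: {sum_of_squares:.4f}")
print(f"Std of scaled BMI column:     {bmi_std:.4f}")

reg = LinearRegression().fit(bmi, y)
slope = float(reg.coef_[0])

# Expected change in the target for a one-standard-deviation rise in BMI
per_std = slope * bmi_std
print(f"Slope per scaled unit:   {slope:.1f}")
print(f"Change per 1 std of BMI: {per_std:.1f}")

# Because BMI is mean-centered, the intercept sits at the target mean
print(f"Intercept: {reg.intercept_:.1f}, target mean: {y.mean():.1f}")
```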

The Python

Full working implementations — classification and regression, top to bottom.

# Iris Classification with K-Nearest Neighbors
# Dataset: 150 samples, 3 species, 4 features

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# Shape of the data
print("Data shape:", X.shape)       # (150, 4)
print("Target shape:", y.shape)     # (150,)
print("Target names:", iris.target_names)
# → ['setosa' 'versicolor' 'virginica']

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
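knn.fit(X, y)
predicted = knn.predict(X)

# Accuracy on the training data (the same samples used to fit)
print(f"Training accuracy: {knn.score(X, y):.2%}")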

# First 10 predicted vs expected (using species names)
print("\nFirst 10 results:")
for i in range(10):
    print(f"  Predicted: {iris.target_names[predicted[i]]:<12}"
          f"Expected: {iris.target_names[y[i]]}")

# Values the model got wrong
wrong = [(i, iris.target_names[predicted[i]], iris.target_names[y[i]])
         for i in range(len(y)) if predicted[i] != y[i]]

print(f"\nWrong predictions ({len(wrong)}):")
for idx, pred, exp in wrong:
    print(f"  Index {idx}: Predicted '{pred}', Expected '{exp}'")

# Confusion matrix visualization
cm = confusion_matrix(y, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=iris.target_names)
disp.plot(cmap='Blues')
plt.title("Iris KNN Confusion Matrix (k=3)")
plt.tight_layout()
plt.show()

# Diabetes Regression with Linear Regression
# Dataset: 442 patients, 10 features, target = 1-year disease progression

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import datasets

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# How many samples and features?
print("Samples:", X.shape[0])   # 442
print("Features:", X.shape[1])  # 10

# Feature s6 represents blood glucose level (serum measurement)
print("Feature names:", diabetes.feature_names)

# Isolate BMI (feature index 2) — strongest single predictor
bmi = X[:, 2].reshape(-1, 1)

# Fit linear regression
reg = LinearRegression()
reg.fit(bmi, y)

# Print coefficient and intercept
print(f"Coefficient: {reg.coef_[0]:.2f}")    # 949.44
print(f"Intercept:   {reg.intercept_:.2f}")  # 152.13
print(f"R² Score:    {reg.score(bmi, y):.4f}") # 0.3439

# Scatterplot with regression line
bmi_range = np.linspace(bmi.min(), bmi.max(), 100).reshape(-1, 1)
y_pred = reg.predict(bmi_range)

plt.figure(figsize=(8, 5))
plt.scatter(bmi, y, color='steelblue', alpha=0.4, s=20, label='Actual')
plt.plot(bmi_range, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("BMI (mean-centered, scaled)")
plt.ylabel("Disease Progression (1 year)")
plt.title("Diabetes: BMI vs Disease Progression")
plt.legend()
plt.tight_layout()
plt.show()
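The iris script above scores the classifier on the same rows it was fitted on. A hedged sketch of one alternative, not part of the original scripts, is 5-fold cross-validation over a few values of k. The candidate list `(1, 3, 5, 7)` is an assumption for illustration, not the values tried in the project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {cv_scores[k]:.3f}")

# Pick the k with the highest mean held-out accuracy
best_k = max(cv_scores, key=cv_scores.get)
print("Best k by cross-validation:", best_k)
```

Each fold is scored on samples the model did not see during fitting, so these numbers are a fairer estimate of how the classifier generalizes than the training accuracy.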

By the Numbers

Business Problem
Healthcare data is abundant but often underused. The challenge: can a simple model trained on basic patient measurements predict who is at higher risk for diabetes complications? And for classification tasks like species identification, how cleanly can geometric measurements separate distinct categories — without any deep learning overhead?
One-Sentence Summary
Two sklearn toy datasets, two modeling approaches. Trained a K-Nearest Neighbors classifier (k=3) on the Iris dataset, reaching 96% accuracy on the training data, and a linear regression model on the Diabetes dataset with R² = 0.344. Explored confusion matrices, coefficients, intercepts, and feature importance through visualization.
Tools & Libraries
scikit-learn · Python 3.11 · matplotlib · numpy · KNeighborsClassifier · LinearRegression · ConfusionMatrixDisplay
Key Features
KNN species classifier with interactive axis toggling to visualize feature separation. Confusion matrix revealing where versicolor and virginica overlap. Linear regression isolating BMI as the strongest diabetes predictor. Complete metric reporting: accuracy, coefficient, intercept, R², and misclassification list.
My Role
Sole developer — loaded and explored both datasets from scratch, chose appropriate model architectures, tuned hyperparameters (k value), interpreted all model outputs, and built the visualizations. Worked through the feature selection decision for the diabetes regression (BMI vs trying all 10 features).
Biggest Challenge
Understanding why the confusion matrix showed Versicolor ↔ Virginica errors even at 96% accuracy required going back to the scatter plots. Plotting petal length against petal width makes the overlap visible: the two species genuinely share measurement space. That was the moment sklearn clicked: models are only as good as the separability in your data.
What I Learned
The difference between classification and regression isn't just syntax — it's a fundamentally different question being asked. I also learned to read R² critically: 0.34 on noisy medical data with one feature is actually meaningful, not "bad." Feature selection matters more than model complexity at this scale.
Course Context
Built for AI: Principles and Application (4V98) at Baylor University. This project introduced supervised machine learning concepts using scikit-learn's built-in datasets — chosen specifically because they isolate modeling skill from data-cleaning noise. The iris dataset is a classification benchmark; the diabetes dataset introduces real-world regression complexity with overlapping, correlated features.
GitHub & Demo
GitHub repository available on request. Built in VS Code with GitHub Copilot. Full source includes both model scripts, the interactive chart visualizations embedded on this page, and inline comments explaining each sklearn step.
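The feature selection decision mentioned under My Role (BMI alone versus all 10 features) can be sketched as a side-by-side comparison. This is an illustrative sketch, not the project's code. The cross-validation step is an added assumption, included because in-sample R² can only stay equal or rise as features are added.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
bmi = X[:, [2]]  # keep BMI as a 2-D column

# In-sample R²: fit and score on the same 442 rows
r2_bmi = LinearRegression().fit(bmi, y).score(bmi, y)
r2_all = LinearRegression().fit(X, y).score(X, y)

# Mean 5-fold cross-validated R² as a rough out-of-sample check
cv_bmi = cross_val_score(LinearRegression(), bmi, y, cv=5, scoring="r2").mean()
cv_all = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

print(f"BMI only:    in-sample R^2 = {r2_bmi:.3f}, CV R^2 = {cv_bmi:.3f}")
print(f"10 features: in-sample R^2 = {r2_all:.3f}, CV R^2 = {cv_all:.3f}")
```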