Scikit-Learn ML Models

Dataset 1 · Classification

Iris Flower Classifier

150 samples across three iris species (Setosa, Versicolor, Virginica), each described by four measurements. Used K-Nearest Neighbors (k=3) to classify species. Toggle axes below to explore how different feature combinations separate the clusters — petal dimensions are the clearest signal.
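The claim that petal dimensions separate the species best can be checked numerically as well as visually. The sketch below scores every two-feature combination with a k=3 classifier. The 5-fold cross-validation and the `pair_scores` dictionary are assumptions made for this illustration, not part of the original project.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Score every two-feature combination with 5-fold cross-validation
# (the fold count is an assumption chosen for this sketch).
pair_scores = {}
for i, j in combinations(range(X.shape[1]), 2):
    knn = KNeighborsClassifier(n_neighbors=3)
    scores = cross_val_score(knn, X[:, [i, j]], y, cv=5)
    pair_scores[(iris.feature_names[i], iris.feature_names[j])] = scores.mean()

# Print the pairs from highest to lowest mean accuracy.
for pair, score in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
    print(f"{pair[0]:<18} + {pair[1]:<18} mean accuracy: {score:.3f}")
```

Pairs that include a petal measurement should land near the top of the printed list, while the sepal-only pair scores noticeably lower.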

Interactive scatter plot: selectable X and Y axes. Legend: Setosa (50), Versicolor (50), Virginica (50).

KNN Confusion Matrix

Row and column labels: Setosa, Versicolor, Virginica.

✓ 96% Accuracy: 144/150 correct (measured on the same 150 samples the model was fitted on)

Only Versicolor ↔ Virginica get confused — those two species overlap in petal space. Setosa is perfectly separable.
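One way to confirm which species get confused is to read the off-diagonal cells of the confusion matrix directly. This is a minimal sketch that refits the same k=3 model on the full dataset and lists every non-zero off-diagonal entry. The loop and the `accuracy` variable are illustrative additions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Fit and predict on the same data, so this is training accuracy.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
cm = confusion_matrix(y, knn.predict(X))

# Rows are true species, columns are predicted species.
for true_idx, pred_idx in zip(*np.nonzero(cm)):
    if true_idx != pred_idx:
        print(f"{iris.target_names[true_idx]} predicted as "
              f"{iris.target_names[pred_idx]}: {cm[true_idx, pred_idx]}")

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(f"Training accuracy: {accuracy:.2%}")
```

The setosa row and column should contain no off-diagonal counts, which is what "perfectly separable" means in matrix form.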

Dataset 2 · Regression

Diabetes Disease Progression

442 patient records with 10 baseline features (age, sex, BMI, blood pressure, and 6 serum measurements). Fitted a linear regression model using BMI as the predictor — the single feature with the strongest correlation to one-year disease progression. The model explains about 34% of the variance (R² = 0.34), which is solid for a single-feature linear fit on noisy medical data.

R² Score

0.344

Model explains 34.4% of variance
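For a single-feature linear fit, R² equals the squared Pearson correlation between the feature and the target, which ties the "strongest correlation" claim to the reported score. The sketch below ranks the features by correlation and recomputes R² by hand as 1 − SS_res / SS_tot. The names `correlations` and `r2_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Pearson correlation of each feature with the target.
correlations = {
    name: np.corrcoef(X[:, idx], y)[0, 1]
    for idx, name in enumerate(diabetes.feature_names)
}
best_feature = max(correlations, key=lambda name: abs(correlations[name]))
print("Strongest correlation:", best_feature)

# R^2 = 1 - SS_res / SS_tot, computed by hand for the BMI-only fit.
bmi = X[:, [diabetes.feature_names.index("bmi")]]
reg = LinearRegression().fit(bmi, y)
residuals = y - reg.predict(bmi)
r2_manual = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(f"R^2 (manual):        {r2_manual:.4f}")
print(f"R^2 (reg.score):     {reg.score(bmi, y):.4f}")
print(f"Correlation squared: {correlations['bmi'] ** 2:.4f}")
```

All three printed values should agree, since they are the same quantity computed three ways.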

Coefficient

949.4

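+1 unit of scaled BMI → +949 progression units (about +45 per standard deviation, because sklearn scales each feature by its standard deviation times √n)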

Intercept

152.1

Baseline at mean BMI
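The coefficient and intercept are easier to interpret once the feature scaling is taken into account. `load_diabetes` returns each column mean-centered and scaled to unit sum of squares, so one unit of the scaled column is much larger than one standard deviation. The sketch below converts the coefficient to a per-standard-deviation effect and checks that the intercept matches the mean target. The names `bmi_std` and `effect_per_std` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
bmi = diabetes.data[:, [2]]  # column 2 is BMI
y = diabetes.target

reg = LinearRegression().fit(bmi, y)

# Population standard deviation of the scaled BMI column.
bmi_std = bmi.std()
# Expected change in the target for a one-standard-deviation rise in BMI.
effect_per_std = reg.coef_[0] * bmi_std

print(f"Std of scaled BMI column: {bmi_std:.4f}")
print(f"1 / sqrt(n_samples):      {1 / np.sqrt(len(y)):.4f}")
print(f"Change per +1 std of BMI: {effect_per_std:.1f}")
print(f"Intercept: {reg.intercept_:.2f}   Mean target: {y.mean():.2f}")
```

Because the feature is mean-centered, the intercept is the prediction at average BMI, which coincides with the average progression score.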

Feature: s6

Blood Sugar

Serum measurement — glucose level proxy

Source Code

The Python

Full working implementations — classification and regression, top to bottom.

# Iris Classification with K-Nearest Neighbors
# Dataset: 150 samples, 3 species, 4 features

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# Shape of the data
print("Data shape:", X.shape)       # (150, 4)
print("Target shape:", y.shape)     # (150,)
print("Target names:", iris.target_names)
# → ['setosa' 'versicolor' 'virginica']

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
predicted = knn.predict(X)

# First 10 predicted vs expected (using species names)
print("\nFirst 10 results:")
for i in range(10):
    print(f"  Predicted: {iris.target_names[predicted[i]]:<12}"
          f"Expected: {iris.target_names[y[i]]}")

# Values the model got wrong
wrong = [(i, iris.target_names[predicted[i]], iris.target_names[y[i]])
         for i in range(len(y)) if predicted[i] != y[i]]

print(f"\nWrong predictions ({len(wrong)}):")
for idx, pred, exp in wrong:
    print(f"  Index {idx}: Predicted '{pred}', Expected '{exp}'")

# Confusion matrix visualization
cm = confusion_matrix(y, predicted)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=iris.target_names)
disp.plot(cmap='Blues')
plt.title("Iris KNN Confusion Matrix (k=3)")
plt.tight_layout()
plt.show()

# Diabetes Regression with Linear Regression
# Dataset: 442 patients, 10 features, target = 1-year disease progression

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import datasets

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# How many samples and features?
print("Samples:", X.shape[0])   # 442
print("Features:", X.shape[1])  # 10

# Feature s6 represents blood glucose level (serum measurement)
print("Feature names:", diabetes.feature_names)

# Isolate BMI (feature index 2) — strongest single predictor
bmi = X[:, 2].reshape(-1, 1)

# Fit linear regression
reg = LinearRegression()
reg.fit(bmi, y)

# Print coefficient and intercept
print(f"Coefficient: {reg.coef_[0]:.2f}")    # 949.44
print(f"Intercept:   {reg.intercept_:.2f}")  # 152.13
print(f"R² Score:    {reg.score(bmi, y):.4f}") # 0.3439

# Scatterplot with regression line
bmi_range = np.linspace(bmi.min(), bmi.max(), 100).reshape(-1, 1)
y_pred = reg.predict(bmi_range)

plt.figure(figsize=(8, 5))
plt.scatter(bmi, y, color='steelblue', alpha=0.4, s=20, label='Actual')
plt.plot(bmi_range, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel("BMI (mean-centered, scaled)")
plt.ylabel("Disease Progression (1 year)")
plt.title("Diabetes: BMI vs Disease Progression")
plt.legend()
plt.tight_layout()
plt.show()

Project Breakdown

By the Numbers

Business Problem

Healthcare data is abundant but often underused. The challenge: can a simple model trained on basic patient measurements predict how far a patient's diabetes will progress over the next year? And for classification tasks like species identification, how cleanly can geometric measurements separate distinct categories, without any deep learning overhead?

One-Sentence Summary

Two scikit-learn toy datasets, two modeling approaches: a K-Nearest Neighbors classifier (k=3) that reaches 96% accuracy on the Iris data it was fitted on, and a single-feature Linear Regression on the Diabetes dataset with R² = 0.344, each examined through confusion matrices, coefficients, intercepts, and plots.

Tools & Libraries

scikit-learn · Python 3.11 · matplotlib · numpy · KNeighborsClassifier · LinearRegression · ConfusionMatrixDisplay

Key Features

KNN species classifier with interactive axis toggling to visualize feature separation. Confusion matrix revealing where versicolor and virginica overlap. Linear regression isolating BMI as the strongest diabetes predictor. Complete metric reporting: accuracy, coefficient, intercept, R², and misclassification list.

My Role

Sole developer — loaded and explored both datasets from scratch, chose appropriate model architectures, tuned hyperparameters (k value), interpreted all model outputs, and built the visualizations. Worked through the feature selection decision for the diabetes regression (BMI vs trying all 10 features).
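Tuning k can be sketched as a small sweep scored by cross-validation. The candidate values and the 5-fold setup below are hypothetical choices for illustration. The original project only reports the final k=3, not how it was selected.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical candidate values; the original only reports the final k=3.
candidate_ks = [1, 3, 5, 7, 9, 11]

mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 stratified folds (fold count is an assumption).
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k:<2}  mean CV accuracy: {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best k under this setup:", best_k)
```

On a dataset this small the scores for neighboring k values tend to be close, so the winner under one fold layout is not necessarily the winner under another.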

Biggest Challenge

Understanding why the confusion matrix showed Versicolor ↔ Virginica errors even at 96% accuracy required going back to the scatter plots. Petal length vs. petal width makes the overlap visible — they genuinely share measurement space. That was the moment sklearn clicked: models are only as good as the separability in your data.

What I Learned

The difference between classification and regression isn't just syntax — it's a fundamentally different question being asked. I also learned to read R² critically: 0.34 on noisy medical data with one feature is actually meaningful, not "bad." Feature selection matters more than model complexity at this scale.
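To put the single-feature R² in context, it can be compared with a fit on all 10 features. This is a minimal sketch using in-sample scores on the full dataset, as the project code does. In-sample R² can never decrease when features are added, so the comparison is optimistic for the larger model. The variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
bmi = X[:, [2]]  # column 2 is BMI

# Fit one model on BMI alone and one on every feature.
bmi_only = LinearRegression().fit(bmi, y)
all_features = LinearRegression().fit(X, y)

# Both scores are measured on the data used for fitting.
r2_bmi = bmi_only.score(bmi, y)
r2_all = all_features.score(X, y)

print(f"R^2, BMI only:        {r2_bmi:.4f}")
print(f"R^2, all 10 features: {r2_all:.4f}")
print(f"Share of full-model R^2 captured by BMI: {r2_bmi / r2_all:.0%}")
```

A held-out split or cross-validation would give a fairer view of how much the extra nine features add beyond BMI.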

Course Context

Built for AI: Principles and Application (4V98) at Baylor University. This project introduced supervised machine learning concepts using scikit-learn's built-in datasets — chosen specifically because they isolate modeling skill from data-cleaning noise. The iris dataset is a classification benchmark; the diabetes dataset introduces real-world regression complexity with overlapping, correlated features.

GitHub & Demo

GitHub Repository: Built in VS Code with GitHub Copilot. Full source includes both model scripts, the interactive chart visualizations embedded on this page, and inline comments explaining each sklearn step. Available on request.
