
Introduction

I thought logistic regression would be a reliable baseline.
But I got 58% accuracy — not even close to acceptable.

Here’s how I debugged it, what mistakes I found, and what ultimately helped.


1. “Did I Forget to Normalize?”

Logistic regression isn’t distance-based, but feature scaling still matters.
→ Apply StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit scaling statistics on training data only
X_test_scaled = scaler.transform(X_test)        # reuse them on test data to avoid leakage

📈 Accuracy rose slightly, to 60%.
✅ Somewhat better, but still lacking.


2. “Is This Due to Class Imbalance?”

import numpy as np

np.bincount(y_train)
# Output: [870, 130]

💡 Severe imbalance. Most predictions defaulted to class 0.
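
A quick way to confirm that, using the same bincount trick on the predictions (a minimal check; it assumes the unweighted baseline model from the 60% run above is available as model, which isn't shown in this post):

# Count predicted labels; `model` here is the unweighted baseline
# (an assumption, since that fit isn't shown above)
np.bincount(model.predict(X_test_scaled))
# If almost everything lands in class 0, imbalance is driving predictions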

Fix: Use class_weight='balanced', which weights each class inversely to its frequency so the minority class counts more in the loss.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')  # upweight the minority class
model.fit(X_train_scaled, y_train)

📈 Accuracy now 66%. Recall improved significantly.


3. “What If Accuracy Isn’t the Best Metric?”

Check precision, recall, and F1 instead:

from sklearn.metrics import classification_report

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

Insight: Accuracy hid key issues.

  • Precision was okay, but recall was still low
  • The F1-score showed room for threshold tuning

What I Realized

  • Logistic regression is sensitive to preprocessing.
  • Imbalanced data can ruin performance.
  • Accuracy alone is a misleading metric — look at the full picture.

What I Want to Do Next

  • Try threshold tuning for F1 maximization (first sketch below)
  • Experiment with SMOTE for resampling (second sketch)
  • Compare logistic regression with tree-based models (third sketch)
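
For threshold tuning, the idea is to sweep the decision threshold over the predicted probabilities and keep the one that maximizes F1. A minimal sketch, reusing the variables above (in practice the threshold should be picked on a separate validation split, not the test set):

import numpy as np
from sklearn.metrics import f1_score

# Probability of the positive class from the balanced model above
probs = model.predict_proba(X_test_scaled)[:, 1]

# Sweep candidate thresholds and keep the one with the best F1
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold: {best_t:.2f}, F1: {max(scores):.3f}")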
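
For SMOTE, the imbalanced-learn package has a ready-made implementation. A sketch, assuming imbalanced-learn is installed (resampling must only touch the training split, never the test data):

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Synthesize new minority-class samples on the training set only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train_scaled, y_train)

smote_model = LogisticRegression()
smote_model.fit(X_res, y_res)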
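
And for the tree-based comparison, a random forest seems like a reasonable first candidate; this is just a sketch of the experiment, not a tuned model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)  # unscaled features are fine for trees
print(classification_report(y_test, rf.predict(X_test)))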

The model wasn’t bad —
I just didn’t understand it well enough. Until now.