The F1 score is a core metric in machine learning, especially when you’re checking fairness across different demographic groups.

Here’s a clear, practical explanation you can use in your QA workflow.

What F1 Stands For

F1 is the F-score with β = 1, meaning Precision and Recall are weighted equally: it is the harmonic mean of the two.

It combines both into a single number so you can judge how well a model balances:

  • Precision — how many predicted positives were actually correct
  • Recall — how many actual positives the model successfully found

The formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
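
As a quick sanity check, here is a minimal Python sketch of that formula; the tp/fp/fn counts are made up for illustration (roughly matching Group A in the table further down):

  # Precision, recall, and F1 from raw confusion-matrix counts.
  def f1_score(tp: int, fp: int, fn: int) -> float:
      precision = tp / (tp + fp) if (tp + fp) else 0.0
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      if precision + recall == 0:
          return 0.0
      return 2 * precision * recall / (precision + recall)

  # Illustrative counts: 88 true positives, 10 false positives, 12 false negatives.
  print(f1_score(tp=88, fp=10, fn=12))  # precision ≈ 0.90, recall = 0.88, F1 ≈ 0.89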

Why F1 Matters in Bias Detection

When you evaluate a model across demographic groups, F1 is extremely useful because:

  • It penalizes models that do well on one metric but poorly on the other
  • It gives a balanced view of performance
  • It highlights unequal treatment between groups

For example:

Group      Precision   Recall   F1
Group A    0.90        0.88     0.89
Group B    0.72        0.55     0.62

Even if overall accuracy looks similar for both groups, the F1 scores expose that the model performs substantially worse for Group B.
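
To reproduce that kind of per-group comparison, a short sketch using scikit-learn's f1_score follows; the labels, predictions, and group array are tiny illustrative placeholders, not real data:

  # Compute F1 separately for each demographic group.
  import numpy as np
  from sklearn.metrics import f1_score

  y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # ground-truth labels
  y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # model predictions
  group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # demographic group per row

  for g in np.unique(group):
      mask = group == g
      print(g, round(f1_score(y_true[mask], y_pred[mask]), 2))  # A: 0.8, B: 0.5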

When to Use F1

Use F1 when:

  • Classes are imbalanced (illustrated in the sketch below)
  • False positives and false negatives both matter
  • You want a single fairness metric to compare across groups

It’s especially common in:

  • Fraud detection
  • Hiring models
  • Medical risk prediction
  • Moderation systems
  • Credit scoring
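
The class-imbalance point above is easiest to see with a toy example: on heavily imbalanced data, a model that always predicts the majority class still scores high on accuracy while F1 collapses. The data and the always-negative "model" below are made up purely for illustration:

  # Accuracy vs. F1 on imbalanced data: 5 positives, 95 negatives.
  import numpy as np
  from sklearn.metrics import accuracy_score, f1_score

  y_true = np.array([1] * 5 + [0] * 95)
  y_pred = np.zeros_like(y_true)  # always predict the negative (majority) class

  print("accuracy:", accuracy_score(y_true, y_pred))        # 0.95, looks fine
  print("F1:", f1_score(y_true, y_pred, zero_division=0))   # 0.0, exposes the problem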

Quick Intuition

Think of F1 as the “fairness‑friendly” metric:

  • Precision alone can hide bias
  • Recall alone can hide bias
  • F1 exposes it