Classifiers | Sutro Handbook

Classifiers are perhaps the oldest form of AI in existence. The original perceptron demonstration in 1957 (the first neural network) was itself a binary classifier.

Nearly 70 years later, classification remains one of the most important decision models in existence, and underpins the operations of nearly all AI systems in some capacity.

Why humans love classifiers

For one thing, they're simple to understand. Humans easily process discrete decision units, and it's easy to model a decision problem as a classification problem. Should I walk or drive to work? Should I buy a red, blue, or yellow shirt? Should I hire this person or not?

Another reason: the math is simple. Accuracy is easy to understand: 8 correct predictions out of 10 is 80% accuracy. Precision, recall, and F1 scores are a bit more complex, but foundational for data scientists.
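To make those metrics concrete, here is a minimal sketch in plain Python. The function names and the ten example labels are illustrative assumptions, not part of any particular library; the data is chosen so that 8 of 10 predictions are correct.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 as a dict of floats."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 4 positives, 6 negatives, 8 of 10 predicted correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one positive is missed and one negative is flagged by mistake, so accuracy is 0.8 while precision, recall, and F1 each come out to 0.75.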

Lastly, they're flexible. The various types of classifiers extend the model to cover nearly any type of real-world decision. We'll cover a few of the canonical ones in this guide.

Why humans hate classifiers

Classification is often where someone first has to swallow a hard truth about machine learning: the fact that classifiers are not always correct is a feature, not a bug.

If you're making a decision or prediction that can be 100% correct by design, you should be building a deterministic filter, not a predictive classifier. If no deterministic filter is available, yet you find yourself with a 100% correct classifier on some sample data, you can reasonably expect it to be overfit to that data and to generalize poorly to out-of-sample data.
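As a deliberately extreme illustration of that overfitting warning, the sketch below "trains" a classifier that simply memorizes its sample data. The tiny support-ticket dataset and all function names are hypothetical, invented for this example.

```python
from collections import Counter


def fit_lookup(train):
    """Memorize every training example verbatim (an extreme overfit)."""
    table = dict(train)
    # Fall back to the most common training label for unseen inputs.
    default = Counter(label for _, label in train).most_common(1)[0][0]
    return table, default


def predict(model, text):
    table, default = model
    return table.get(text, default)


def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)


# Hypothetical sample data the model is fit on.
train = [
    ("refund my order", "billing"),
    ("card was charged twice", "billing"),
    ("invoice is wrong", "billing"),
    ("app crashes on login", "technical"),
    ("page will not load", "technical"),
]
# Hypothetical out-of-sample data the model has never seen.
held_out = [
    ("charged the wrong amount", "billing"),
    ("login button does nothing", "technical"),
    ("crash when uploading", "technical"),
    ("need a copy of my invoice", "billing"),
]

model = fit_lookup(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

The memorizing model is perfect on the data it was fit to and no better than always guessing the majority label on unseen data, which is why a perfect score on sample data should prompt a held-out check rather than a celebration.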

Accepting that a system will not work 100% of the time is a mental leap many software engineers have to make when moving into AI engineering. So the goal has to shift: instead of looking for 100% reliability, can you reset expectations toward producing a model that is roughly as good as, or better than, a human expert at the same task?

Where "AI" meets classification

As we mentioned, classifiers far predate modern notions of AI. What has shifted is how they're built, and where time is best spent optimizing their performance.

The rise of foundation models largely changed the way new classification systems get built. In the last generation of machine learning, the goal was to curate large corpora of training data and train a model from scratch. The architectures were well known, so most of the work revolved around curating the training data itself.

But foundation models are already pre-trained, meaning someone else did the work of curating data for you. Some of that data will be relevant to the task at hand, some not. The higher cost of inference effectively reflects using that whole pile of useful and not-so-useful data at once. Regardless, the task shifts from "pre-training" the model to "post-training" it, which is effectively a steering exercise.

This means, depending on task complexity, that you only need a small amount of data to steer the model towards the behavior you want. And the good news: this can almost always be done in-context, meaning you don't really need to update the weights of the foundation model itself, but rather just tell it how you want it to behave via prompting or other in-context learning methods we'll discuss in this guide.
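A minimal sketch of what in-context steering can look like: a handful of labeled examples are placed in the prompt, and the model's reply is checked against the allowed label set. The function names, labels, and example texts are all hypothetical, and no model is actually called here; the prompt would be sent through whatever model client you use.

```python
def build_few_shot_prompt(instruction, labels, examples, text):
    """Assemble a few-shot classification prompt as a single string."""
    lines = [instruction, "Allowed labels: " + ", ".join(labels), ""]
    for example_text, example_label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Label: {example_label}")
        lines.append("")
    # End with the unlabeled input so the model completes the label.
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_label(response, labels):
    """Map a raw model reply to an allowed label, or None if it is not one."""
    cleaned = response.strip().lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


# Hypothetical task definition and steering examples.
LABELS = ["billing", "technical", "other"]
EXAMPLES = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I log in", "technical"),
    ("Do you have a store in Lisbon?", "other"),
]

prompt = build_few_shot_prompt(
    "Classify the support message into exactly one label.",
    LABELS,
    EXAMPLES,
    "My invoice shows the wrong amount",
)
print(prompt)
```

Returning None for anything outside the label set is one possible design choice: it keeps free-form replies from silently becoming labels, and leaves the retry or fallback policy to the caller.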

When to (and not to) use Foundation models for classification tasks

Some classification problems are inherently simple, and more or less objective. As we mentioned earlier in the guide, much of the goal of analytical AI is to compress a task into the smallest model that still performs well on its evals and meets its production tradeoffs.

The core tradeoff is whether the task benefits enough from foundation model capability to justify paying for inference instead of spending time curating data for a smaller, narrower system:

Decision pressure	Foundation models are less justified when...	Foundation models are more justified when...
Time to curate data	Useful labels are easy to produce, the task is stable, and a small dataset can describe the decision boundary well.	Labels are expensive, sparse, or slow to collect, and the model's pre-training can substitute for a large task-specific corpus.
Inference cost	The task runs at extremely high volume, has tight latency constraints, or the decision is low-value from a business perspective.	The classification is high-value, or important enough that the marginal cost of inference is acceptable.
Task complexity	The label can be assigned from obvious features, deterministic filters, keyword rules, embeddings, or a small classifier.	The model must handle messy inputs, many edge cases, subjective criteria, or shifting definitions of what the label means.
Need for autoregressive reasoning	The task is pattern recognition, not reasoning; step-by-step generation adds cost and variance without improving the decision.	The model must interpret context, weigh ambiguous evidence, follow nuanced instructions, or explain why a label applies.
Need for re-training	Models will be "stable" and not frequently updated against new task understanding or data drift.	The task definition changes frequently, or the data distribution continually shifts.

Generally speaking, more and more teams are reaching for foundation models instead of traditional ML for classification. As foundation models continue to improve and inference costs decrease, spending time on curating data is increasingly seen as less worthwhile than simply paying higher inference costs (ML engineers and expert annotators are expensive!).
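The tradeoff between paying for inference and paying to curate data can be framed as simple break-even arithmetic. The sketch below is only an illustration: every price, label count, and engineering cost is a made-up assumption, and real comparisons would also weigh quality, latency, and maintenance.

```python
import math


def upfront_cost(n_labels, cost_per_label, engineering_cost):
    """One-time cost of curating data and building a small model."""
    return n_labels * cost_per_label + engineering_cost


def break_even_volume(fm_cost_per_item, small_cost_per_item, upfront):
    """Items needed before the small model's upfront cost pays for itself.

    Returns None when the small model is not cheaper per item, in which
    case it never breaks even on cost alone.
    """
    saving_per_item = fm_cost_per_item - small_cost_per_item
    if saving_per_item <= 0:
        return None
    return math.ceil(upfront / saving_per_item)


# Hypothetical numbers, for illustration only.
FM_COST_PER_ITEM = 0.002       # foundation model inference, dollars per item
SMALL_COST_PER_ITEM = 0.0001   # small model inference, dollars per item
upfront = upfront_cost(n_labels=5_000, cost_per_label=0.50,
                       engineering_cost=10_000)

volume = break_even_volume(FM_COST_PER_ITEM, SMALL_COST_PER_ITEM, upfront)
print(f"upfront: ${upfront:,.0f}, break-even volume: {volume:,} items")
```

Under these assumed numbers the small model only pays off after several million classifications, which is one way to see why lower-volume or fast-changing tasks tend to favor paying for inference.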
