Types of Classifiers | Sutro Handbook

This guide does not cover every type of classifier, only the ones relevant to what we've seen while working with customers.

The first design choice is the shape of the output. Most classifier designs fall into a few common patterns:

Type	Output	Use when...
Binary	One of two labels	The decision is naturally yes/no, pass/fail, relevant/irrelevant, or otherwise reducible to two outcomes.
Multiclass	Exactly one label from a fixed set	The categories are mutually exclusive, and every input should map to one best answer.
Multilabel	Any subset of labels from a fixed set	Multiple labels can apply at once, such as user intents, content topics, risk flags, or product attributes.
Hierarchical	One or more labels across levels of a taxonomy	The decision has a natural parent-child structure, such as support category -> issue type -> root cause.
Open-set	A known label, `other`, or a proposed new label	The taxonomy is still evolving, or you expect meaningful inputs that do not fit the current label set.
Ordinal / scoring	An ordered label, score band, or Likert-style rating	The output has meaningful order, such as severity, quality, risk, urgency, or degree of fit.

Binary

The "hello world" of classification. Binary classifiers make one of two possible decisions. If the task can honestly be represented as a yes/no question, start here.

Use binary classifiers when the decision boundary is narrow and easy to reason about. Common label sets include:

true / false
pass / fail
relevant / irrelevant
should_escalate / should_not_escalate

Avoid binary classifiers when there is a meaningful third state. If a human expert would often say "unclear", "not enough information", or "partially", a binary label set forces the model to commit to a decision even when it is unsure, which usually produces false certainty.
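One way to notice a missing third state early is to reject anything outside the two labels instead of coercing it. The following is a minimal sketch, assuming the model returns a free-text string; `parse_binary_label` and the `pass` / `fail` label set are hypothetical names chosen for illustration.

```python
from typing import Literal

BinaryLabel = Literal["pass", "fail"]
ALLOWED_LABELS = ("pass", "fail")  # assumed label set, for illustration only


def parse_binary_label(raw: str) -> BinaryLabel:
    """Normalize raw model output and require exactly one of the two labels."""
    cleaned = raw.strip().lower()
    if cleaned not in ALLOWED_LABELS:
        # Do not guess: an out-of-set answer is a signal worth logging.
        raise ValueError(f"unexpected label: {raw!r}")
    return cleaned  # type: ignore[return-value]
```

If rejections such as "unclear" show up often, that is a hint the task may need a third class rather than a stricter prompt.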

Multiclass

Multiclass classifiers choose exactly one label from a closed set of possible labels. Often, this is the next step up from a binary classifier, or a way to extend a binary decision to include a third class.

Use multiclass classifiers when the labels are mutually exclusive and each input should have one best answer. Some examples:

true, false, unclear
pass, fail, insufficient_evidence
billing_issue, technical_issue, account_issue, product_feedback

Avoid multiclass classifiers when labels can reasonably overlap. For example, a support ticket can be both a billing issue and a cancellation risk. Forcing one label may make downstream routing simpler, but it can erase information the business actually needs.
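The "exactly one label from a closed set" contract can be enforced in code. This is a sketch under stated assumptions: the `TicketCategory` enum and `parse_category` helper are hypothetical, and the label values are taken from the example set above.

```python
from enum import Enum


class TicketCategory(str, Enum):
    """Closed label set; values mirror the example labels in the text."""
    BILLING_ISSUE = "billing_issue"
    TECHNICAL_ISSUE = "technical_issue"
    ACCOUNT_ISSUE = "account_issue"
    PRODUCT_FEEDBACK = "product_feedback"


def parse_category(raw: str) -> TicketCategory:
    """Return the single matching label, or raise ValueError if none matches."""
    # Enum lookup by value fails for unknown labels and for
    # comma-separated answers that try to return two labels at once.
    return TicketCategory(raw.strip().lower())
```

An output like `billing_issue, account_issue` fails validation here. If that happens regularly, the labels probably overlap and a multilabel design may fit better.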

Multilabel

Multilabel classifiers choose any subset of labels from a closed set. They are the right choice when a single categorical label cannot fully express the decision.

Use multilabel classifiers for intent detection, topic tagging, risk tagging, product attributes, and other cases where multiple things can be true at once. An example label set might look like:

user_asked_follow_up_question
user_not_satisfied
user_submitted_claim
user_filed_case_report
user_requested_human_support_agent

Avoid multilabel classifiers when the label set is really a hierarchy or when one label should logically exclude another. In those cases, a multilabel task can produce combinations that look valid syntactically but do not make sense operationally.

Multilabel classifiers can also be harder to score precisely, since labels do not occur independently of one another: one label may be chosen more often when another label is also chosen. This means aggregate accuracy is often less useful than per-label precision and recall, combined with a review of common co-occurrences.
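Per-label scoring and co-occurrence review can be sketched in a few lines of plain Python. The function names below are hypothetical, and the sketch assumes gold and predicted labels are available as one set of labels per example, aligned by position.

```python
from collections import Counter
from itertools import combinations


def per_label_metrics(gold, pred, labels):
    """Compute precision and recall separately for each label.

    gold, pred: lists of label sets, one set per example, in the same order.
    Returns None for a metric whose denominator is zero.
    """
    metrics = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
        }
    return metrics


def co_occurrence_counts(label_sets):
    """Count how often each unordered pair of labels appears together."""
    counts = Counter()
    for labels in label_sets:
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts
```

Comparing `co_occurrence_counts` on gold labels against the same counts on predictions is one way to spot pairs the model applies together more often than reviewers do.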

Hierarchical

Hierarchical classifiers assign labels across levels of a taxonomy. For example, a support classifier might assign billing -> refund -> duplicate_charge, rather than choosing only one flat category.

Use hierarchical classifiers when the domain already has a meaningful parent-child structure. They are useful for support taxonomies, product taxonomies, policy categories, medical or legal issue trees, and other settings where broad categories break down into narrower subtypes.

There are two common implementation patterns:

Predict every level at once, such as category, subcategory, and root cause.
Predict stepwise, where the first classifier picks the parent category and later classifiers pick narrower children.
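Both patterns can be sketched against a small taxonomy. Everything here is illustrative: the `TAXONOMY` dictionary is hypothetical apart from the billing -> refund -> duplicate_charge path mentioned earlier, and `choose` stands in for whatever classifier call picks one option at each level.

```python
from typing import Callable, Dict, List

# Hypothetical taxonomy as nested dicts; an empty dict marks a leaf.
TAXONOMY: Dict[str, dict] = {
    "billing": {
        "refund": {"duplicate_charge": {}, "late_refund": {}},
        "invoice": {},
    },
    "technical": {"login": {}, "performance": {}},
}


def classify_stepwise(text: str, choose: Callable[[str, List[str]], str]) -> List[str]:
    """Walk the taxonomy top-down, asking `choose` for one child per level."""
    path: List[str] = []
    node = TAXONOMY
    while node:  # stop once a leaf (empty dict) is reached
        label = choose(text, sorted(node))
        if label not in node:
            raise ValueError(f"{label!r} is not a valid child at level {len(path)}")
        path.append(label)
        node = node[label]
    return path


def is_valid_path(path: List[str]) -> bool:
    """Check a path predicted all at once for parent-child consistency.

    Accepts any non-empty path that follows the taxonomy from the root,
    including paths that stop before a leaf.
    """
    if not path:
        return False
    node = TAXONOMY
    for label in path:
        if label not in node:
            return False
        node = node[label]
    return True
```

The stepwise version can only produce valid paths, at the cost of one classifier call per level. The all-at-once version needs a check like `is_valid_path`, because a model can return a child that does not belong to the parent it picked.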

Open set

An open-set classifier is a special type of multiclass or (more typically) multilabel classifier that does not restrict the model to a fixed (closed) set of labels. Instead, the model can propose a new label and apply it when no label in the existing set fits.

Open-set labeling requires delicate post-processing and can be particularly risky, as it carries more model-induced bias than a closed-set task.

Use open-set classifiers when the taxonomy is still evolving, or when you are mining a dataset to discover latent clusters before committing to a closed label set.

Avoid open-set classifiers in production workflows that require stable reporting, routing, or enforcement. A model that can invent labels can also invent slightly different versions of the same label, encode its own biases into the taxonomy, or create categories that are hard for humans to interpret.

Open-set outputs usually need post-processing. At minimum, you should review proposed labels, merge duplicates, normalize wording, and decide which new labels deserve to become part of the closed set.
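The mechanical part of that post-processing (normalizing wording and merging surface-level duplicates) can be sketched as follows. The helper names are hypothetical, and the sketch assumes proposed labels arrive as free-text strings.

```python
import re
from collections import Counter


def normalize_label(label: str) -> str:
    """Lowercase, trim, and collapse punctuation or whitespace into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", label.strip().lower()).strip("_")


def summarize_proposals(raw_labels, known_labels):
    """Return proposed (non-closed-set) labels with counts, most frequent first.

    Labels that normalize to a known label are treated as known and dropped.
    """
    known = {normalize_label(k) for k in known_labels}
    proposed = Counter()
    for raw in raw_labels:
        norm = normalize_label(raw)
        if norm and norm not in known:
            proposed[norm] += 1
    return proposed.most_common()
```

This only merges spelling and formatting variants such as `shipping-delay` and `Shipping Delay`. It cannot tell that two differently worded labels mean the same thing, so the resulting list is an input to human review, not a replacement for it.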

The most common failure mode is treating model-suggested labels as ground truth. Open-set labeling is useful for discovery, but the taxonomy still needs human ownership: for production use, proposed labels usually need expert review before they become part of a stable taxonomy.

Ordinal / scoring

Ordinal classifiers assign one label from an ordered set. The labels are discrete, but their order matters.

Use ordinal classifiers when the task is not just asking what kind of thing something is, but how much of some property it has.

Common output shapes include:

low / medium / high risk
poor / fair / good / excellent
1 / 2 / 3 / 4 / 5
not_urgent / somewhat_urgent / very_urgent

Avoid ordinal classifiers when the labels imply more precision than the model or rubric can support. A 1-5 (Likert) score is only useful if humans can reliably agree on what each number means.
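Whether humans agree on what each level means can be checked before building the classifier. The following is a rough sketch, assuming two reviewers have rated the same items on the same scale; the `SCALE` list and `ordinal_agreement` function are hypothetical names, and the three measures shown are simple illustrations, not a full inter-rater reliability analysis.

```python
# Ordered label set; position in the list defines the rank.
SCALE = ["poor", "fair", "good", "excellent"]
RANK = {label: i for i, label in enumerate(SCALE)}


def ordinal_agreement(rater_a, rater_b):
    """Compare two aligned lists of ordinal labels.

    Returns the share of exact matches, the share of ratings at most one
    step apart, and the mean absolute rank difference.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    diffs = [abs(RANK[a] - RANK[b]) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }
```

Because the labels are ordered, a disagreement between `poor` and `excellent` counts for more than one between `good` and `excellent`. Low exact agreement with high within-one agreement may suggest the scale has more levels than the rubric can support.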