You shouldn't collect annotations only to let them gather dust in old spreadsheets. Created and used well, they become the unstructured gold you need to monotonically improve an analytical AI system over time.
A maximally useful expert annotation has the following schema:
| Field | What to capture | Why it matters |
|---|---|---|
| Input | The exact input the model received. | Lets you reproduce the case and understand what context the model had. |
| Model output | The model's response, ideally with a terse rationale justifying it. | Shows both the behavior and the apparent reasoning behind it. |
| Expert correction | The expert-corrected output, if a correction is necessary. | Provides the target behavior the system should learn. |
| Expert rationale | Why the correction is right, especially when the rationale differs from the model's. | Turns a single example into a reasoning artifact that can later be abstracted into a decision rule. |
| Inference metadata | Model used, system prompt, sampling parameters, timestamp, and related runtime details. | Keeps the annotation tied to the exact system behavior being reviewed. |
| Expert metadata | Labeler identity, timestamp, and review context. | Supports auditability and disagreement review. |
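To make the schema concrete, here is a minimal Python sketch of one annotation record. The class names, field names, and the `to_json_line` helper are illustrative assumptions that mirror the table rows, not a prescribed format or any particular tool's API, and the sample values are made up for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceMetadata:
    model: str             # model identifier used for the call
    system_prompt: str     # exact system prompt the model received
    sampling_params: dict  # e.g. temperature, top_p
    timestamp: str         # ISO 8601 string, UTC


@dataclass(frozen=True)
class ExpertMetadata:
    labeler_id: str        # who reviewed the output
    timestamp: str         # when the review happened (ISO 8601, UTC)
    review_context: str    # e.g. "weekly QA sample"


@dataclass(frozen=True)
class ExpertAnnotation:
    input: str                        # exact input the model received
    model_output: str                 # the model's response
    model_rationale: Optional[str]    # the model's own justification, if any
    expert_correction: Optional[str]  # None when no correction was needed
    expert_rationale: Optional[str]   # why the correction is right
    inference: InferenceMetadata
    expert: ExpertMetadata

    @property
    def needs_correction(self) -> bool:
        return self.expert_correction is not None

    def to_json_line(self) -> str:
        # One JSON object per line keeps records easy to append and diff.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example record; all values are invented for illustration.
annotation = ExpertAnnotation(
    input="Classify this support ticket: 'I was charged twice this month.'",
    model_output="account_access",
    model_rationale="The user mentions their account.",
    expert_correction="billing",
    expert_rationale="A duplicate charge is a billing issue, not a login issue.",
    inference=InferenceMetadata(
        model="example-model-v1",
        system_prompt="You are a support ticket classifier.",
        sampling_params={"temperature": 0.0},
        timestamp="2025-01-15T10:30:00Z",
    ),
    expert=ExpertMetadata(
        labeler_id="expert-042",
        timestamp="2025-01-16T09:00:00Z",
        review_context="weekly QA sample",
    ),
)

print(annotation.to_json_line())
```

Keeping the correction and both rationales as optional fields lets the same record type cover cases where the expert simply confirms the model's output.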
That's not so scary!
But it is a lot of work to keep these records clean, versioned, and accessible. Sutro helps with this by acting as an annotation store that can be used directly to modify model/agent behavior.