A common theme in this guide is that we prefer building models whose "correctness" can be verified, or at least checked against an expert-annotated dataset. This lets us establish a ground truth and gives us a scoring rubric to improve against.
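For closed-set fields, where the output must come from a fixed set of values, verification looks much like classifier evaluation: compare each prediction to the annotated label and report metrics such as accuracy, precision, and recall.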
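As a rough illustration of what classifier-style verification can look like for a single closed-set field, here is a minimal sketch in plain Python. The function names and the document-type labels ("invoice", "receipt", "other") are hypothetical and chosen only for the example; they are not taken from this guide.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the annotated label."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("predicted and expected must be non-empty and equal in length")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def per_label_metrics(
    predicted: Sequence[str], expected: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Precision and recall for each label seen in either list."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must be equal in length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for p, e in zip(predicted, expected):
        if p == e:
            true_pos[e] += 1
        else:
            false_pos[p] += 1  # model said p, but it was wrong
            false_neg[e] += 1  # model missed the true label e
    metrics = {}
    for label in sorted(set(predicted) | set(expected)):
        predicted_count = true_pos[label] + false_pos[label]
        actual_count = true_pos[label] + false_neg[label]
        metrics[label] = {
            # Report 0.0 when a label was never predicted or never annotated.
            "precision": true_pos[label] / predicted_count if predicted_count else 0.0,
            "recall": true_pos[label] / actual_count if actual_count else 0.0,
        }
    return metrics


# Hypothetical ground truth and model output for a "document type" field.
expected_labels = ["invoice", "receipt", "invoice", "other"]
predicted_labels = ["invoice", "invoice", "invoice", "other"]

print(accuracy(predicted_labels, expected_labels))
print(per_label_metrics(predicted_labels, expected_labels))
```

Per-label precision and recall are often more informative than overall accuracy when some values are rare, since a model can score well overall while consistently missing an uncommon label.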
Free-form text similarity
Extractors are trickier in this regard because they are typically allowed to generate free-form text by design, so an exact-match comparison against the ground truth is usually too strict. To handle this, we recommend comparing embeddings, which are cheap to run and widely available as open-source models. The sentence-transformers library (available on GitHub) is usually sufficient for this purpose, especially when text spans are short.
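Using embeddings for similarity scoring isn't perfect, but cosine similarity or a similar scoring method yields a bounded score that is sufficient for most measurement purposes. Strictly speaking, cosine similarity ranges from -1 to 1; for sentence embeddings it mostly falls between 0 and 1, and negative values can be clipped to 0 if a strict 0-to-1 score is needed.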
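The scoring step described above can be sketched as follows. This is a minimal example under stated assumptions, not a definitive implementation: the helper names (cosine_similarity, clipped_score, score_extractions, sentence_transformer_embedder) are hypothetical, the model name "all-MiniLM-L6-v2" is just one commonly used sentence-transformers model picked for illustration, and clipping negative similarities to 0 is one possible way to get a 0-to-1 score. The cosine computation is written in plain Python so the scoring logic can be read and tested without downloading a model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Embedder = Callable[[Sequence[str]], List[List[float]]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def clipped_score(a: Vector, b: Vector) -> float:
    """Cosine similarity with negative values clipped to 0 (assumed convention)."""
    return max(0.0, cosine_similarity(a, b))


def score_extractions(
    predicted: Sequence[str], expected: Sequence[str], embed: Embedder
) -> List[float]:
    """Score each extracted string against its ground-truth counterpart."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must be equal in length")
    predicted_vectors = embed(predicted)
    expected_vectors = embed(expected)
    return [clipped_score(p, e) for p, e in zip(predicted_vectors, expected_vectors)]


def sentence_transformer_embedder(model_name: str = "all-MiniLM-L6-v2") -> Embedder:
    """Build an embedder backed by sentence-transformers.

    The import is deferred so the scoring helpers above work without the
    package installed. The default model name is an assumption.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts)).tolist()


# Example usage (requires `pip install sentence-transformers` and a model download):
# embed = sentence_transformer_embedder()
# scores = score_extractions(["Acme Corp."], ["ACME Corporation"], embed)
```

Passing the embedder in as a function keeps the scoring logic independent of any particular model, so a different embedding model can be swapped in without touching the evaluation code. A pass/fail threshold on the resulting score is best chosen by inspecting scores on the annotated dataset rather than fixed in advance.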