NER Comparison & Annotation Tool

Compare named-entity recognition (NER) model outputs side by side, score them, find disagreements, and build labeled gold datasets by hand or by correcting model predictions. JSON in, JSON out. No account is required, and your data stays in the browser.

The two modes

Compare model outputs — upload a JSON file containing one or more modelResponses per example and review them side by side. The tool computes per-model precision, recall, and F1 against any humanAnnotations, surfaces disagreements, and exposes a cross-model label confusion matrix.
Annotate text — start a new project (or upload existing annotations) and label spans by hand. Each annotation goes into humanAnnotations; export the file to keep your gold dataset.
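
To make the scoring in Compare mode concrete, here is a minimal Python sketch of per-model precision, recall, and F1 against humanAnnotations. It assumes that a prediction counts as correct only when its start, end, and label all match a gold entity exactly, and that every entity carries start and end offsets. The tool's actual matching rule is not described here and may differ. The function name score_model is hypothetical.

```python
def score_model(examples, model_name):
    """Micro-averaged precision/recall/F1 for one model.

    Assumption: exact match on (start, end, label); not necessarily
    the rule the tool itself applies.
    """
    tp = fp = fn = 0
    for ex in examples:
        gold = {(e["start"], e["end"], e["label"])
                for e in ex.get("humanAnnotations", [])}
        pred = set()
        for resp in ex.get("modelResponses", []):
            if resp["modelName"] == model_name:
                pred |= {(e["start"], e["end"], e["label"])
                         for e in resp["entities"]}
        tp += len(gold & pred)   # predicted and in gold
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Calling score_model(data["examples"], "Model A") on a loaded file returns a dict with the three metrics. Counts are pooled over all examples before dividing, so examples with many entities weigh more than examples with few.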

JSON file format

The full schema is at /data-schema.json. Uploads are validated against it with Ajv; a file that does not match the schema produces an error keyed by the path of the offending field. Minimal example:

{
  "schemaVersion": 2,
  "examples": [
    {
      "id": "q1",
      "text": "OpenAI was founded in San Francisco in 2015.",
      "modelResponses": [
        {
          "modelName": "Model A",
          "inferenceTime": 0.045,
          "entities": [
            { "text": "OpenAI", "label": "ORG", "start": 0, "end": 6, "confidence": 0.95 },
            { "text": "San Francisco", "label": "LOC", "start": 22, "end": 35, "confidence": 0.92 }
          ]
        }
      ],
      "humanAnnotations": [
        { "text": "OpenAI", "label": "ORG", "start": 0, "end": 6 }
      ]
    }
  ],
  "modelNames": ["Model A"]
}

Required fields

examples — array of objects.
examples[].id — unique string identifier.
examples[].text — the source text.
examples[].modelResponses — array (may be empty for hand-annotation projects).
examples[].modelResponses[].modelName — string.
examples[].modelResponses[].entities — array of { text, label }.
modelNames — array of model name strings (top-level).
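
If you generate files from a script, a quick pre-upload check of the required fields can catch mistakes early. The sketch below is a hand-written approximation of the list above, not a replacement for /data-schema.json or for the Ajv validation the tool runs. The function name check_required, the message wording, and the assumption that entity text and label are strings are all hypothetical.

```python
def check_required(data):
    """Return path-style messages for missing or mistyped required fields."""
    errors = []
    if not isinstance(data.get("modelNames"), list):
        errors.append("/modelNames: expected an array")
    examples = data.get("examples")
    if not isinstance(examples, list):
        errors.append("/examples: expected an array")
        return errors
    seen_ids = set()
    for i, ex in enumerate(examples):
        base = f"/examples/{i}"
        if not isinstance(ex, dict):
            errors.append(f"{base}: expected an object")
            continue
        ex_id = ex.get("id")
        if not isinstance(ex_id, str):
            errors.append(f"{base}/id: expected a string")
        elif ex_id in seen_ids:
            errors.append(f"{base}/id: duplicate id")
        else:
            seen_ids.add(ex_id)
        if not isinstance(ex.get("text"), str):
            errors.append(f"{base}/text: expected a string")
        responses = ex.get("modelResponses")
        if not isinstance(responses, list):
            errors.append(f"{base}/modelResponses: expected an array")
            continue
        for j, resp in enumerate(responses):
            rbase = f"{base}/modelResponses/{j}"
            if not isinstance(resp, dict):
                errors.append(f"{rbase}: expected an object")
                continue
            if not isinstance(resp.get("modelName"), str):
                errors.append(f"{rbase}/modelName: expected a string")
            entities = resp.get("entities")
            if not isinstance(entities, list):
                errors.append(f"{rbase}/entities: expected an array")
                continue
            for k, ent in enumerate(entities):
                for key in ("text", "label"):
                    # Assumption: both entity fields are strings.
                    if not isinstance(ent, dict) or not isinstance(ent.get(key), str):
                        errors.append(f"{rbase}/entities/{k}/{key}: expected a string")
    return errors
```

An empty list means the required fields are present. The schema itself remains the authority on types and on any constraints this sketch leaves out.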

Optional fields

schemaVersion — integer; current schema is 2.
examples[].humanAnnotations — array of gold entities.
examples[].rejectedPredictions — predictions the user has dismissed.
entity.start, entity.end — character offsets into text; end is exclusive, so in the example above "OpenAI" has start 0 and end 6.
entity.confidence — number in [0, 1].
customLabelColors, savedThemes, labelDefinitions.
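
Because offsets are optional and easy to get wrong, it can help to verify them before upload. The Python sketch below assumes that end is exclusive (consistent with the minimal example, where "OpenAI" has start 0 and end 6) and that slicing text with the offsets should reproduce the entity's text field. The helper names check_offsets and fill_offsets are hypothetical.

```python
def check_offsets(example):
    """Return (model_name, entity) pairs whose offsets do not slice out entity['text']."""
    text = example["text"]
    bad = []
    for resp in example.get("modelResponses", []):
        for ent in resp["entities"]:
            if "start" not in ent or "end" not in ent:
                continue  # offsets are optional, nothing to check
            if text[ent["start"]:ent["end"]] != ent["text"]:
                bad.append((resp["modelName"], ent))
    return bad


def fill_offsets(text, entity):
    """Return a copy of entity with start/end taken from the FIRST occurrence of its text.

    Assumption: first occurrence is the intended one; repeated mentions
    need a smarter alignment than this sketch provides.
    """
    start = text.find(entity["text"])
    if start == -1:
        raise ValueError(f"{entity['text']!r} not found in text")
    return {**entity, "start": start, "end": start + len(entity["text"])}
```

One caveat: Python indexes strings by code point, while JavaScript strings are indexed by UTF-16 code unit. If the tool counts offsets the JavaScript way (an assumption, since it runs in the browser), the two can disagree for text containing characters outside the Basic Multilingual Plane, such as most emoji.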

Export format

The Export JSON button writes the same shape back out, plus a scores map keyed by example id and model name (1–5 star ratings, category, notes) and a metadata block with totals and per-model precision / recall / F1.

Common tasks — Compare mode

Filter to disagreements or errors-vs-gold to focus your review.
Click "Show label confusion" to find merge candidates across models.
Promote a model prediction to gold (✓), or reject it (✗), inline.
Score each model 1–5 stars per example; the Model Summary aggregates totals.
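
The idea behind the label confusion view can be sketched offline in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: two predictions are treated as the same span only when their start and end are identical, and label pairs are counted without regard to which model produced which label. The function names label_confusion and merge_candidates are hypothetical.

```python
from collections import Counter
from itertools import combinations


def label_confusion(examples):
    """Count unordered label pairs that two models assign to the same span."""
    counts = Counter()
    for ex in examples:
        # One {(start, end): label} map per model response in this example.
        span_maps = [
            {(e["start"], e["end"]): e["label"] for e in resp["entities"]}
            for resp in ex.get("modelResponses", [])
        ]
        for spans_a, spans_b in combinations(span_maps, 2):
            for span in spans_a.keys() & spans_b.keys():
                pair = tuple(sorted((spans_a[span], spans_b[span])))
                counts[pair] += 1
    return counts


def merge_candidates(counts):
    """Keep only pairs with two different labels, most frequent first."""
    return [(pair, n) for pair, n in counts.most_common() if pair[0] != pair[1]]
```

A pair such as ("ORG", "Organization") with a high count suggests that two models use different names for the same category, which makes it a natural candidate for merging.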

Common tasks — Annotate mode

Highlight any span in an example to assign a label or create a new one.
Use the Label Editor for batch rename / merge / delete (all examples or filtered subset).
Add per-label guidelines (description + positive / counter examples) for consistent annotation.
Export at any time; the file round-trips back into the tool unchanged.
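
Because the exported file round-trips, batch edits can also be scripted outside the tool. The Python sketch below renames one label in humanAnnotations, optionally restricted to a subset of example ids; renaming to a label that already exists amounts to a merge. It only mirrors what the Label Editor is described as doing, and the function name rename_label and its parameters are hypothetical.

```python
def rename_label(data, old, new, example_ids=None):
    """Rename label `old` to `new` in humanAnnotations, in place.

    example_ids: optional collection of ids to restrict the change to;
    None means all examples. Returns the number of annotations changed.
    """
    changed = 0
    for ex in data["examples"]:
        if example_ids is not None and ex["id"] not in example_ids:
            continue
        for ann in ex.get("humanAnnotations", []):
            if ann["label"] == old:
                ann["label"] = new
                changed += 1
    return changed
```

For example, rename_label(data, "Organization", "ORG") folds every "Organization" annotation into "ORG". The sketch leaves modelResponses and any labelDefinitions untouched; whether those should change too depends on your project.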

See also: /llms.txt (markdown mirror for LLM agents) and /example-data.json (full working example).