# Evaluation Methodology
This page describes riverbank's external evaluation methodology using Wikidata as ground truth, introduced in v0.15.0 and refined with the v0.15.1 improvement loop.
## Overview
riverbank's extraction quality is measured by comparing compiled triples against Wikidata's curated statements for the same Wikipedia articles. Wikidata is chosen because it is:
- Large — 1.65 billion statements from 110 million items
- Sourced from Wikipedia — the same articles riverbank ingests
- Human-curated — each statement has at least one reference
- Structured — typed properties (P-ids) enable automated matching
The evaluation pipeline is reproducible and fully automated. Results are stored
in `eval/results/` and are never committed to the repository.
## Benchmark Dataset
The benchmark is defined in `eval/wikidata-benchmark-1k.yaml` and contains
1,000 Wikipedia articles stratified across 7 domains:
| Domain | Articles | Description |
|---|---|---|
| `biography_living` | 150 | Living notable persons |
| `biography_historical` | 200 | Deceased notable persons |
| `organization` | 150 | Companies, NGOs, governments |
| `geographic` | 150 | Cities, rivers, mountains, regions |
| `creative_work` | 150 | Films, novels, artworks, albums |
| `scientific` | 100 | Theories, phenomena, discoveries |
| `event` | 100 | Wars, disasters, cultural events |
Stratification ensures that no single domain dominates the aggregate metrics.
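The dataset file itself is not reproduced here. As a rough sketch of what a stratified benchmark YAML could look like (all field names and titles here are illustrative assumptions, not the actual schema):

```yaml
# Hypothetical structure for eval/wikidata-benchmark-1k.yaml;
# field names are illustrative, not the file's actual schema.
name: wikidata-benchmark-1k
domains:
  biography_living:
    count: 150
    articles:          # English Wikipedia titles
      - Tim Berners-Lee
      - Malala Yousafzai
  geographic:
    count: 150
    articles:
      - Danube
      - Mount Kilimanjaro
```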
## Pipeline Stages

### 1. Article Fetch (`WikipediaClient`)
Each article is fetched via the MediaWiki REST API and converted to Markdown.
A local hybrid cache (`.riverbank/article_cache/`) avoids redundant network
calls:
- Cache hit — metadata TTL checked (default 30 days); served from disk
- Cache miss — fetched fresh, cached to disk for future runs
- `--no-cache` — bypass the local cache entirely (force fresh)
- `--cache-only` — raise `CacheOnlyError` if the article is not in the cache
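A minimal sketch of this cache decision flow, assuming a file-per-article JSON layout (illustrative only, not the actual `WikipediaClient` code):

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path(".riverbank/article_cache")
METADATA_TTL = 30 * 24 * 3600  # default metadata TTL: 30 days, in seconds

class CacheOnlyError(RuntimeError):
    """Raised under --cache-only when the article is not cached."""

def fetch_and_convert(title: str) -> str:
    # Placeholder for the MediaWiki REST call + Markdown conversion.
    raise NotImplementedError

def get_article(title: str, no_cache: bool = False, cache_only: bool = False) -> str:
    path = CACHE_DIR / f"{title}.json"
    if not no_cache and path.exists():
        entry = json.loads(path.read_text())
        if time.time() - entry["fetched_at"] < METADATA_TTL:
            return entry["markdown"]  # cache hit: served from disk
    if cache_only:
        raise CacheOnlyError(title)
    markdown = fetch_and_convert(title)  # cache miss: fetch fresh
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"fetched_at": time.time(), "markdown": markdown}))
    return markdown
```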
### 2. Ground-Truth Fetch (`WikidataClient`)
The Wikidata SPARQL endpoint (WDQS) is queried for all statements of the corresponding Wikidata item, identified via sitelink from the Wikipedia title.
Exclusion filters — statements are excluded if their value type is one of:
| Excluded type | Reason |
|---|---|
| `ExternalId` | Database identifiers (ISNI, VIAF, etc.) |
| `CommonsMedia` | Image filenames |
| `Url` | Website URLs |
| `GeoShape`, `TabularData` | Complex geodata blobs |
| `Math` | Mathematical formulae |
This focuses evaluation on factual, extractable content.
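For illustration, a self-contained query against WDQS that follows this shape; the exact SPARQL riverbank's `WikidataClient` issues may differ:

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Resolves the Wikidata item via its English Wikipedia sitelink, then pulls
# all truthy statements whose property type is not in the exclusion list.
QUERY_TEMPLATE = """
SELECT ?prop ?value WHERE {{
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name "{title}"@en .
  ?item ?claim ?value .
  ?prop wikibase:directClaim ?claim ;
        wikibase:propertyType ?type .
  FILTER(?type NOT IN (wikibase:ExternalId, wikibase:CommonsMedia,
                       wikibase:Url, wikibase:GeoShape,
                       wikibase:TabularData, wikibase:Math))
}}
"""

def fetch_ground_truth(title: str) -> list[dict]:
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": QUERY_TEMPLATE.format(title=title), "format": "json"},
        headers={"User-Agent": "riverbank-eval-docs-example/0.1"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```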
### 3. Property Alignment (`PropertyAlignmentTable`)
riverbank predicates are matched to Wikidata P-ids via the alignment table
defined in `property-alignment-v1.yaml` and implemented in
`src/riverbank/eval/property_alignment.py`.
The table currently covers 50+ properties, including:
| P-id | Label | riverbank predicates |
|---|---|---|
| P31 | instance of | rdf:type, pgc:isA |
| P106 | occupation | pgc:hasOccupation |
| P569 | date of birth | pgc:birthDate |
| P27 | country of citizenship | pgc:nationality, ex:citizenship |
| P159 | headquarters location | pgc:headquartersLocation |
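For illustration, an alignment entry in `property-alignment-v1.yaml` might look like the following (the schema shown is an assumption, not the file's actual format; the P-ids and predicates are taken from the table above):

```yaml
# Hypothetical excerpt; schema is illustrative, not the actual format.
properties:
  - pid: P569
    label: date of birth
    predicates: [pgc:birthDate]
  - pid: P27
    label: country of citizenship
    predicates: [pgc:nationality, ex:citizenship]
```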
### 4. Entity Resolution (`EntityResolver`)
riverbank IRIs are linked to Wikidata Q-ids through a three-stage pipeline:
- Sitelink match — if the IRI label matches the article title, use the article's Q-id directly (confidence 1.0)
- Label match — extract a human-readable label from the IRI; fuzzy-match against Wikidata entity labels and aliases
- Context disambiguation — when multiple candidates have similar scores, filter by P31 (instance of) type using domain hints
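A condensed sketch of that cascade (the `Candidate` shape, the 0.05 closeness threshold, and the use of `rapidfuzz` here are illustrative assumptions, not the actual `EntityResolver` internals):

```python
from dataclasses import dataclass
from rapidfuzz import fuzz

@dataclass
class Candidate:
    qid: str
    label: str
    instance_of: frozenset[str]  # Q-ids of the candidate's P31 values

def resolve(
    iri_label: str,
    article_title: str,
    article_qid: str,
    candidates: list[Candidate],
    domain_types: frozenset[str],
) -> tuple[str, float] | None:
    # Stage 1: sitelink match (label equals the article title).
    if iri_label.casefold() == article_title.casefold():
        return article_qid, 1.0

    # Stage 2: fuzzy label match against candidate labels (aliases omitted here).
    scored = sorted(
        ((c, fuzz.token_sort_ratio(iri_label, c.label) / 100.0) for c in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if not scored:
        return None
    best, best_score = scored[0]

    # Stage 3: context disambiguation. When candidates score close together,
    # prefer those whose P31 types intersect the domain hints.
    near = [c for c, s in scored if best_score - s <= 0.05]
    if len(near) > 1:
        typed = [c for c in near if c.instance_of & domain_types]
        if typed:
            best = typed[0]
    return best.qid, best_score
```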
### 5. Scoring (`Scorer`)
Each riverbank triple (subject, predicate, object, confidence) is classified:
| Match type | Meaning | Counted as |
|---|---|---|
| `exact` | Predicate aligned and object matches | True positive (TP) |
| `partial` | Predicate aligned but object doesn't match | False positive (FP) |
| `no_match` | Predicate not in alignment table | Novel discovery candidate |
Object matching uses:
- Exact string normalisation (lowercase, punctuation removed)
- Year extraction from ISO 8601 dates (year-level match → 0.95 score)
- Fuzzy string similarity via `rapidfuzz` (or `difflib` fallback)
- Q-id label lookup for Wikidata items
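A minimal sketch of the first three heuristics (the regexes and normalisation rules are illustrative, and Q-id label lookup is omitted; see `Scorer` for the real logic):

```python
import re
from rapidfuzz import fuzz  # difflib.SequenceMatcher is the stated fallback

ISO_DATE = re.compile(r"^(\d{4})-\d{2}-\d{2}")

def normalise(value: str) -> str:
    # Exact-match normalisation: lowercase, strip punctuation.
    return re.sub(r"[^\w\s]", "", value.casefold()).strip()

def object_score(predicted: str, gold: str) -> float:
    """Score agreement between a predicted object and a ground-truth value."""
    if normalise(predicted) == normalise(gold):
        return 1.0
    # Year extraction from ISO 8601 dates: a year-level match scores 0.95.
    p, g = ISO_DATE.match(predicted), ISO_DATE.match(gold)
    if p and g and p.group(1) == g.group(1):
        return 0.95
    # Otherwise fall back to fuzzy string similarity.
    return fuzz.token_sort_ratio(predicted, gold) / 100.0
```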
Precision, recall, and F1 are computed from true positives (TP), false positives (FP), and false negatives (FN, aligned Wikidata statements that no riverbank triple matched):
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
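As a trivial rendering of these formulas in code (the function name is ours, not riverbank's API):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute the three metrics, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

With precision 0.87 and recall 0.62 (the v0.15.0 baseline below), F1 works out to roughly 0.72, matching the reported value.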
### 6. Confidence Calibration
Each triple's confidence score is bucketed into four ranges
(0.0–0.25, 0.25–0.5, 0.5–0.75, 0.75–1.0) and observed accuracy
within each bucket is measured. Calibration quality is reported as
Pearson ρ between bucket midpoints and observed accuracy.
A well-calibrated model should produce ρ ≥ 0.80: higher-confidence triples should be more accurate.
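A minimal sketch of this bucketing, assuming per-triple `confidences` and a boolean `correct` array of the same length (illustrative, not the actual evaluation code):

```python
import numpy as np

def calibration_rho(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Pearson correlation between bucket midpoints and observed accuracy."""
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]
    midpoints, accuracies = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if hi == 1.0:  # include confidence 1.0 in the top bucket
            mask = (confidences >= lo) & (confidences <= hi)
        if mask.any():  # skip empty buckets
            midpoints.append((lo + hi) / 2.0)
            accuracies.append(float(correct[mask].mean()))
    return float(np.corrcoef(midpoints, accuracies)[0, 1])
```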
## Metrics
| Metric | Target | Description |
|---|---|---|
| Precision | ≥ 0.85 | Fraction of riverbank triples that match a Wikidata statement |
| Recall | ≥ 0.60 | Fraction of Wikidata statements captured by riverbank |
| F1 | ≥ 0.70 | Harmonic mean of precision and recall |
| Calibration ρ | ≥ 0.80 | Pearson correlation of confidence vs. observed accuracy |
| Novel discovery rate | — | Fraction of unmatched triples that are factually correct |
The novel discovery rate (NDR) is validated by manual annotation:
unmatched triples are sampled (10% by default) and classified as
`correct`, `incorrect`, `uncertain`, or `in_wikidata` (alignment gap).
$$\text{NDR} = \frac{|\text{correct}|}{|\text{correct}| + |\text{incorrect}|}$$
## v0.15.0 Baseline Results
The first evaluation run over all 1,000 articles established the baseline:
| Metric | Value |
|---|---|
| Precision | 0.87 |
| Recall | 0.62 |
| F1 | 0.72 |
| Calibration ρ | 0.83 |
| Novel discovery rate | 0.78 |
All four exit criteria from the v0.15.0 roadmap were met.
## v0.15.1 Improvement Loop
v0.15.1 closes the feedback loop from the evaluation back into the extraction pipeline.
### Per-Property Recall Gap Analysis
The `RecallGapAnalyzer` class (in `src/riverbank/eval/recall_gap.py`) identifies
Wikidata properties where recall falls below a configurable threshold (default
0.50) and generates targeted extraction examples for each gap property.
Run from the CLI:
```bash
riverbank recall-gap-analysis --results eval/results/latest.json \
    --threshold 0.50 \
    --output eval/results/recall-gaps.json
```
### Extraction Prompt Tuning
The `PromptTuner` class (in `src/riverbank/eval/prompt_tuning.py`) analyses
false-positive and false-negative patterns from the evaluation report and
generates concrete prompt patches — additional few-shot examples and system
instructions — to improve precision and recall.
Run from the CLI:
```bash
riverbank tune-extraction-prompts --results eval/results/latest.json \
    --output eval/results/tuning-report.json
```
### Novel Discovery Annotations
212 unmatched riverbank triples from the v0.15.0 run were manually annotated
and stored in `eval/novel-discovery-annotations.yaml`. The validated NDR is
0.779 (134 correct out of 172 judged):
| Verdict | Count |
|---|---|
| Correct | 134 |
| Incorrect | 38 |
| Uncertain | 24 |
| In Wikidata (alignment gap) | 16 |
Alignment gap discoveries (16 triples) directly informed property table extensions in v0.15.1.
## Running an Evaluation
### Single article
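The single-article invocation is not shown in this extract. A plausible form, assuming `evaluate-wikidata` accepts an `--article` flag (hypothetical; consult `riverbank evaluate-wikidata --help`):

```bash
# --article is an assumed flag; this page only documents the dataset form
riverbank evaluate-wikidata \
    --article "Ada Lovelace" \
    --profile wikidata-eval-v1 \
    --output eval/results/single.json
```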
### Full benchmark dataset
```bash
riverbank evaluate-wikidata \
    --dataset eval/wikidata-benchmark-1k.yaml \
    --profile wikidata-eval-v1 \
    --output eval/results/run-$(date +%Y%m%d).json
```
### Recall gap analysis
```bash
riverbank recall-gap-analysis \
    --results eval/results/latest.json \
    --threshold 0.50 \
    --output eval/results/recall-gaps.json
```
### Prompt tuning report
```bash
riverbank tune-extraction-prompts \
    --results eval/results/latest.json \
    --output eval/results/tuning-report.json
```
## Reproducibility
All evaluation runs are fully reproducible:
- The benchmark dataset YAML is committed to the repository
- The property alignment table is committed (`property-alignment-v1.yaml`)
- The evaluation profile YAML is committed (`examples/profiles/wikidata-eval-v1.yaml`)
- The Wikipedia article cache is local and persisted across runs
- LLM calls use temperature 0 by default when scoring
Wikidata statements may change over time. Evaluation runs should record the run date; results older than 90 days should be re-run against fresh Wikidata data.
## Limitations
- Novel discoveries require manual annotation; the automated NDR estimate uses heuristics
- Entity resolution for ambiguous IRIs falls back to label matching, which can introduce noise
- Object matching for quantities, coordinates, and dates uses heuristics; unit normalisation is not exhaustive
- The benchmark covers English Wikipedia only
- Wikidata completeness varies by domain; recall is penalised for statements that Wikidata itself is missing