Confidence consolidation (noisy-OR) with source diversity scoring (v0.12.1).
When the same triple (s, p, o) is extracted from multiple fragments, this
module consolidates the per-extraction confidence values into a single final
confidence using the noisy-OR formula::

    c_final = 1 − ∏ᵢ (1 − cᵢ)
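As a quick arithmetic check of the formula, two moderately confident extractions reinforce each other. This is a minimal helper for illustration, not the module's internal implementation::

```python
from functools import reduce

def noisy_or(confidences):
    """1 - prod(1 - c_i): the probability that at least one extraction
    is correct, treating extractions as independent evidence."""
    return 1.0 - reduce(lambda acc, c: acc * (1.0 - c), confidences, 1.0)

# Two extractions at 0.6 and 0.7 combine to 1 - 0.4 * 0.3 = 0.88,
# higher than either alone.
combined = noisy_or([0.6, 0.7])
```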
Source diversity scoring: corroboration from multiple fragments of the
same document counts as a single "vote" rather than as independent
evidence. This stops correlated hallucinations, such as those produced by
templated or copied documents, from pushing a triple over the trusted
threshold. Evidence spans from every provenance are accumulated per triple
so that each contributing source remains traceable.
Usage::

    from riverbank.postprocessors.consolidate import NoisyORConsolidator

    consolidator = NoisyORConsolidator(trusted_threshold=0.75)
    results = consolidator.consolidate(triples_with_fragments)
    promoted, remaining = consolidator.split_by_threshold(results)
Data flow (called from ``riverbank promote-tentative``)::

    1. Query the tentative graph for all triples
    2. Group by normalised (subject, predicate, object_value) key
    3. Compute noisy-OR confidence with source-diversity weighting
    4. Return ConsolidatedTriple list sorted by final confidence descending
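Step 2's grouping key comes from a private helper that is not reproduced on this page. The sketch below shows a plausible normalisation (case-folding plus whitespace collapsing); the module's exact rules may differ::

```python
from types import SimpleNamespace

def normalise_key(t):
    """Hypothetical stand-in for the module's private _normalise_key:
    case-fold and collapse whitespace so near-duplicate surface forms
    group under one key, while canonical forms are chosen elsewhere."""
    def norm(s):
        return " ".join(s.split()).casefold()
    return (norm(t.subject), norm(t.predicate), norm(t.object_value))

t = SimpleNamespace(subject="Alan  Turing", predicate="bornIn",
                    object_value="London")
key = normalise_key(t)  # ('alan turing', 'bornin', 'london')
```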
ConsolidatedTriple

dataclass

A triple whose confidence has been consolidated across multiple extractions.

Attributes

subject, predicate, object_value:
    The canonical triple components (as extracted — not normalised).
final_confidence:
    Noisy-OR confidence after source-diversity weighting.
raw_confidences:
    List of per-extraction confidence values before consolidation.
provenance:
    List of :class:`ProvenanceRecord` for every contributing extraction.
source_diversity:
    Number of distinct source documents that produced this triple.
    Fragments within the same document are de-duplicated before
    applying noisy-OR (one vote per document).
Source code in src/riverbank/postprocessors/consolidate.py::

    @dataclass
    class ConsolidatedTriple:
        """A triple whose confidence has been consolidated across multiple extractions.

        Attributes
        ----------
        subject, predicate, object_value:
            The canonical triple components (as extracted — not normalised).
        final_confidence:
            Noisy-OR confidence after source-diversity weighting.
        raw_confidences:
            List of per-extraction confidence values before consolidation.
        provenance:
            List of :class:`ProvenanceRecord` for every contributing extraction.
        source_diversity:
            Number of *distinct* source documents that produced this triple.
            Fragments within the same document are de-duplicated before
            applying noisy-OR (one vote per document).
        """

        subject: str
        predicate: str
        object_value: str
        final_confidence: float
        raw_confidences: list[float] = field(default_factory=list)
        provenance: list[ProvenanceRecord] = field(default_factory=list)
        source_diversity: int = 1
NoisyORConsolidator

Consolidate per-fragment confidence values using noisy-OR.

Parameters

trusted_threshold:
    Confidence at or above which a consolidated triple is considered
    trusted. Default 0.75 matches the per-triple routing threshold from
    v0.12.0.
Source code in src/riverbank/postprocessors/consolidate.py::

    class NoisyORConsolidator:
        """Consolidate per-fragment confidence values using noisy-OR.

        Parameters
        ----------
        trusted_threshold:
            Confidence at or above which a consolidated triple is considered
            trusted. Default 0.75 matches the per-triple routing threshold from
            v0.12.0.
        """

        def __init__(self, trusted_threshold: float = 0.75) -> None:
            self.trusted_threshold = trusted_threshold

        # ------------------------------------------------------------------
        # Public API
        # ------------------------------------------------------------------

        def consolidate(
            self,
            triples: Sequence[Any],
        ) -> list[ConsolidatedTriple]:
            """Consolidate a flat list of (possibly duplicate) extracted triples.

            Each element of *triples* must expose the attributes:
            ``subject``, ``predicate``, ``object_value``, ``confidence``,
            ``evidence`` (with ``.source_iri``, ``.excerpt``), and optionally
            ``fragment_key``.

            Returns a deduplicated list of :class:`ConsolidatedTriple` sorted by
            ``final_confidence`` descending.
            """
            # Group raw extractions by normalised triple key
            groups: dict[TripleKey, list[Any]] = {}
            for t in triples:
                key = _normalise_key(t)
                groups.setdefault(key, []).append(t)

            results: list[ConsolidatedTriple] = []
            for key, group in groups.items():
                subj, pred, obj = key
                # Pick canonical (non-normalised) values from the
                # highest-confidence instance
                best = max(group, key=lambda t: float(getattr(t, "confidence", 0.0)))
                canon_subj = getattr(best, "subject", subj)
                canon_pred = getattr(best, "predicate", pred)
                canon_obj = getattr(best, "object_value", obj)

                prov_records, raw_confs, source_diversity = _build_provenance(group)
                final_conf = _noisy_or_with_diversity(group)

                results.append(
                    ConsolidatedTriple(
                        subject=canon_subj,
                        predicate=canon_pred,
                        object_value=canon_obj,
                        final_confidence=round(final_conf, 6),
                        raw_confidences=raw_confs,
                        provenance=prov_records,
                        source_diversity=source_diversity,
                    )
                )

            results.sort(key=lambda ct: ct.final_confidence, reverse=True)
            return results

        def split_by_threshold(
            self,
            consolidated: Sequence[ConsolidatedTriple],
        ) -> tuple[list[ConsolidatedTriple], list[ConsolidatedTriple]]:
            """Split consolidated triples into (above_threshold, below_threshold).

            Returns ``(trusted_candidates, remaining)`` where ``trusted_candidates``
            are those whose ``final_confidence >= trusted_threshold``.
            """
            trusted: list[ConsolidatedTriple] = []
            remaining: list[ConsolidatedTriple] = []
            for ct in consolidated:
                if ct.final_confidence >= self.trusted_threshold:
                    trusted.append(ct)
                else:
                    remaining.append(ct)
            return trusted, remaining
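``consolidate`` is duck-typed, so any object exposing the documented attributes will do. Below is a sketch of building such inputs with ``types.SimpleNamespace``; the pipeline's real extraction class is not shown in this module, so the shape here is an assumption drawn from the docstring::

```python
from types import SimpleNamespace

def make_extraction(subject, predicate, object_value, confidence,
                    source_iri, excerpt="", fragment_key=""):
    """Build a duck-typed extraction exposing the attributes that
    NoisyORConsolidator.consolidate() documents: subject, predicate,
    object_value, confidence, evidence (.source_iri, .excerpt),
    and an optional fragment_key."""
    return SimpleNamespace(
        subject=subject,
        predicate=predicate,
        object_value=object_value,
        confidence=confidence,
        evidence=SimpleNamespace(source_iri=source_iri, excerpt=excerpt),
        fragment_key=fragment_key,
    )

triples = [
    make_extraction("Alan Turing", "bornIn", "London", 0.7, "doc:A",
                    excerpt="Turing was born in London"),
    make_extraction("alan turing", "bornIn", "London", 0.6, "doc:B"),
]
# These two would group under one normalised key and be passed to
# NoisyORConsolidator(trusted_threshold=0.75).consolidate(triples).
```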
consolidate(triples)

Consolidate a flat list of (possibly duplicate) extracted triples.

Each element of *triples* must expose the attributes: ``subject``,
``predicate``, ``object_value``, ``confidence``, ``evidence`` (with
``.source_iri``, ``.excerpt``), and optionally ``fragment_key``.

Returns a deduplicated list of :class:`ConsolidatedTriple` sorted by
``final_confidence`` descending.
Source code in src/riverbank/postprocessors/consolidate.py::

    def consolidate(
        self,
        triples: Sequence[Any],
    ) -> list[ConsolidatedTriple]:
        """Consolidate a flat list of (possibly duplicate) extracted triples.

        Each element of *triples* must expose the attributes:
        ``subject``, ``predicate``, ``object_value``, ``confidence``,
        ``evidence`` (with ``.source_iri``, ``.excerpt``), and optionally
        ``fragment_key``.

        Returns a deduplicated list of :class:`ConsolidatedTriple` sorted by
        ``final_confidence`` descending.
        """
        # Group raw extractions by normalised triple key
        groups: dict[TripleKey, list[Any]] = {}
        for t in triples:
            key = _normalise_key(t)
            groups.setdefault(key, []).append(t)

        results: list[ConsolidatedTriple] = []
        for key, group in groups.items():
            subj, pred, obj = key
            # Pick canonical (non-normalised) values from the
            # highest-confidence instance
            best = max(group, key=lambda t: float(getattr(t, "confidence", 0.0)))
            canon_subj = getattr(best, "subject", subj)
            canon_pred = getattr(best, "predicate", pred)
            canon_obj = getattr(best, "object_value", obj)

            prov_records, raw_confs, source_diversity = _build_provenance(group)
            final_conf = _noisy_or_with_diversity(group)

            results.append(
                ConsolidatedTriple(
                    subject=canon_subj,
                    predicate=canon_pred,
                    object_value=canon_obj,
                    final_confidence=round(final_conf, 6),
                    raw_confidences=raw_confs,
                    provenance=prov_records,
                    source_diversity=source_diversity,
                )
            )

        results.sort(key=lambda ct: ct.final_confidence, reverse=True)
        return results
split_by_threshold(consolidated)

Split consolidated triples into (above_threshold, below_threshold).

Returns ``(trusted_candidates, remaining)`` where ``trusted_candidates``
are those whose ``final_confidence >= trusted_threshold``.
Source code in src/riverbank/postprocessors/consolidate.py::

    def split_by_threshold(
        self,
        consolidated: Sequence[ConsolidatedTriple],
    ) -> tuple[list[ConsolidatedTriple], list[ConsolidatedTriple]]:
        """Split consolidated triples into (above_threshold, below_threshold).

        Returns ``(trusted_candidates, remaining)`` where ``trusted_candidates``
        are those whose ``final_confidence >= trusted_threshold``.
        """
        trusted: list[ConsolidatedTriple] = []
        remaining: list[ConsolidatedTriple] = []
        for ct in consolidated:
            if ct.final_confidence >= self.trusted_threshold:
                trusted.append(ct)
            else:
                remaining.append(ct)
        return trusted, remaining
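The split is a plain partition at the threshold, with the boundary value promoted (``>=``). A standalone sketch over simple ``(label, confidence)`` pairs, independent of the dataclasses above::

```python
def split_by_threshold(items, threshold=0.75):
    """Partition (label, confidence) pairs into (trusted, remaining),
    mirroring the method above: the boundary value is promoted."""
    trusted = [item for item in items if item[1] >= threshold]
    remaining = [item for item in items if item[1] < threshold]
    return trusted, remaining

trusted, remaining = split_by_threshold(
    [("t1", 0.9), ("t2", 0.75), ("t3", 0.4)]
)
# trusted holds t1 and t2 (0.75 meets the default threshold);
# remaining holds only t3.
```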
ProvenanceRecord
dataclass
One extraction event contributing to a consolidated triple.
Source code in src/riverbank/postprocessors/consolidate.py::

    @dataclass
    class ProvenanceRecord:
        """One extraction event contributing to a consolidated triple."""

        source_iri: str    # document IRI
        fragment_key: str  # heading-path fragment key
        confidence: float  # per-extraction confidence
        excerpt: str = ""  # verbatim evidence excerpt
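Constructing a record for one extraction event. The dataclass is redefined here so the snippet runs standalone, and the IRI and fragment-key values are invented for illustration::

```python
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    """Standalone copy of the definition above, for a runnable example."""
    source_iri: str    # document IRI
    fragment_key: str  # heading-path fragment key
    confidence: float  # per-extraction confidence
    excerpt: str = ""  # verbatim evidence excerpt

rec = ProvenanceRecord(
    source_iri="doc:A",
    fragment_key="biography/early-life",
    confidence=0.7,
    excerpt="Turing was born in London",
)
# One such record is appended to ConsolidatedTriple.provenance for
# every contributing extraction, so each source stays traceable.
```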