consolidate

Confidence consolidation (noisy-OR) with source diversity scoring (v0.12.1).

When the same triple (s, p, o) is extracted from multiple fragments, this module consolidates the per-extraction confidence values into a single final confidence using the noisy-OR formula:

c_final = 1 − ∏ᵢ (1 − cᵢ)

Source diversity scoring: Corroboration from multiple fragments of the same document counts as a single "vote" rather than independent evidence. This prevents correlated hallucinations from templated or copied documents from crossing the trusted threshold. Multi-provenance evidence spans are accumulated per triple so that every source remains traceable.
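The diversity-weighted noisy-OR can be sketched as a standalone function. Note that the module's actual helper (`_noisy_or_with_diversity`) is internal and its exact de-duplication rule is not documented; keeping the strongest fragment per document is an assumption here:

```python
def noisy_or_with_diversity(confidences_by_doc: dict[str, list[float]]) -> float:
    """Noisy-OR with one vote per source document.

    confidences_by_doc maps a document IRI to the per-fragment confidences
    extracted from it. Only the strongest fragment per document contributes
    (an assumption -- the docs only specify "one vote per document").
    """
    survival = 1.0
    for confs in confidences_by_doc.values():
        survival *= 1.0 - max(confs)  # one vote per document
    return 1.0 - survival
```

Two independent documents at 0.6 combine to 0.84, while two fragments of the same document stay at 0.6 and never cross the 0.75 trusted threshold on their own.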

Usage:

from riverbank.postprocessors.consolidate import NoisyORConsolidator

consolidator = NoisyORConsolidator(trusted_threshold=0.75)
results = consolidator.consolidate(triples_with_fragments)
promoted, remaining = consolidator.split_by_threshold(results)

Data flow (called from riverbank promote-tentative):

1. Query the tentative graph for all triples
2. Group by normalised (subject, predicate, object_value) key
3. Compute noisy-OR confidence with source-diversity weighting
4. Return ConsolidatedTriple list sorted by final confidence descending
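The normalisation in step 2 is performed by an internal helper (`_normalise_key`, not shown on this page). A plausible sketch, assuming it amounts to whitespace stripping plus case-folding:

```python
from types import SimpleNamespace

def normalise_key(triple) -> tuple[str, str, str]:
    """Hypothetical sketch of the internal _normalise_key helper.

    Assumes grouping is by stripped, case-folded components; the real
    implementation may normalise differently.
    """
    return (
        triple.subject.strip().casefold(),
        triple.predicate.strip().casefold(),
        triple.object_value.strip().casefold(),
    )

# Two surface forms of the same triple collapse to one grouping key:
a = SimpleNamespace(subject="  Alice ", predicate="Knows", object_value="BOB")
b = SimpleNamespace(subject="alice", predicate="knows", object_value="bob")
```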

ConsolidatedTriple dataclass

A triple whose confidence has been consolidated across multiple extractions.

Attributes

subject, predicate, object_value: The canonical triple components (as extracted, not normalised).

final_confidence: Noisy-OR confidence after source-diversity weighting.

raw_confidences: List of per-extraction confidence values before consolidation.

provenance: List of ProvenanceRecord for every contributing extraction.

source_diversity: Number of distinct source documents that produced this triple. Fragments within the same document are de-duplicated before applying noisy-OR (one vote per document).

Source code in src/riverbank/postprocessors/consolidate.py
@dataclass
class ConsolidatedTriple:
    """A triple whose confidence has been consolidated across multiple extractions.

    Attributes
    ----------
    subject, predicate, object_value:
        The canonical triple components (as extracted — not normalised).
    final_confidence:
        Noisy-OR confidence after source-diversity weighting.
    raw_confidences:
        List of per-extraction confidence values before consolidation.
    provenance:
        List of :class:`ProvenanceRecord` for every contributing extraction.
    source_diversity:
        Number of *distinct* source documents that produced this triple.
        Fragments within the same document are de-duplicated before
        applying noisy-OR (one vote per document).
    """

    subject: str
    predicate: str
    object_value: str
    final_confidence: float
    raw_confidences: list[float] = field(default_factory=list)
    provenance: list[ProvenanceRecord] = field(default_factory=list)
    source_diversity: int = 1

NoisyORConsolidator

Consolidate per-fragment confidence values using noisy-OR.

Parameters

trusted_threshold: Confidence at or above which a consolidated triple is considered trusted. Default 0.75 matches the per-triple routing threshold from v0.12.0.

Source code in src/riverbank/postprocessors/consolidate.py
class NoisyORConsolidator:
    """Consolidate per-fragment confidence values using noisy-OR.

    Parameters
    ----------
    trusted_threshold:
        Confidence at or above which a consolidated triple is considered
        trusted.  Default 0.75 matches the per-triple routing threshold from
        v0.12.0.
    """

    def __init__(self, trusted_threshold: float = 0.75) -> None:
        self.trusted_threshold = trusted_threshold

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def consolidate(
        self,
        triples: Sequence[Any],
    ) -> list[ConsolidatedTriple]:
        """Consolidate a flat list of (possibly duplicate) extracted triples.

        Each element of *triples* must expose the attributes:
        ``subject``, ``predicate``, ``object_value``, ``confidence``,
        ``evidence`` (with ``.source_iri``, ``.excerpt``), and optionally
        ``fragment_key``.

        Returns a deduplicated list of :class:`ConsolidatedTriple` sorted by
        ``final_confidence`` descending.
        """
        # Group raw extractions by normalised triple key
        groups: dict[TripleKey, list[Any]] = {}
        for t in triples:
            key = _normalise_key(t)
            groups.setdefault(key, []).append(t)

        results: list[ConsolidatedTriple] = []
        for key, group in groups.items():
            subj, pred, obj = key
            # Pick canonical (non-normalised) values from the highest-confidence instance
            best = max(group, key=lambda t: float(getattr(t, "confidence", 0.0)))
            canon_subj = getattr(best, "subject", subj)
            canon_pred = getattr(best, "predicate", pred)
            canon_obj = getattr(best, "object_value", obj)

            prov_records, raw_confs, source_diversity = _build_provenance(group)
            final_conf = _noisy_or_with_diversity(group)

            results.append(
                ConsolidatedTriple(
                    subject=canon_subj,
                    predicate=canon_pred,
                    object_value=canon_obj,
                    final_confidence=round(final_conf, 6),
                    raw_confidences=raw_confs,
                    provenance=prov_records,
                    source_diversity=source_diversity,
                )
            )

        results.sort(key=lambda ct: ct.final_confidence, reverse=True)
        return results

    def split_by_threshold(
        self,
        consolidated: Sequence[ConsolidatedTriple],
    ) -> tuple[list[ConsolidatedTriple], list[ConsolidatedTriple]]:
        """Split consolidated triples into (above_threshold, below_threshold).

        Returns ``(trusted_candidates, remaining)`` where ``trusted_candidates``
        are those whose ``final_confidence >= trusted_threshold``.
        """
        trusted: list[ConsolidatedTriple] = []
        remaining: list[ConsolidatedTriple] = []
        for ct in consolidated:
            if ct.final_confidence >= self.trusted_threshold:
                trusted.append(ct)
            else:
                remaining.append(ct)
        return trusted, remaining

consolidate(triples)

Consolidate a flat list of (possibly duplicate) extracted triples.

Each element of triples must expose the attributes: subject, predicate, object_value, confidence, evidence (with .source_iri, .excerpt), and optionally fragment_key.

Returns a deduplicated list of ConsolidatedTriple sorted by final_confidence descending.

Source code in src/riverbank/postprocessors/consolidate.py
def consolidate(
    self,
    triples: Sequence[Any],
) -> list[ConsolidatedTriple]:
    """Consolidate a flat list of (possibly duplicate) extracted triples.

    Each element of *triples* must expose the attributes:
    ``subject``, ``predicate``, ``object_value``, ``confidence``,
    ``evidence`` (with ``.source_iri``, ``.excerpt``), and optionally
    ``fragment_key``.

    Returns a deduplicated list of :class:`ConsolidatedTriple` sorted by
    ``final_confidence`` descending.
    """
    # Group raw extractions by normalised triple key
    groups: dict[TripleKey, list[Any]] = {}
    for t in triples:
        key = _normalise_key(t)
        groups.setdefault(key, []).append(t)

    results: list[ConsolidatedTriple] = []
    for key, group in groups.items():
        subj, pred, obj = key
        # Pick canonical (non-normalised) values from the highest-confidence instance
        best = max(group, key=lambda t: float(getattr(t, "confidence", 0.0)))
        canon_subj = getattr(best, "subject", subj)
        canon_pred = getattr(best, "predicate", pred)
        canon_obj = getattr(best, "object_value", obj)

        prov_records, raw_confs, source_diversity = _build_provenance(group)
        final_conf = _noisy_or_with_diversity(group)

        results.append(
            ConsolidatedTriple(
                subject=canon_subj,
                predicate=canon_pred,
                object_value=canon_obj,
                final_confidence=round(final_conf, 6),
                raw_confidences=raw_confs,
                provenance=prov_records,
                source_diversity=source_diversity,
            )
        )

    results.sort(key=lambda ct: ct.final_confidence, reverse=True)
    return results

split_by_threshold(consolidated)

Split consolidated triples into (above_threshold, below_threshold).

Returns (trusted_candidates, remaining) where trusted_candidates are those whose final_confidence >= trusted_threshold.

Source code in src/riverbank/postprocessors/consolidate.py
def split_by_threshold(
    self,
    consolidated: Sequence[ConsolidatedTriple],
) -> tuple[list[ConsolidatedTriple], list[ConsolidatedTriple]]:
    """Split consolidated triples into (above_threshold, below_threshold).

    Returns ``(trusted_candidates, remaining)`` where ``trusted_candidates``
    are those whose ``final_confidence >= trusted_threshold``.
    """
    trusted: list[ConsolidatedTriple] = []
    remaining: list[ConsolidatedTriple] = []
    for ct in consolidated:
        if ct.final_confidence >= self.trusted_threshold:
            trusted.append(ct)
        else:
            remaining.append(ct)
    return trusted, remaining
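Putting the pieces together, the documented flow (group by normalised key, diversity-weighted noisy-OR, threshold split) can be illustrated with a self-contained sketch. The `Extraction` stand-in, the per-document-maximum rule, and `demo_consolidate` are assumptions for illustration, not the module's exact internals:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """Minimal stand-in for the duck-typed triples consolidate() accepts."""
    subject: str
    predicate: str
    object_value: str
    confidence: float
    source_iri: str

def demo_consolidate(extractions, trusted_threshold=0.75):
    # 1) Group by a normalised (subject, predicate, object_value) key.
    groups: dict[tuple[str, str, str], list[Extraction]] = {}
    for e in extractions:
        key = (e.subject.casefold(), e.predicate.casefold(), e.object_value.casefold())
        groups.setdefault(key, []).append(e)

    trusted, remaining = [], []
    for key, group in groups.items():
        # 2) One vote per document: keep the strongest fragment per source_iri.
        per_doc: dict[str, float] = {}
        for e in group:
            per_doc[e.source_iri] = max(per_doc.get(e.source_iri, 0.0), e.confidence)
        # 3) Noisy-OR over the per-document votes.
        survival = 1.0
        for c in per_doc.values():
            survival *= 1.0 - c
        final = round(1.0 - survival, 6)
        # 4) Threshold split, mirroring split_by_threshold.
        (trusted if final >= trusted_threshold else remaining).append((key, final))
    return trusted, remaining
```

Here two independent documents at 0.6 corroborate to 0.84 and cross the 0.75 threshold, while two fragments of the same document contribute a single vote and stay at 0.6.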

ProvenanceRecord dataclass

One extraction event contributing to a consolidated triple.

Source code in src/riverbank/postprocessors/consolidate.py
@dataclass
class ProvenanceRecord:
    """One extraction event contributing to a consolidated triple."""

    source_iri: str          # document IRI
    fragment_key: str        # heading-path fragment key
    confidence: float        # per-extraction confidence
    excerpt: str = ""        # verbatim evidence excerpt