Pipeline stages¶

The riverbank compilation pipeline transforms raw documents into governed knowledge through a sequence of well-defined stages.

flowchart TD
    A[Source discovery] --> B[Parsing]
    B --> C[Distillation<br>optional]
    C --> D[Fragmentation]
    D --> E[Editorial policy gate]
    E --> F[Hash deduplication]
    F --> G[Vocabulary pass<br>optional]
    G --> H[LLM extraction]
    H --> I[SHACL validation]
    I --> J[Graph write]
    J --> K[Artifact dependency<br>registration]

1. Source discovery¶

The configured connector discovers documents. The filesystem connector walks a directory tree; custom connectors can pull from APIs, S3, or message queues.

Each discovered file is registered as a pgc:Source in _riverbank.sources with an IRI, content hash, and optional tenant ID.

2. Parsing¶

The parser converts the raw format into a normalized text representation with heading positions. Parsers are pluggable:

markdown — uses markdown-it-py, preserves heading structure
docling — handles PDF, DOCX, HTML via the Docling library

3. Distillation (optional)¶

When distillation.enabled: true in the profile, the distillation step runs immediately after parsing and before fragmentation. It selects and compresses extractable content, reducing the token cost of all downstream stages.

Distillation is a content selection problem, not raw compression: the step identifies provably non-extractable sections (references, navigation, captions, boilerplate) and removes them deterministically, then applies strategy-specific LLM transformation to the remainder.

Strategies:

Strategy	What it does	LLM calls
`boilerplate_removal`	Deterministic regex stripper — removes reference sections, footnotes, navigation, captions	0
`aggressive`	LLM compresses to core facts only (~10 kB)	1
`moderate`	LLM removes boilerplate, keeps factual sections verbatim (~30 kB, default)	1
`conservative`	LLM removes only navigation, references, and captions (~60–90% of original)	1
`section_aware`	Two-pass: classify each section by type, then LLM-summarise low-density sections	1–N
`budget_optimized`	Adaptive: estimates triples-per-kB from a sample, selects strategy to hit cost target	0–1

The distilled text replaces the original for all downstream stages. The original content_hash is preserved on the SourceRecord, so fragment-level deduplication continues to work correctly.

Caching: distillation outputs are cached by xxh3_128(content) + strategy + target_size. Re-ingesting an unchanged document costs zero additional LLM calls.

See Use document distillation for the full profile schema and worked examples.

4. Fragmentation¶

The fragmenter splits parsed content into compilation units. The heading fragmenter creates one fragment per heading section. Each fragment gets:

A stable fragment_key (heading path)
An xxh3_128 content hash for change detection
Character offsets for evidence span validation

5. Editorial policy gate¶

Before LLM extraction (which costs money), the editorial policy filters fragments:

min_fragment_length — skip fragments too short to contain useful knowledge
max_fragment_length — flag fragments that exceed context window limits
min_heading_depth — skip top-level headings that are just titles
allowed_languages — skip content in unsupported languages

Skipped fragments are recorded in the run stats, not silently dropped.

6. Hash deduplication¶

Each fragment's xxh3_128 hash is compared to the stored hash from the previous run. Unchanged fragments are skipped entirely — zero LLM calls for stable content.

This is the core of incremental compilation: re-ingesting a 1000-document corpus where 3 documents changed produces only 3 fragments worth of LLM calls.

7. Vocabulary pass (optional)¶

When run_mode_sequence includes vocabulary, a first pass extracts skos:Concept triples into the <vocab> named graph. This establishes canonical entity IRIs before the full extraction pass, so that relationship extraction can reference consistent entities rather than creating duplicates.

8. LLM extraction¶

The extractor sends the fragment text and profile prompt to the configured LLM and parses the response into structured triples. Each triple carries:

Subject, predicate, object — the RDF statement
Confidence — a float in [0.0, 1.0]
EvidenceSpan — exact character offsets + verbatim excerpt from the source

The EvidenceSpan contract is enforced: the excerpt must match the text at the declared offset. Fabricated citations are rejected.

9. SHACL validation¶

Extracted triples are validated against SHACL shapes:

Triples meeting the confidence threshold → trusted named graph
Triples below threshold → draft named graph (pending review)
Triples violating shape constraints → rejected with a pgc:LintFinding

10. Graph write¶

Valid triples are written to pg-ripple via load_triples_with_confidence(). Each carries:

prov:wasDerivedFrom → source fragment
pgc:confidence → extraction confidence
pgc:compiledAt → timestamp
pgc:byProfile → compiler profile reference

11. Artifact dependency registration¶

The artifact dependency graph (_riverbank.artifact_deps) records which compiled facts depend on which fragments. This enables:

Incremental invalidation — when a fragment changes, exactly the right facts are recompiled
riverbank explain — trace any fact back to its sources
Staleness detection — rendered pages know when their source facts changed