Add a custom parser
Parsers convert raw document formats into a normalized text representation that the fragmenter can split. riverbank ships parsers for Markdown and Docling-supported formats.
The base class
from riverbank.parsers.base import BaseParser, ParsedDocument


class MyParser(BaseParser):
    # Identifier for this parser
    name = "my-parser"
    # File extensions this parser handles
    supported_extensions = [".rst", ".txt"]

    def parse(self, file_path: str) -> ParsedDocument:
        # Read with an explicit encoding so the result does not depend on the platform default
        with open(file_path, encoding="utf-8") as f:
            content = f.read()
        return ParsedDocument(
            content=content,
            headings=self._extract_headings(content),
            metadata={"format": "rst"},
        )

    def _extract_headings(self, content: str) -> list[dict]:
        # Return list of {"level": int, "text": str, "char_start": int}
        ...
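The `_extract_headings` stub above is left to the implementer. As one possible starting point, here is a minimal sketch for reStructuredText-style underlined titles. It is an illustration, not riverbank's implementation: the function name `extract_rst_headings` is hypothetical, it recognizes only a subset of adornment characters, it ignores overlined titles, and it assumes `char_start` is the offset of the title line's first character within `content`.

```python
import re

# A line made of one repeated adornment character, e.g. "=====" or "-----".
# Only a subset of the characters reStructuredText allows is recognized here.
_UNDERLINE = re.compile(r"^([=\-~^#*+])\1+\s*$")


def extract_rst_headings(content: str) -> list[dict]:
    """Find underlined titles and report their level, text and character offset."""
    headings = []
    levels = {}  # adornment character -> level, assigned in order of first use
    offset = 0   # offset of the current line within content
    prev_text, prev_start = "", 0
    # keepends=True keeps line terminators, so offsets stay exact
    for line in content.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        match = _UNDERLINE.match(stripped)
        title = prev_text.strip()
        if (
            match
            and title
            and not _UNDERLINE.match(prev_text)
            and len(stripped.rstrip()) >= len(prev_text.rstrip())
        ):
            level = levels.setdefault(match.group(1), len(levels) + 1)
            headings.append({"level": level, "text": title, "char_start": prev_start})
        prev_text, prev_start = stripped, offset
        offset += len(line)
    return headings
```

Levels are assigned in the order each underline style first appears, which mirrors how reStructuredText itself infers section depth.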
Register via entry point
Key requirements
- Return a `ParsedDocument` with the full text content and heading positions
- Heading positions are used by the fragmenter to split the document
- The `supported_extensions` field determines which files your parser handles
- The parser must preserve character offsets accurately (evidence spans depend on this)
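Because evidence spans depend on offsets, it can help to sanity-check a parser's output before wiring it in. The following is a minimal sketch of such a check, assuming that `char_start` points at the first character of the heading text within `content` (the exact convention is an assumption here). `find_offset_mismatches` is a hypothetical helper, not part of riverbank.

```python
def find_offset_mismatches(content: str, headings: list[dict]) -> list[dict]:
    """Return the headings whose char_start does not point at their own text."""
    mismatches = []
    for heading in headings:
        start = heading["char_start"]
        end = start + len(heading["text"])
        # Slice the content at the claimed position and compare with the heading text
        if content[start:end] != heading["text"]:
            mismatches.append(heading)
    return mismatches
```

An empty result means every heading position lines up with the text. A non-empty result usually indicates that the parser altered the content (for example by stripping or normalizing characters) after computing the offsets.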