An intelligent digital platform that gives structured access to academic reference collections of Romanian literary heritage, covering the full path from raw InDesign exports to Elasticsearch-powered discovery across multiple corpuses.
Academic reference works in Romanian literary studies existed only as print publications and InDesign files, with no unified digital access, no editorial workflow, and no way to search or update them collaboratively. Rich content, including formatted text, images, and bibliographies, needed to be preserved.
A complete data pipeline: scrape InDesign HTML exports, extract semantic fields via CSS class mapping, serialize to JSONL, bulk-index into Elasticsearch, and serve through a FastAPI backend with a React admin frontend. The platform supports collaborative editing, audit logging, and multi-corpus search.
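The serialization and bulk-indexing steps can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the `id` field, and the sample entries are assumptions. The only fixed part is the shape of an Elasticsearch `_bulk` request body, which alternates an action line with a document line and ends with a newline.

```python
import json


def to_jsonl(entries):
    """Serialize entries as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)


def to_bulk_body(entries, index_name):
    """Build the newline-delimited body expected by the Elasticsearch _bulk API.

    Each document contributes two lines: an "index" action, then its source.
    Assumes every entry carries a unique "id" key (hypothetical field name).
    """
    lines = []
    for entry in entries:
        action = {"index": {"_index": index_name, "_id": entry["id"]}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(entry, ensure_ascii=False))
    # The bulk body must be terminated by a final newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Hypothetical sample entries, for illustration only.
    sample = [
        {"id": "eliv-0001", "corpus": "ELIV", "title": "Intrare de exemplu"},
        {"id": "eliv-0002", "corpus": "ELIV", "title": "A doua intrare"},
    ]
    print(to_jsonl(sample), end="")
    print(to_bulk_body(sample, "eliv"), end="")
```

Keeping `ensure_ascii=False` leaves Romanian diacritics readable in the JSONL files instead of escaping them.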
BeautifulSoup4 scraper maps CSS classes to semantic fields, reconstructs hierarchical text from flat HTML, and extracts image references.
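The class-to-field mapping can be illustrated as follows. To stay dependency-free, this sketch uses the standard library's `html.parser` in place of BeautifulSoup4, and it covers only the class mapping and image extraction, not the hierarchy reconstruction. The class names in `CLASS_TO_FIELD` and the helper `parse_entry` are hypothetical, since the real InDesign style names are not given here.

```python
from html.parser import HTMLParser

# Hypothetical mapping from InDesign style classes to semantic fields.
CLASS_TO_FIELD = {
    "entry-title": "title",
    "entry-body": "body",
    "entry-biblio": "bibliography",
}

# Void elements never receive a matching end tag.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class EntryParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}   # field name -> list of text blocks
        self.images = []   # src attributes of <img> tags
        self._stack = []   # mapped field (or None) per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        if tag in VOID_TAGS:
            return
        classes = (attrs.get("class") or "").split()
        field = next((CLASS_TO_FIELD[c] for c in classes if c in CLASS_TO_FIELD), None)
        if field:
            # Each mapped element starts a new text block for its field.
            self.fields.setdefault(field, []).append("")
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the nearest enclosing mapped element, if any.
        for field in reversed(self._stack):
            if field:
                self.fields[field][-1] += data
                break


def parse_entry(html):
    """Return the mapped fields (whitespace-trimmed) and image references."""
    parser = EntryParser()
    parser.feed(html)
    fields = {name: [b.strip() for b in blocks] for name, blocks in parser.fields.items()}
    return {"fields": fields, "images": parser.images}
```

Text inside unmapped inline elements, such as an italic `<span>`, is folded into the enclosing mapped paragraph, so formatting markup does not split a field.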
Elasticsearch full-text search across 5+ corpuses (ELIV, CLRV, HLRV, TLVR, DCLR) with Romanian diacritics handling and alphabetical navigation.
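One way to picture the diacritics handling is the normalization below. It is a sketch under assumptions: the function names are illustrative, and in a real deployment this kind of folding would more likely live in the search analyzer than in application code. It addresses two common issues with Romanian text: legacy cedilla forms of ș and ț, and users typing queries without diacritics.

```python
import unicodedata

# Legacy cedilla code points mapped to the standard comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalize_romanian(text):
    """Unify cedilla/comma variants and compose characters (NFC)."""
    return unicodedata.normalize("NFC", text.translate(CEDILLA_TO_COMMA))


def fold_diacritics(text):
    """Lowercase and strip combining marks so 'stiinta' matches 'Știință'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    print(fold_diacritics("\u0218tiin\u021b\u0103"))
```

Indexing both the normalized form and the folded form would let exact, diacritic-aware matches rank above folded matches.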
CKEditor 5 and React-Quill for collaborative editing of scholarly entries, with image management, captions, and formatting preservation.
Timeline-based navigation for historical corpuses (DCLR) and alphabetical A-Z navigation that includes Romanian-specific letters (Ă, Â, Î, Ș, Ț).
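Alphabetical navigation with Romanian letters needs a sort order that plain code-point sorting does not provide, since Ș and Ț would otherwise land after Z. The sketch below groups titles by initial using the 31-letter Romanian alphabet. The function names are assumptions, and ignoring spaces and punctuation in the sort key is a deliberate simplification.

```python
import unicodedata

# Romanian alphabet in collation order: A, Ă, Â, B ... I, Î ... S, Ș, T, Ț ... Z.
ROMANIAN_ALPHABET = "A\u0102\u00c2BCDEFGHI\u00ceJKLMNOPQRS\u0218T\u021aUVWXYZ"
LETTER_RANK = {letter: rank for rank, letter in enumerate(ROMANIAN_ALPHABET)}


def sort_key(title):
    """Rank each letter by its alphabet position; other characters are skipped."""
    upper = unicodedata.normalize("NFC", title).upper()
    return [LETTER_RANK[ch] for ch in upper if ch in LETTER_RANK]


def group_by_initial(titles):
    """Return {letter: sorted titles}, with letters in Romanian alphabet order."""
    groups = {letter: [] for letter in ROMANIAN_ALPHABET}
    for title in sorted(titles, key=sort_key):
        initial = unicodedata.normalize("NFC", title.strip())[:1].upper()
        if initial in groups:
            groups[initial].append(title)
    # Drop empty letters so the navigation only shows populated sections.
    return {letter: items for letter, items in groups.items() if items}
```

Because the dictionary is built in alphabet order, iterating over the result yields the navigation letters already in the right sequence.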
Full audit trail of all modifications (who changed what, when, and from which IP), maintaining academic integrity across collaborative workflows.
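An audit record of this kind could be modeled roughly as below. The `AuditRecord` fields and the `diff_to_audit` helper are hypothetical names chosen for illustration. The sketch compares an entry before and after an edit and emits one record per changed field, capturing the user, the timestamp, and the IP address.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    user: str                 # who made the change
    entry_id: str             # which entry was changed
    field: str                # which field was changed
    old_value: Optional[str]  # value before the edit (None if newly added)
    new_value: Optional[str]  # value after the edit (None if removed)
    ip_address: str           # where the request came from
    timestamp: str            # when, as an ISO 8601 UTC string


def diff_to_audit(user, ip_address, entry_id, before, after, now=None):
    """Return one AuditRecord per field whose value differs between versions."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    records = []
    for field in sorted(set(before) | set(after)):
        old, new = before.get(field), after.get(field)
        if old != new:
            records.append(
                AuditRecord(user, entry_id, field, old, new, ip_address, stamp)
            )
    return records
```

Freezing the dataclass makes records immutable once created, which suits an append-only log.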
Romanian and English interface with role-based permissions ensuring only authorized scholars can edit entries.