A Python pipeline that discovers, extracts, and monitors replication and reproduction studies for the FLoRA database.
Part of the FORRT project.
Starting from keyword searches of academic databases, FLoRA Extractor:
- Discovers candidate replication/reproduction papers from OpenAlex and curated lists
- Filters false positives using rule-based and LLM classification
- Extracts the target study and replication outcome from each paper
- Monitors extraction progress through a web dashboard; validation happens in a separate Supabase-backed repo
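To make the rule-based half of the filtering step concrete, here is a minimal sketch of a title pre-filter. The patterns, the function name `rule_based_label`, and the label strings are hypothetical illustrations, not the rules the pipeline actually uses; the idea is only that cheap rules can reject obvious false positives (for example, biological "DNA replication") before anything is sent to an LLM classifier.

```python
import re

# Hypothetical patterns for illustration only; the real rules live in filter/.
POSITIVE = re.compile(r"\b(replicat\w*|reproduc\w*)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(dna|viral|virus|cell)\s+replication\b", re.IGNORECASE)


def rule_based_label(title: str) -> str:
    """Return 'reject' for obvious false positives, else 'candidate'."""
    # Biological senses of "replication" are rejected first.
    if NEGATIVE.search(title):
        return "reject"
    # Titles that mention replication or reproduction stay in the pool.
    if POSITIVE.search(title):
        return "candidate"
    # No relevant keyword at all.
    return "reject"
```

Under this sketch, titles labelled `candidate` would still go on to the LLM classification step; the rules only narrow the pool.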
Stage 1: search/ → data/candidates.csv (discover candidates)
Stage 2: filter/ → data/filtered.csv (remove false positives)
Stage 3: extract/ → data/extracted.csv (link original + code outcome)
Stage 4: validate/ → monitoring web app (dashboard at localhost:5001)
↕
Supabase (separate validation repo)
Each stage is independently runnable.
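Because the stages are independent, a small driver could chain them when you want a full run. The sketch below assumes the module names from the run commands in this README (`search.run_search`, `filter.run_filter`, `extract.run_extract`); the helper names `stage_command` and `run_stages` are hypothetical and not part of the repo.

```python
import subprocess
import sys

# Module names are taken from the stage commands in this README.
STAGES = ["search.run_search", "filter.run_filter", "extract.run_extract"]


def stage_command(module: str) -> list[str]:
    """Build the `python -m <module>` command for one stage."""
    return [sys.executable, "-m", module]


def run_stages(modules: list[str], dry_run: bool = False) -> list[list[str]]:
    """Run stages in order; a failing stage raises and stops the run."""
    commands = [stage_command(m) for m in modules]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_stages(STAGES)` from the repository root would run Stages 1 to 3 in sequence; passing `dry_run=True` only returns the commands, which is handy for checking what would be executed.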
git clone <repo-url>
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env # fill in your API keys
# Run the pipeline
python -m search.run_search
python -m filter.run_filter
python -m extract.run_extract
# Start the monitoring web app
python -m validate.app # → http://localhost:5001
See docs/setup.md for full setup instructions.
RESEARCHER_EMAIL=you@example.com # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=... # primary LLM (free at aistudio.google.com)
Optional: OPENAI_API_KEY, OPENROUTER_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_KEY, GROBID_URL. See .env.example.
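A quick pre-flight check can catch a missing key before a long run. This is a minimal sketch that treats the two variables listed first as required, which this README implies but does not state outright; it also assumes the variables are already present in the process environment (loading `.env` itself is left to the pipeline). The function name `missing_settings` is hypothetical.

```python
import os

# Assumed required, based on the order in which this README lists them.
REQUIRED = ("RESEARCHER_EMAIL", "GEMINI_API_KEY")


def missing_settings(env=None) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]
```

An empty list means both values are set; anything else names the variables still to fill in.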
| Document | Description |
|---|---|
| docs/setup.md | Installation and running the pipeline |
| docs/architecture.md | Module map and design decisions |
| docs/cli-reference.md | All CLI commands and flags |
| docs/csv-schema.md | CSV column definitions |
| docs/dashboard-guide.md | Dashboard user guide |
| docs/supabase-schema.md | Supabase validation table schemas |
| docs/testing.md | Running and writing tests |
| docs/README.md | Full documentation index |
AI coding agent? Read CLAUDE.md first.
| Source | Coverage |
|---|---|
| OpenAlex | Broad academic literature, free API |
| Semantic Scholar | Supplementary coverage |
| Bob Reed's Replication Network | Economics replications |
| I4R | Institute for Replication reports |
Full-text: Unpaywall, CORE, arXiv, OSF. DOI resolution: Crossref.
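For orientation, here is roughly what a polite OpenAlex keyword query looks like. The sketch only builds the request URL, using the public `works` endpoint with its `search`, `per-page`, and `mailto` parameters; the search term, page size, and the helper name `build_openalex_url` are illustrative assumptions, and the pipeline's actual queries may differ.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(query: str, email: str, per_page: int = 25) -> str:
    """Build a works-search URL that identifies the caller via `mailto`."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    # urlencode escapes spaces and '@' so the URL is safe to request as-is.
    return f"{OPENALEX_WORKS}?{urlencode(params)}"
```

The `email` argument is where the `RESEARCHER_EMAIL` value would go, which is why that setting is described as being for API politeness.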
- Branch from dev (feature/search, feature/filter, feature/extract, feature/validate)
- Test with sample data in misc/
- Open a PR to dev when a feature is stable; don't wait until the end
- main and dev are branch-protected; all merges require a PR review
- FLoRA database — the database this pipeline feeds
- flora_search_approaches — original R-based pipeline
MIT