FLoRA Extractor

A Python pipeline that discovers, extracts, and monitors replication and reproduction studies for the FLoRA database.

Part of the FORRT project.

What It Does

Starting from keyword searches of academic databases, FLoRA Extractor:

Discovers candidate replication/reproduction papers from OpenAlex and curated lists
Filters false positives using rule-based and LLM classification
Extracts the target study and replication outcome from each paper
Monitors extraction progress through a web dashboard; validation happens in a separate Supabase-backed repo
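As an illustration of the rule-based half of the filtering step, here is a minimal sketch. The patterns and the `rule_based_label` helper are hypothetical, not taken from the `filter/` module; the idea is only that cheap keyword rules can settle the obvious cases and leave the ambiguous ones for the LLM classifier.

```python
import re

# Hypothetical patterns; the project's real rules live in filter/ and may differ.
POSITIVE = re.compile(r"\b(replicat\w*|reproduc\w*)\b", re.IGNORECASE)
# Biology senses of "replication" are a plausible source of false positives.
NEGATIVE = re.compile(r"\b(dna|rna|viral|virus|genome)\s+replication\b", re.IGNORECASE)


def rule_based_label(title: str, abstract: str = "") -> str:
    """Return 'keep', 'reject', or 'uncertain' for a candidate paper."""
    text = f"{title} {abstract}"
    if not POSITIVE.search(text):
        return "reject"  # no replication/reproduction wording at all
    if NEGATIVE.search(text):
        return "reject"  # wording matches a non-research-methods sense
    if POSITIVE.search(title):
        return "keep"  # a replication term in the title is treated as strong evidence
    return "uncertain"  # mentioned only in the abstract: defer to the LLM
```

In this sketch only the `uncertain` rows would need an LLM call, which keeps API usage down.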

Architecture

Stage 1: search/      → data/candidates.csv   (discover candidates)
Stage 2: filter/      → data/filtered.csv     (remove false positives)
Stage 3: extract/     → data/extracted.csv    (link original + code outcome)
Stage 4: validate/    → monitoring web app    (dashboard at localhost:5001)
                             ↕
                      Supabase (separate validation repo)

Each stage is independently runnable.
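Because each stage writes its own CSV, a small helper can report which stages still need to run. The sketch below is an assumption-laden illustration: the module names come from the Quick Start commands and the file names from the diagram above, but the `pending_stages` helper itself is hypothetical and not part of the repository.

```python
from pathlib import Path

# Stage modules and their output files, as listed in the diagram above.
STAGES = [
    ("search.run_search", "candidates.csv"),
    ("filter.run_filter", "filtered.csv"),
    ("extract.run_extract", "extracted.csv"),
]


def pending_stages(data_dir: str = "data") -> list[str]:
    """Return the stage modules whose output CSV is missing, in pipeline order."""
    root = Path(data_dir)
    return [module for module, output in STAGES if not (root / output).exists()]
```

Each returned name can be run with `python -m <module>`. Checking only for file existence is a simplification; it does not detect stale or partial outputs.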

Quick Start

git clone <repo-url>
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env   # fill in your API keys

# Run the pipeline
python -m search.run_search
python -m filter.run_filter
python -m extract.run_extract

# Start the monitoring web app
python -m validate.app   # → http://localhost:5001

See docs/setup.md for full setup instructions.

Required environment variables

RESEARCHER_EMAIL=you@example.com   # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=...                 # primary LLM (free at aistudio.google.com)

Optional: OPENAI_API_KEY, OPENROUTER_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_KEY, GROBID_URL. See .env.example.
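To show how these variables might be used, the sketch below checks that the two required variables are set and builds an OpenAlex search URL that passes the email through the `mailto` query parameter, which OpenAlex uses to identify polite API users. The helper names are hypothetical, and the sketch assumes the values from `.env` are already present in the process environment; how the pipeline actually loads them is not shown here.

```python
import os
from urllib.parse import urlencode

REQUIRED_VARS = ("RESEARCHER_EMAIL", "GEMINI_API_KEY")


def missing_env_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def openalex_search_url(query: str, email: str, per_page: int = 25) -> str:
    """Build an OpenAlex works search URL that identifies the caller by email."""
    params = {"search": query, "per-page": per_page, "mailto": email}
    return "https://api.openalex.org/works?" + urlencode(params)
```

Calling `missing_env_vars()` before the first stage gives a clear error up front instead of a failed API request partway through a run.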

Documentation

Document	Description
docs/setup.md	Installation and running the pipeline
docs/architecture.md	Module map and design decisions
docs/cli-reference.md	All CLI commands and flags
docs/csv-schema.md	CSV column definitions
docs/dashboard-guide.md	Dashboard user guide
docs/supabase-schema.md	Supabase validation table schemas
docs/testing.md	Running and writing tests
docs/README.md	Full documentation index

AI coding agent? Read CLAUDE.md first.

Data Sources

Source	Coverage
OpenAlex	Broad academic literature, free API
Semantic Scholar	Supplementary coverage
Bob Reed's Replication Network	Economics replications
I4R	Institute for Replication reports

Full-text: Unpaywall, CORE, arXiv, OSF. DOI resolution: Crossref.

Contributing

Branch from dev (feature/search, feature/filter, feature/extract, feature/validate)
Test with sample data in misc/
Open a PR to dev as soon as a feature is stable; don't wait until the end
main and dev are branch-protected; all merges require a PR review

License

MIT
