forrtproject/flora-extractor

FLoRA Extractor

A Python pipeline that discovers, extracts, and monitors replication and reproduction studies for the FLoRA database.

Part of the FORRT project.


What It Does

Starting from keyword searches of academic databases, FLoRA Extractor:

  1. Discovers candidate replication/reproduction papers from OpenAlex and curated lists
  2. Filters false positives using rule-based and LLM classification
  3. Extracts the target study and replication outcome from each paper
  4. Monitors extraction progress through a web dashboard; validation happens in a separate Supabase-backed repo
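To illustrate the rule-based half of step 2, here is a minimal sketch of a title filter. The pattern list and the function name `looks_like_false_positive` are hypothetical and are not taken from the `filter/` module. The sketch only shows the general idea that a keyword search for "replication" also matches biological uses of the word.

```python
import re

# Hypothetical patterns (not from the repository): biological senses of
# "replication" that a keyword search for replication studies also matches.
FALSE_POSITIVE_PATTERNS = [
    re.compile(r"\b(dna|rna|viral|virus|genome)\s+replication\b", re.IGNORECASE),
]


def looks_like_false_positive(title):
    """Return True if the title matches any hypothetical exclusion pattern."""
    return any(pattern.search(title) for pattern in FALSE_POSITIVE_PATTERNS)
```

A rule like this is cheap to run over every candidate, so it could plausibly run before the LLM classifier and leave only ambiguous titles for it. Whether the pipeline orders the two steps this way is an assumption.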

Architecture

Stage 1: search/      → data/candidates.csv   (discover candidates)
Stage 2: filter/      → data/filtered.csv     (remove false positives)
Stage 3: extract/     → data/extracted.csv    (link original + code outcome)
Stage 4: validate/    → monitoring web app    (dashboard at localhost:5001)
                             ↕
                      Supabase (separate validation repo)

Each stage is independently runnable.
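Because each stage is a separate module entry point, the three pipeline commands can also be chained from one script. The following is a minimal sketch, assuming each module exits with a non-zero status on failure (not confirmed by this README). `STAGES`, `stage_command`, and `run_pipeline` are hypothetical names.

```python
import subprocess
import sys

# Module entry points from Quick Start, paired with the output files
# shown in the Architecture diagram.
STAGES = [
    ("search.run_search", "data/candidates.csv"),
    ("filter.run_filter", "data/filtered.csv"),
    ("extract.run_extract", "data/extracted.csv"),
]


def stage_command(module):
    """Build the argv list equivalent to `python -m <module>`."""
    return [sys.executable, "-m", module]


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run each stage in order, stopping at the first failure.

    Assumption: a failing stage returns a non-zero exit code, which
    check=True turns into subprocess.CalledProcessError.
    """
    for module, output in stages:
        print(f"Running {module} -> {output}")
        runner(stage_command(module), check=True)
```

Calling `run_pipeline()` from the repository root would run the three stages in sequence. Passing a shorter `stages` list runs a single stage.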


Quick Start

git clone <repo-url>
cd flora-extractor
pip install -r requirements.txt
cp .env.example .env   # fill in your API keys

# Run the pipeline
python -m search.run_search
python -m filter.run_filter
python -m extract.run_extract

# Start the monitoring web app
python -m validate.app   # → http://localhost:5001

See docs/setup.md for full setup instructions.


Required environment variables

RESEARCHER_EMAIL=you@example.com   # for OpenAlex/Crossref API politeness
GEMINI_API_KEY=...                 # primary LLM (free at aistudio.google.com)

Optional: OPENAI_API_KEY, OPENROUTER_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_KEY, GROBID_URL. See .env.example.
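A quick pre-flight check can catch a missing key before a long run. This is a minimal sketch, assuming the variables have already been loaded into the process environment (how the pipeline itself reads `.env` is not covered here). `missing_required` is a hypothetical helper, not part of the repository.

```python
import os

# Variable names as listed in this README.
REQUIRED = ("RESEARCHER_EMAIL", "GEMINI_API_KEY")
OPTIONAL = (
    "OPENAI_API_KEY",
    "OPENROUTER_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_SERVICE_KEY",
    "GROBID_URL",
)


def missing_required(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]
```

For example, `missing_required()` returns an empty list when both required variables are set.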


Documentation

Document                   Description
docs/setup.md              Installation and running the pipeline
docs/architecture.md       Module map and design decisions
docs/cli-reference.md      All CLI commands and flags
docs/csv-schema.md         CSV column definitions
docs/dashboard-guide.md    Dashboard user guide
docs/supabase-schema.md    Supabase validation table schemas
docs/testing.md            Running and writing tests
docs/README.md             Full documentation index

AI coding agent? Read CLAUDE.md first.


Data Sources

Source                            Coverage
OpenAlex                          Broad academic literature, free API
Semantic Scholar                  Supplementary coverage
Bob Reed's Replication Network    Economics replications
I4R                               Institute for Replication reports

Full-text: Unpaywall, CORE, arXiv, OSF. DOI resolution: Crossref.
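`RESEARCHER_EMAIL` is described above as being for API politeness. Both OpenAlex and Crossref have documented a `mailto` query parameter for this purpose, so a request URL might be built as in the sketch below. The helper names are hypothetical, and whether the pipeline builds its requests this way is an assumption; check the current API documentation before relying on it.

```python
from urllib.parse import quote, urlencode


def openalex_search_url(query, email):
    """Build an OpenAlex works search URL that identifies the caller."""
    params = {"search": query, "mailto": email}
    return "https://api.openalex.org/works?" + urlencode(params)


def crossref_work_url(doi, email):
    """Build a Crossref lookup URL for one DOI, with the DOI percent-encoded."""
    encoded_doi = quote(doi, safe="")
    return f"https://api.crossref.org/works/{encoded_doi}?" + urlencode({"mailto": email})
```

The functions only build strings, so they can be used with any HTTP client.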


Contributing

  1. Branch from dev (feature/search, feature/filter, feature/extract, feature/validate)
  2. Test with sample data in misc/
  3. Open a PR to dev when a feature is stable — don't wait until the end
  4. main and dev are branch-protected; all merges require a PR review

License

MIT
