Community scouting + GTDB grounding + computational-provenance; 29 new community records #180

Open
realmarcin wants to merge 6 commits into main from claude/metpo-cultivation-proposal

Conversation

@realmarcin

Contributor

Summary

Adds a literature-scouting → curation pipeline, GTDB grounding of taxa (from the local kg-microbe NCBI↔GTDB mapping), a computational-provenance schema structure, and 29 new evidence-backed community records produced by that pipeline.

Note: this branch also carries the earlier commit 4447ea4 (METPO ROBOT-template cultivation proposal), which predates this work. Happy to split it into its own PR if preferred.

Schema (regenerated datamodel)

  • ComputationalProvenance on EvidenceItem — queryable tool/version/model/medium provenance for model-derived (COMPUTATIONAL) evidence (+ ComputationalPredictionTypeEnum).
  • GtdbClassification on TaxonDescriptor: gtdb_id/taxon/lineage/ncbi_source_id/majority_fraction/is_reclassified/mapping_source. gtdb_id is a pattern-checked string, not an OAK-bound term, so id↔label validation ignores it.
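
As a rough illustration of what "pattern-checked" means for gtdb_id, the sketch below builds a CURIE from a rank prefix and a taxon name and tests it against the pattern `^GTDB:[cdfgops]__.+` that appears in the review excerpt of this PR. The helper names `to_gtdb_curie` and `is_valid_gtdb_id` are hypothetical and are not taken from the repository.

```python
import re

# Pattern as it appears in the schema excerpt for the gtdb_id slot.
GTDB_ID_PATTERN = re.compile(r"^GTDB:[cdfgops]__.+")


def to_gtdb_curie(rank_prefix: str, taxon_name: str) -> str:
    """Build a GTDB CURIE; spaces in the taxon name become underscores.

    Hypothetical helper for illustration only.
    """
    return f"GTDB:{rank_prefix}__{taxon_name.strip().replace(' ', '_')}"


def is_valid_gtdb_id(gtdb_id: str) -> bool:
    """Return True when the string matches the gtdb_id pattern."""
    return GTDB_ID_PATTERN.match(gtdb_id) is not None
```

Because the check is purely syntactic, a string such as `GTDB:g__Agrobacterium` passes without any lookup against an ontology, which is consistent with id↔label validation skipping this slot.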

Skills & scripts

  • scout-communities (scripts/scout_communities.py, scout_loop.py) — Europe PMC discovery of newly published communities, deduped vs kb/communities/, community-signal scoring, loop-until-dry sweep.
  • ground-taxa-gtdb (scripts/gtdb_ground.py): rank-aware GTDB grounding. Species → s__ (NCBI id lookup, then species-name fallback); genus/family/order → g__/f__/o__ by genome-weighted majority (≥50%), else AMBIGUOUS. --apply writes gtdb_classification blocks into a community file via add-only text edits. The mapping data/raw/NCBI2GTDB.tsv.gz is sourced from a local kg-microbe checkout.

Records: 29 new (CommunityMech:000272, 000274–000302)

From a scout sweep: 40 score≥4 leads → triaged to 30 curatable communities → minus 1 exact-paper duplicate and 3 leads with no citable source → plus those 3 leads once their DOIs were recovered via CrossRef. Each record grounds taxa in NCBITaxon (OAK) + GTDB + ENVO, with verbatim evidence snippets and abstract-supported interactions. GTDB reclassifications are captured (e.g. A. deltae → A. leguminum, Enterococcus → Enterococcus_B, C. aceticum → Clostridium_W aceticum). The records span bioremediation, syntrophy/electrosynthesis, lignocellulose, and gut/diet consortia. 000273 was intentionally skipped (duplicate).

  • Two ANME-SRB records (000134, 000276) cross-referenced via CROSS_REFERENCE curation events as consolidation candidates.

Validation

validate-all (all records), validate-terms-all, per-record validate + validate-terms, and the 227-test suite all pass. No duplicate papers across the KB or within the batch.

Known follow-ups

  • 000274 and 000285 are thin (members named only as functional groups); their source papers are closed-access (Unpaywall confirms no OA), so enrichment needs full-text access.
  • Genus/family GTDB grounding uses a ≥50% genome-majority cutoff; borderline cases (e.g. Bacillus at 0.51 across 102 GTDB genera) are flagged in mapping_source.

🤖 Generated with Claude Code

realmarcin and others added 6 commits June 21, 2026 21:07
Lifts the cultivation hardware/mode slice added in PR #171 into a new METPO
proposal cohort: CultivationModeEnum (10 leaves), CultivationSystemEnum (13
minted leaves + BIOREACTOR_UNSPECIFIED mapped to existing OBI:0001046), under a
new "microbial community cultivation setup" domain class, plus 3 cultivation
object properties.

- Classes: METPO:1008100-1008132 (above the v1 high-water mark 1008013; no overlap)
- Predicates: METPO:2008100-2008102 (above 2008002)
- Subset: metpo_communitymech_2026_06
- Term metadata reused from vocab/cultivation_terms.yaml

Verified: ROBOT column counts (11/12), full enum coverage, parent integrity,
no duplicate or v1-overlapping IDs. METPO submission is downstream/external.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…schema

Discovery + grounding + provenance tooling, with one pilot record exercising all three.

Skills & scripts:
- scout-communities: Europe PMC discovery of newly published communities, deduped
  vs kb/communities (cited PMID/DOI + name overlap), community-signal scoring, and a
  loop-until-dry multi-angle sweep (scripts/scout_communities.py, scout_loop.py).
- ground-taxa-gtdb: GTDB grounding from the local kg-microbe NCBI2GTDB mapping;
  resolves NCBITaxon id/name/community to GTDB CURIE+lineage+confidence, flags
  reclassifications (scripts/gtdb_ground.py).
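
The dedupe-then-score flow described for scout-communities could look roughly like the sketch below. It is an assumption-laden illustration: only the four signal terms come from the review excerpt of this PR, while the lead fields (`pmid`, `doi`, `title`, `abstract`), the one-point-per-term scoring, and the function names are hypothetical. The name-overlap check mentioned above is omitted.

```python
# Signal terms visible in the review excerpt; the real list is likely longer.
SIGNAL_TERMS = ("crossfeeding", "syntrophy", "syntrophic", "mutualis")


def signal_score(text: str) -> int:
    """Count distinct signal terms found in the text (hypothetical scoring)."""
    lowered = text.lower()
    return sum(1 for term in SIGNAL_TERMS if term in lowered)


def lead_ids(lead: dict) -> set:
    """Collect the citable identifiers (PMID, DOI) carried by a lead."""
    return {v.lower() for v in (lead.get("pmid"), lead.get("doi")) if v}


def scout(leads: list, known_ids: set, min_score: int = 1) -> list:
    """Drop leads already cited in the KB or seen earlier, then rank by score."""
    seen = {i.lower() for i in known_ids}
    kept = []
    for lead in leads:
        ids = lead_ids(lead)
        if ids & seen:
            continue  # already in the KB or earlier in this sweep
        seen |= ids
        text = f"{lead.get('title', '')} {lead.get('abstract', '')}"
        score = signal_score(text)
        if score >= min_score:
            kept.append({**lead, "score": score})
    return sorted(kept, key=lambda item: item["score"], reverse=True)
```

Keeping `seen` across the whole sweep is what lets a multi-angle loop dedupe globally rather than per query.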

Schema (regenerated datamodel):
- ComputationalProvenance on EvidenceItem: queryable tool/version/model/medium
  provenance for model-derived (COMPUTATIONAL) evidence + ComputationalPredictionTypeEnum.
- GtdbClassification on TaxonDescriptor: gtdb_id/taxon/lineage/ncbi_source_id/
  majority_fraction/is_reclassified/mapping_source (gtdb_id is a pattern-checked
  string, not an OAK-bound Term, so id-label validation ignores it).

Pilot record CommunityMech:000272 (SynCom Y, A. deltae + B. velezensis biofilm
co-culture): NCBITaxon + GTDB grounding on both taxa (A. deltae -> GTDB
Agrobacterium leguminum, reclassified), cross-feeding interactions with
computational_provenance (CarveMe / COBRA Toolbox v3.0 / FBA, LBGM, SRA genomes).

Validated: linkml-validate + validate-terms on the record, validate-all (286
records), 197 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ty handling

The kg-microbe NCBI2GTDB table is keyed on strain/genome-level NCBI ids, so
species-level ids (e.g. Shewanella oneidensis NCBITaxon:70863) miss on exact-id
lookup even though the species is in GTDB. Fall back to NCBI-species-name
matching; when GTDB splits one NCBI species into several (e.g. Bacillus cereus
-> Bacillus_A cereus/anthracis/thuringiensis...), report AMBIGUOUS and emit no
forced grounding rather than guessing.
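
A minimal sketch of the lookup order described above: exact NCBI id first, then NCBI species name, with AMBIGUOUS when the name maps to more than one GTDB species. The row layout (`ncbi_id`, `ncbi_species`, `gtdb_species`), the status strings other than AMBIGUOUS, and the toy rows with placeholder strain ids are assumptions, not the actual NCBI2GTDB columns.

```python
# Toy rows keyed on placeholder strain-level ids (not real NCBI ids).
TOY_ROWS = [
    {"ncbi_id": "strain-1", "ncbi_species": "Shewanella oneidensis",
     "gtdb_species": "s__Shewanella oneidensis"},
    {"ncbi_id": "strain-2", "ncbi_species": "Bacillus cereus",
     "gtdb_species": "s__Bacillus_A cereus"},
    {"ncbi_id": "strain-3", "ncbi_species": "Bacillus cereus",
     "gtdb_species": "s__Bacillus_A anthracis"},
]


def ground_species(ncbi_id: str, ncbi_name: str, rows: list) -> dict:
    """Ground a species by id, then by name; never guess on a GTDB split."""
    for row in rows:
        if row["ncbi_id"] == ncbi_id:
            return {"status": "GROUNDED", "gtdb_species": row["gtdb_species"],
                    "via": "id"}
    candidates = sorted({r["gtdb_species"] for r in rows
                         if r["ncbi_species"] == ncbi_name})
    if len(candidates) == 1:
        return {"status": "GROUNDED", "gtdb_species": candidates[0],
                "via": "name"}
    if candidates:
        # One NCBI species split into several GTDB species: emit no grounding.
        return {"status": "AMBIGUOUS", "gtdb_species": None,
                "candidates": candidates}
    return {"status": "NOT_FOUND", "gtdb_species": None}
```

With these toy rows, a species-level id such as NCBITaxon:70863 misses the id lookup and resolves through the name fallback, while "Bacillus cereus" comes back AMBIGUOUS with both candidates listed.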

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Batch-minted from the scout-communities discovery sweep: 40 score>=4 leads
triaged to 30 curatable communities (6 reviews, 4 too-vague, 1 methods paper
dropped), minus 1 exact-paper duplicate of the pilot and 3 leads with no
citable source.

Each record was drafted by an isolated agent against a shared spec and grounds
taxa in NCBITaxon (OAK canonical labels) + GTDB (local kg-microbe NCBI2GTDB
mapping) + ENVO, with abstract-verbatim evidence snippets and abstract-supported
ecological interactions. GTDB reclassifications are captured where they occur
(e.g. Agrobacterium deltae -> Agrobacterium leguminum; Clostridium aceticum ->
Clostridium_W aceticum). 11/26 carry species-level GTDB grounding; the rest have
genus-rank, eukaryotic, or GTDB-ambiguous taxa (genus-level grounding is a
pending follow-up).

Spans bioremediation, syntrophy/electrosynthesis, lignocellulose, and gut/diet
consortia. IDs 000273/000278/000292/000302 intentionally skipped (dup + 3
no-source leads).

Validated: validate-all (all records), per-record validate + validate-terms,
227-test suite. No duplicate papers across the KB or within the batch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ground taxa at the rank of the input, not species-only:
- Species (binomial) -> GTDB:s__ via NCBI id then species-name fallback; GTDB
  species split -> AMBIGUOUS.
- Genus/family/order/... (single-name label) -> GTDB:g__/f__/o__: aggregate the
  GTDB rank column over all genomes under the NCBI taxon and ground to the GTDB
  taxon holding a >=50% genome majority, else AMBIGUOUS. mapping_source records
  the rank and the number of GTDB taxa under the NCBI taxon (e.g. NCBI genus
  Bacillus -> g__Bacillus at 0.51, 102 GTDB genera; Shewanella/Bifidobacterium
  ~1.0). Corrects the earlier misconception that genus rank is "not in GTDB" —
  GTDB is a full hierarchy; the tool just hadn't grounded above species.
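
The genome-majority rule above can be sketched as follows, assuming the input is simply one GTDB taxon label per genome under the NCBI taxon. The function name, the returned keys other than `majority_fraction`, and the status strings other than AMBIGUOUS are hypothetical.

```python
from collections import Counter


def ground_by_majority(gtdb_labels: list, threshold: float = 0.5) -> dict:
    """Ground to the GTDB taxon holding at least `threshold` of the genomes."""
    if not gtdb_labels:
        return {"status": "NOT_FOUND", "gtdb_taxon": None}
    counts = Counter(gtdb_labels)
    top_taxon, top_count = counts.most_common(1)[0]
    fraction = top_count / len(gtdb_labels)
    result = {
        "majority_fraction": round(fraction, 2),
        "n_gtdb_taxa": len(counts),  # how many GTDB taxa the NCBI taxon spans
    }
    if fraction >= threshold:
        return {"status": "GROUNDED", "gtdb_taxon": top_taxon, **result}
    return {"status": "AMBIGUOUS", "gtdb_taxon": None, **result}
```

In this sketch an exact 50/50 tie between two taxa resolves to whichever label was counted first, so a real implementation may want an explicit tie rule.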

New --apply mode writes gtdb_classification into a community file via add-only
text edits (no YAML round-trip, so plain-scalar wrapping and everything else is
byte-for-byte unchanged; skips taxa already grounded and AMBIGUOUS ones).
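
One way to picture an add-only text edit is the sketch below: it finds an anchor line, copies that line's indentation, and splices new lines in after it, leaving every existing byte untouched. This is an illustration under assumptions; the function name, the anchor-matching rule, and the toy record layout are hypothetical and do not reflect the real community file structure.

```python
def add_block_after(text: str, anchor: str, block: list) -> str:
    """Insert `block` after the first line whose stripped content equals `anchor`.

    The text is never parsed as YAML, so existing lines are preserved as-is.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() != anchor:
            continue
        indent = line[: len(line) - len(line.lstrip())]
        newline = "\r\n" if line.endswith("\r\n") else "\n"
        if not line.endswith(("\n", "\r")):
            lines[i] = line + newline  # anchor was the unterminated last line
        new_lines = [f"{indent}{entry}{newline}" for entry in block]
        return "".join(lines[: i + 1] + new_lines + lines[i + 1:])
    return text  # anchor not found: nothing is written


# Toy record (hypothetical layout).
SAMPLE = (
    "members:\n"
    "  - label: Shewanella\n"
    '    note: "kept   exactly  as written"\n'
    "    rank: genus\n"
)

UPDATED = add_block_after(
    SAMPLE,
    "rank: genus",
    ["gtdb_classification:", "  gtdb_id: GTDB:g__Shewanella"],
)
```

Here `UPDATED` differs from `SAMPLE` only by the two inserted lines, which is the property that keeps diffs small compared with a parse-and-dump round trip.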

Backfilled 24 genus-level blocks across 12 batch records (274-301). GTDB is
bacteria/archaea only, so eukaryotic members still never ground.

Validated: per-record validate + validate-terms on all changed files, 227 tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sortia

Task 4 (no-source leads): the 3 leads dropped earlier for lacking a citable id
were not false positives — all real 2025 SynCom papers indexed in Europe PMC
only via the AGR source (no PMID). Recovered their DOIs via CrossRef and curated
them into the reserved gap ids:
- CommunityMech:000278  SynCom BsBv cigar-tobacco leaf fermentation (B. safensis +
  B. velezensis) — doi:10.1016/j.indcrop.2025.122621
- CommunityMech:000292  Pseudomonas + Rahnella Artemisia argyi phytoremediation
  SynCom — doi:10.1016/j.indcrop.2025.122518
- CommunityMech:000302  MetG2 rhizobacteria SynCom for sugarcane stress resilience
  — doi:10.1016/j.rhisph.2025.101142
Each grounds taxa in NCBITaxon + GTDB (genus/species as named) + ENVO with
verbatim DOI-abstract evidence.
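
The commit does not say how the CrossRef lookup was done. As one plausible sketch, a title can be sent to the public CrossRef REST API `works` endpoint via its `query.bibliographic` parameter, and a DOI accepted only when a returned title matches exactly after normalization. The function names and the exact-match rule are assumptions; the network call itself is left out so the sketch stays offline.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 5) -> str:
    """Build a CrossRef works query URL for a bibliographic title search."""
    params = urlencode({"query.bibliographic": title, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def pick_doi(response: dict, title: str):
    """Return the DOI of the first item whose title matches exactly, else None."""
    wanted = _normalize(title)
    for item in response.get("message", {}).get("items", []):
        for candidate in item.get("title") or []:
            if _normalize(candidate) == wanted:
                return item.get("DOI")
    return None
```

Requiring an exact normalized title match is deliberately conservative: a lead with no confident match stays without a DOI rather than being assigned a near-miss.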

Task 3 (dedup/curation): added CROSS_REFERENCE curation events linking
CommunityMech:000134 (canonical natural ANME-SRB seep consortium) and
CommunityMech:000276 (redox-conduction/DIET study of the same syntrophy) as
candidates for future consolidation.

Task 2 (thin stubs 000274, 000285): deep-research enrichment attempted; both
source papers are paywalled with no OA full text and no accessible source names
their members, so no taxa were added (no fabrication). Left as-is.

Validated: per-record validate + validate-terms on all changed files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 22:42

Copilot AI left a comment


Pull request overview

This PR expands CommunityMech’s curation workflow and schema to support (1) automated literature scouting for new microbial communities, (2) GTDB grounding alongside existing NCBITaxon grounding, and (3) structured computational provenance for model-derived evidence—then adds a batch of new community records and cached reference artifacts produced by that pipeline.

Changes:

  • Extend the LinkML schema with ComputationalProvenance (plus ComputationalPredictionTypeEnum) and GtdbClassification for queryable computational evidence + GTDB grounding.
  • Add Europe PMC scouting scripts (scout_communities.py, scout_loop.py) and corresponding just recipes + Claude skill docs.
  • Add a METPO cultivation cohort proposal and many new/updated KB community records + reference-cache artifacts.

Reviewed changes

Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/communitymech/schema/communitymech.yaml | Adds schema structures for computational provenance on evidence items and GTDB grounding on taxon descriptors. |
| scripts/scout_loop.py | Adds a “loop-until-dry” runner that repeatedly scouts Europe PMC across multiple query angles with global deduping and consolidated output. |
| scripts/scout_communities.py | Adds the core Europe PMC scouting + scoring + deduping implementation and optional stub emission. |
| references_cache/PMID_42125264.md | Adds cached abstract-only reference content for a newly curated PMID used by new community records. |
| references_cache/epmc_chem_PMC8072228.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC5373366.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC4984294.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC4510191.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC242454.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC1287685.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| proposals/metpo_communitymech_cultivation_v1/proposal.md | Adds narrative proposal for lifting cultivation setup/mode/system terms into METPO. |
| proposals/metpo_communitymech_cultivation_v1/metpo_proposal_properties_robot.tsv | Adds ROBOT template rows for proposed cultivation-related object properties. |
| proposals/metpo_communitymech_cultivation_v1/metpo_proposal_classes_robot.tsv | Adds ROBOT template rows for proposed cultivation-related classes. |
| kb/communities/Thermophilic_Lignocellulose_Composting_SynCom_Biosanitization.yaml | Adds a new curated community record (composting SynCom; includes GTDB grounding where available). |
| kb/communities/SynCom_Y_Agrobacterium_Bacillus_Biofilm_Biocontrol_Coculture.yaml | Adds a new curated community record; exercises computational_provenance for model-predicted cross-feeding evidence. |
| kb/communities/SynCom_Pseudomonas_Rahnella_Artemisia_Phytoremediation.yaml | Adds a new curated community record (rhizosphere SynCom; GTDB genus-level grounding). |
| kb/communities/SynCom_MetG2_Rhizobacteria_Sugarcane_Stress_Resilience.yaml | Adds a new curated community record (plant host + phylum-level microbial grounding). |
| kb/communities/SynCom_Chlorella_sorokiniana_Biogas_Slurry_Coupling_System.yaml | Adds a new curated community record (SynCom–microalga coupling system). |
| kb/communities/SynCom_BsBv_Cigar_Tobacco_Leaf_Fermentation.yaml | Adds a new curated community record (two-member Bacillus SynCom for fermentation). |
| kb/communities/Shewanella_Pseudomonas_Fe0_Electrosyntrophic_Denitrifying_Consortium.yaml | Adds a new curated community record (Fe0-driven electro-syntrophy denitrifying pair). |
| kb/communities/Shewanella_oneidensis_Rhodopseudomonas_palustris_Electrosyntrophic_Coculture.yaml | Adds a new curated community record (engineered electro-syntrophic co-culture). |
| kb/communities/Pseudomonas_stutzeri_Rhodococcus_Naphthalene_Biochar_Engineered_Consortium.yaml | Adds a new curated community record (biochar-bridged engineered bioremediation consortium). |
| kb/communities/Pseudomonas_putida_PpTE_Rhodococcus_RDK17_Terephthalic_Acid_Consortium.yaml | Adds a new curated community record (TPA-fed coculture with competition mechanisms). |
| kb/communities/Pleuromutilin_Degrading_Artificial_Consortium_5_Strain.yaml | Adds a new curated community record (5-strain antibiotic-degrading consortium; GTDB grounding for subset). |
| kb/communities/Pinus_armandii_Endophytic_Biocontrol_SynCom.yaml | Adds a new curated community record (endophytic biocontrol SynCom; GTDB grounding where available). |
| kb/communities/Phosphitivorax_Methanoculleus_Lithosyntrophy_Phosphite_Coculture.yaml | Adds a new curated community record (lithosyntrophy phosphite-oxidizing syntrophic pair). |
| kb/communities/Parachlorella_Saccharomyces_Mutualistic_Coculture.yaml | Adds a new curated community record (microalga–yeast mutualistic co-culture). |
| kb/communities/Multi_stage_Anaerobic_Digestion_SynCom_YSJ_and_SynCom_J.yaml | Adds a new curated community record (multi-stage AD SynComs; functional-group level). |
| kb/communities/Infant_Gut_Prebiotic_Response_SynCom.yaml | Adds a new curated community record (infant-gut SynCom; prebiotic-driven interaction rewiring). |
| kb/communities/Ensifer_YF2_Sphingobacterium_Y2_Polyethylene_Degrading_Consortium.yaml | Adds a new curated community record (PE degradation division of labor; GTDB grounding where available). |
| kb/communities/Electrosynthetic_Consortia_Shewanella_Clostridium_Acetobacterium_Acetate.yaml | Adds a new curated community record (defined MES consortia; multiple IET modes). |
| kb/communities/Electrostimulated_Mixotrophic_VFA_Producing_Enrichment_Consortium.yaml | Adds a new curated community record (electrostimulated enrichment; GTDB reclassification captured). |
| kb/communities/Ecoli_Bifidobacterium_bifidum_Infant_gut_HMO_Mutualism_Coculture.yaml | Adds a new curated community record (infant-gut mutualistic cross-feeding pair). |
| kb/communities/Dual_Bacillus_coagulans_Pseudomonas_putida_Lactic_Acid_Coculture.yaml | Adds a new curated community record (sequential inoculation + reinforcement member). |
| kb/communities/DIETsimp_Lignocellulose_to_Methane_DIET_Consortia.yaml | Adds a new curated community record (DIET-based simplified methanogenic consortia). |
| kb/communities/Dehalococcoides_mccartyi_CWV2_Dechlorinating_Consortium.yaml | Adds a new curated community record (dechlorination consortium; syntrophic H2 supply concept). |
| kb/communities/Crucian_Carp_Gut_Disease_Resistance_SynCom.yaml | Adds a new curated community record (fish gut SynCom; pathogen suppression). |
| kb/communities/Corynebacterium_glutamicum_Shewanella_oneidensis_Succinic_Acid_Coculture.yaml | Adds a new curated community record (engineered co-culture for succinate). |
| kb/communities/Composting_SynCom_Lignocellulose_Degradation_Humus.yaml | Adds a new curated community record (SynCom-driven enrichment of native degraders). |
| kb/communities/Butyrivibrio_Selenomonas_Ruminococcus_Lignocellulolytic_Rumen_Consortium.yaml | Adds a new curated community record (rumen consortium; model-predicted compatibility validated in vitro). |
| kb/communities/BSFL_Gut_SynCom_Bacillus_Lactobacillus_Issatchenkia.yaml | Adds a new curated community record (gnotobiotic BSFL gut SynCom). |
| kb/communities/ANME_SRB_Marine_Methane_Seep_Consortium.yaml | Updates an existing record with a cross-reference curation event for consolidation tracking. |
| kb/communities/ANME_SRB_Anaerobic_Methanotrophic_Syntrophic_Consortia.yaml | Adds a new curated record (ANME/SRB syntrophy; redox conduction / DIET mechanism focus). |
| justfile | Adds just recipes for scout-communities and ground-taxa-gtdb. |
| .claude/skills/scout-communities/skill.md | Adds skill documentation for Europe PMC scouting workflow and outputs. |
| .claude/skills/ground-taxa-gtdb/skill.md | Adds skill documentation for rank-aware GTDB grounding using the local kg-microbe mapping. |

"crossfeeding",
"syntrophy",
"syntrophic",
"mutualis",
description: >-
Canonical GTDB CURIE (spaces in the taxon name become underscores),
e.g. GTDB:s__Bacillus_velezensis or GTDB:g__Agrobacterium.
pattern: "^GTDB:[cdfgops]__.+"
Comment on lines +85 to +86
snippet: 'Paraclostridium (p Pseudomonas (p A. hydrophila (p < 0.05) by activating intestinal
immune responses and reinforcing the gut barrier.'
Comment on lines +105 to +106
snippet: 'Paraclostridium (p Pseudomonas (p A. hydrophila (p < 0.05) by activating intestinal
immune responses and reinforcing the gut barrier.'