Community scouting + GTDB grounding + computational-provenance; 29 new community records #180
Open
realmarcin wants to merge 6 commits into
Lifts the cultivation hardware/mode slice added in PR #171 into a new METPO proposal cohort: CultivationModeEnum (10 leaves), CultivationSystemEnum (13 minted leaves + BIOREACTOR_UNSPECIFIED mapped to existing OBI:0001046), under a new "microbial community cultivation setup" domain class, plus 3 cultivation object properties.

- Classes: METPO:1008100-1008132 (above the v1 high-water mark 1008013; no overlap)
- Predicates: METPO:2008100-2008102 (above 2008002)
- Subset: metpo_communitymech_2026_06
- Term metadata reused from vocab/cultivation_terms.yaml

Verified: ROBOT column counts (11/12), full enum coverage, parent integrity, no duplicate or v1-overlapping IDs. METPO submission is downstream/external.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
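The no-overlap check on the minted ID ranges can be reproduced mechanically. The following is a minimal sketch, assuming IDs are plain `METPO:<integer>` CURIEs; the function name `check_new_ids` is hypothetical and is not part of the repository's tooling.

```python
def check_new_ids(new_ids, high_water_mark):
    """Return a list of problems: duplicate IDs, or IDs at/below the v1 high-water mark."""
    problems = []
    seen = set()
    for curie in new_ids:
        _prefix, _, local = curie.partition(":")
        number = int(local)
        if curie in seen:
            problems.append(f"duplicate: {curie}")
        seen.add(curie)
        if number <= high_water_mark:
            problems.append(f"overlaps v1 range: {curie}")
    return problems


# Class range and v1 high-water mark as quoted in the commit message.
class_ids = [f"METPO:{n}" for n in range(1008100, 1008133)]
print(check_new_ids(class_ids, 1008013))  # []
```

The same call with the predicate range (`2008100` to `2008102`, mark `2008002`) would cover the object properties.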
…schema

Discovery + grounding + provenance tooling, with one pilot record exercising all three.

Skills & scripts:
- scout-communities: Europe PMC discovery of newly published communities, deduped vs kb/communities (cited PMID/DOI + name overlap), community-signal scoring, and a loop-until-dry multi-angle sweep (scripts/scout_communities.py, scout_loop.py).
- ground-taxa-gtdb: GTDB grounding from the local kg-microbe NCBI2GTDB mapping; resolves NCBITaxon id/name/community to GTDB CURIE+lineage+confidence, flags reclassifications (scripts/gtdb_ground.py).

Schema (regenerated datamodel):
- ComputationalProvenance on EvidenceItem: queryable tool/version/model/medium provenance for model-derived (COMPUTATIONAL) evidence + ComputationalPredictionTypeEnum.
- GtdbClassification on TaxonDescriptor: gtdb_id/taxon/lineage/ncbi_source_id/majority_fraction/is_reclassified/mapping_source (gtdb_id is a pattern-checked string, not an OAK-bound Term, so id-label validation ignores it).

Pilot record CommunityMech:000272 (SynCom Y, A. deltae + B. velezensis biofilm co-culture): NCBITaxon + GTDB grounding on both taxa (A. deltae -> GTDB Agrobacterium leguminum, reclassified), cross-feeding interactions with computational_provenance (CarveMe / COBRA Toolbox v3.0 / FBA, LBGM, SRA genomes).

Validated: linkml-validate + validate-terms on the record, validate-all (286 records), 197 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
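The dedup and scoring steps named for scout-communities can be sketched roughly as below. This is an illustration only: the term list, the Jaccard name-overlap measure, the `0.8` threshold, and all function names are assumptions, not the behaviour of `scripts/scout_communities.py`.

```python
import re

# Illustrative subset of community-signal terms (assumed, not the script's full list).
SIGNAL_TERMS = ("crossfeeding", "syntrophy", "syntrophic", "mutualis")


def signal_score(text):
    """Count how many distinct signal terms occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in SIGNAL_TERMS if term in lowered)


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())


def name_overlap(a, b):
    """Jaccard overlap of whitespace-separated, lower-cased name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_duplicate(lead, known_pmids, known_dois, known_names, name_threshold=0.8):
    """A lead is a duplicate if its PMID or DOI is already cited, or its name closely matches."""
    if lead.get("pmid") and lead["pmid"] in known_pmids:
        return True
    if lead.get("doi") and normalize_doi(lead["doi"]) in known_dois:
        return True
    return any(name_overlap(lead.get("name", ""), n) >= name_threshold for n in known_names)
```

Checking identifiers before names keeps the cheap exact match first and leaves the fuzzier name comparison as a fallback for leads that share no citation with an existing record.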
…ty handling

The kg-microbe NCBI2GTDB table is keyed on strain/genome-level NCBI ids, so species-level ids (e.g. Shewanella oneidensis NCBITaxon:70863) miss on exact-id lookup even though the species is in GTDB. Fall back to NCBI-species-name matching; when GTDB splits one NCBI species into several (e.g. Bacillus cereus -> Bacillus_A cereus/anthracis/thuringiensis...), report AMBIGUOUS and emit no forced grounding rather than guessing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
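The lookup order described in this commit (exact id, then species name, then AMBIGUOUS on a split) might look like the sketch below. The row keys (`ncbi_id`, `ncbi_species`, `gtdb_species`), the status strings other than AMBIGUOUS, and the strain-level ids in the toy rows are placeholders chosen for illustration; the real NCBI2GTDB column layout is not shown in this PR.

```python
def ground_species(ncbi_id, ncbi_name, rows):
    """Resolve an NCBI species to a GTDB species CURIE, or report why not."""
    # 1. Exact-id lookup (hits only when the id is a strain/genome-level key).
    candidates = {r["gtdb_species"] for r in rows if r["ncbi_id"] == ncbi_id}
    # 2. Fallback: match on the NCBI species name.
    if not candidates:
        candidates = {r["gtdb_species"] for r in rows if r["ncbi_species"] == ncbi_name}
    if not candidates:
        return {"status": "NOT_FOUND", "gtdb_id": None}
    # 3. GTDB split one NCBI species into several: do not guess.
    if len(candidates) > 1:
        return {"status": "AMBIGUOUS", "gtdb_id": None, "candidates": sorted(candidates)}
    (species,) = candidates
    return {"status": "GROUNDED", "gtdb_id": "GTDB:s__" + species.replace(" ", "_")}


# Toy rows; the strain-level ids are placeholders, not real NCBITaxon ids.
ROWS = [
    {"ncbi_id": "strain-1", "ncbi_species": "Shewanella oneidensis", "gtdb_species": "Shewanella oneidensis"},
    {"ncbi_id": "strain-2", "ncbi_species": "Bacillus cereus", "gtdb_species": "Bacillus_A cereus"},
    {"ncbi_id": "strain-3", "ncbi_species": "Bacillus cereus", "gtdb_species": "Bacillus_A anthracis"},
    {"ncbi_id": "strain-4", "ncbi_species": "Bacillus cereus", "gtdb_species": "Bacillus_A thuringiensis"},
]

print(ground_species("NCBITaxon:70863", "Shewanella oneidensis", ROWS)["gtdb_id"])
# GTDB:s__Shewanella_oneidensis
```

Returning the sorted candidate list alongside AMBIGUOUS leaves the final choice to a curator instead of forcing one.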
Batch-minted from the scout-communities discovery sweep: 40 score>=4 leads triaged to 30 curatable communities (6 reviews, 4 too-vague, 1 methods paper dropped), minus 1 exact-paper duplicate of the pilot and 3 leads with no citable source.

Each record was drafted by an isolated agent against a shared spec and grounds taxa in NCBITaxon (OAK canonical labels) + GTDB (local kg-microbe NCBI2GTDB mapping) + ENVO, with abstract-verbatim evidence snippets and abstract-supported ecological interactions. GTDB reclassifications are captured where they occur (e.g. Agrobacterium deltae -> Agrobacterium leguminum; Clostridium aceticum -> Clostridium_W aceticum). 11/26 carry species-level GTDB grounding; the rest have genus-rank, eukaryotic, or GTDB-ambiguous taxa (genus-level grounding is a pending follow-up).

Spans bioremediation, syntrophy/electrosynthesis, lignocellulose, and gut/diet consortia. IDs 000273/000278/000292/000302 intentionally skipped (dup + 3 no-source leads).

Validated: validate-all (all records), per-record validate + validate-terms, 227-test suite. No duplicate papers across the KB or within the batch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ground taxa at the rank of the input, not species-only:
- Species (binomial) -> GTDB:s__ via NCBI id then species-name fallback; GTDB species split -> AMBIGUOUS.
- Genus/family/order/... (single-name label) -> GTDB:g__/f__/o__: aggregate the GTDB rank column over all genomes under the NCBI taxon and ground to the GTDB taxon holding a >=50% genome majority, else AMBIGUOUS. mapping_source records the rank and the number of GTDB taxa under the NCBI taxon (e.g. NCBI genus Bacillus -> g__Bacillus at 0.51, 102 GTDB genera; Shewanella/Bifidobacterium ~1.0).

Corrects the earlier misconception that genus rank is "not in GTDB"; GTDB is a full hierarchy, and the tool just hadn't grounded above species.

New --apply mode writes gtdb_classification into a community file via add-only text edits (no YAML round-trip, so plain-scalar wrapping and everything else is byte-for-byte unchanged; skips taxa already grounded and AMBIGUOUS ones). Backfilled 24 genus-level blocks across 12 batch records (274-301). GTDB is bacteria/archaea only, so eukaryotic members still never ground.

Validated: per-record validate + validate-terms on all changed files, 227 tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
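The genome-majority rule for ranks above species can be illustrated as follows. This is a sketch under stated assumptions: the input is one GTDB taxon name per genome under the NCBI taxon, the function name and result keys are hypothetical (only `majority_fraction` echoes a schema field), and the genome counts in the example are invented to land on 0.51.

```python
from collections import Counter


def ground_by_majority(gtdb_taxa, rank_prefix, threshold=0.5):
    """Ground to the GTDB taxon holding at least `threshold` of the genomes, else AMBIGUOUS."""
    counts = Counter(gtdb_taxa)
    if not counts:
        return {"status": "NOT_FOUND"}
    taxon, n = counts.most_common(1)[0]
    fraction = n / sum(counts.values())
    if fraction < threshold:
        return {"status": "AMBIGUOUS", "n_gtdb_taxa": len(counts)}
    return {
        "status": "GROUNDED",
        "gtdb_id": f"GTDB:{rank_prefix}__{taxon.replace(' ', '_')}",
        "majority_fraction": round(fraction, 2),
        "n_gtdb_taxa": len(counts),
    }


# Invented toy counts: 51 of 100 genomes fall in GTDB genus Bacillus.
genomes = ["Bacillus"] * 51 + ["Bacillus_A"] * 30 + ["Bacillus_B"] * 19
print(ground_by_majority(genomes, "g"))
# {'status': 'GROUNDED', 'gtdb_id': 'GTDB:g__Bacillus', 'majority_fraction': 0.51, 'n_gtdb_taxa': 3}
```

With a `>=50%` threshold an exact two-way tie is possible; this sketch resolves it to whichever taxon was seen first, which the real tool may well handle differently.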
…sortia

Task 4 (no-source leads): the 3 leads dropped earlier for lacking a citable id were not false positives: all are real 2025 SynCom papers indexed in Europe PMC only via the AGR source (no PMID). Recovered their DOIs via CrossRef and curated them into the reserved gap ids:
- CommunityMech:000278 SynCom BsBv cigar-tobacco leaf fermentation (B. safensis + B. velezensis), doi:10.1016/j.indcrop.2025.122621
- CommunityMech:000292 Pseudomonas + Rahnella Artemisia argyi phytoremediation SynCom, doi:10.1016/j.indcrop.2025.122518
- CommunityMech:000302 MetG2 rhizobacteria SynCom for sugarcane stress resilience, doi:10.1016/j.rhisph.2025.101142

Each grounds taxa in NCBITaxon + GTDB (genus/species as named) + ENVO with verbatim DOI-abstract evidence.

Task 3 (dedup/curation): added CROSS_REFERENCE curation events linking CommunityMech:000134 (canonical natural ANME-SRB seep consortium) and CommunityMech:000276 (redox-conduction/DIET study of the same syntrophy) as candidates for future consolidation.

Task 2 (thin stubs 000274, 000285): deep-research enrichment attempted; both source papers are paywalled with no OA full text and no accessible source names their members, so no taxa were added (no fabrication). Left as-is.

Validated: per-record validate + validate-terms on all changed files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pull request overview
This PR expands CommunityMech’s curation workflow and schema to support (1) automated literature scouting for new microbial communities, (2) GTDB grounding alongside existing NCBITaxon grounding, and (3) structured computational provenance for model-derived evidence—then adds a batch of new community records and cached reference artifacts produced by that pipeline.
Changes:
- Extend the LinkML schema with `ComputationalProvenance` (plus `ComputationalPredictionTypeEnum`) and `GtdbClassification` for queryable computational evidence + GTDB grounding.
- Add Europe PMC scouting scripts (`scout_communities.py`, `scout_loop.py`) and corresponding `just` recipes + Claude skill docs.
- Add a METPO cultivation cohort proposal and many new/updated KB community records + reference-cache artifacts.
Reviewed changes
Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/communitymech/schema/communitymech.yaml | Adds schema structures for computational provenance on evidence items and GTDB grounding on taxon descriptors. |
| scripts/scout_loop.py | Adds a “loop-until-dry” runner that repeatedly scouts Europe PMC across multiple query angles with global deduping and consolidated output. |
| scripts/scout_communities.py | Adds the core Europe PMC scouting + scoring + deduping implementation and optional stub emission. |
| references_cache/PMID_42125264.md | Adds cached abstract-only reference content for a newly curated PMID used by new community records. |
| references_cache/epmc_chem_PMC8072228.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC5373366.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC4984294.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC4510191.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC242454.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| references_cache/epmc_chem_PMC1287685.txt | Adds extracted chemical-term cache content associated with a referenced PMC article. |
| proposals/metpo_communitymech_cultivation_v1/proposal.md | Adds narrative proposal for lifting cultivation setup/mode/system terms into METPO. |
| proposals/metpo_communitymech_cultivation_v1/metpo_proposal_properties_robot.tsv | Adds ROBOT template rows for proposed cultivation-related object properties. |
| proposals/metpo_communitymech_cultivation_v1/metpo_proposal_classes_robot.tsv | Adds ROBOT template rows for proposed cultivation-related classes. |
| kb/communities/Thermophilic_Lignocellulose_Composting_SynCom_Biosanitization.yaml | Adds a new curated community record (composting SynCom; includes GTDB grounding where available). |
| kb/communities/SynCom_Y_Agrobacterium_Bacillus_Biofilm_Biocontrol_Coculture.yaml | Adds a new curated community record; exercises computational_provenance for model-predicted cross-feeding evidence. |
| kb/communities/SynCom_Pseudomonas_Rahnella_Artemisia_Phytoremediation.yaml | Adds a new curated community record (rhizosphere SynCom; GTDB genus-level grounding). |
| kb/communities/SynCom_MetG2_Rhizobacteria_Sugarcane_Stress_Resilience.yaml | Adds a new curated community record (plant host + phylum-level microbial grounding). |
| kb/communities/SynCom_Chlorella_sorokiniana_Biogas_Slurry_Coupling_System.yaml | Adds a new curated community record (SynCom–microalga coupling system). |
| kb/communities/SynCom_BsBv_Cigar_Tobacco_Leaf_Fermentation.yaml | Adds a new curated community record (two-member Bacillus SynCom for fermentation). |
| kb/communities/Shewanella_Pseudomonas_Fe0_Electrosyntrophic_Denitrifying_Consortium.yaml | Adds a new curated community record (Fe0-driven electro-syntrophy denitrifying pair). |
| kb/communities/Shewanella_oneidensis_Rhodopseudomonas_palustris_Electrosyntrophic_Coculture.yaml | Adds a new curated community record (engineered electro-syntrophic co-culture). |
| kb/communities/Pseudomonas_stutzeri_Rhodococcus_Naphthalene_Biochar_Engineered_Consortium.yaml | Adds a new curated community record (biochar-bridged engineered bioremediation consortium). |
| kb/communities/Pseudomonas_putida_PpTE_Rhodococcus_RDK17_Terephthalic_Acid_Consortium.yaml | Adds a new curated community record (TPA-fed coculture with competition mechanisms). |
| kb/communities/Pleuromutilin_Degrading_Artificial_Consortium_5_Strain.yaml | Adds a new curated community record (5-strain antibiotic-degrading consortium; GTDB grounding for subset). |
| kb/communities/Pinus_armandii_Endophytic_Biocontrol_SynCom.yaml | Adds a new curated community record (endophytic biocontrol SynCom; GTDB grounding where available). |
| kb/communities/Phosphitivorax_Methanoculleus_Lithosyntrophy_Phosphite_Coculture.yaml | Adds a new curated community record (lithosyntrophy phosphite-oxidizing syntrophic pair). |
| kb/communities/Parachlorella_Saccharomyces_Mutualistic_Coculture.yaml | Adds a new curated community record (microalga–yeast mutualistic co-culture). |
| kb/communities/Multi_stage_Anaerobic_Digestion_SynCom_YSJ_and_SynCom_J.yaml | Adds a new curated community record (multi-stage AD SynComs; functional-group level). |
| kb/communities/Infant_Gut_Prebiotic_Response_SynCom.yaml | Adds a new curated community record (infant-gut SynCom; prebiotic-driven interaction rewiring). |
| kb/communities/Ensifer_YF2_Sphingobacterium_Y2_Polyethylene_Degrading_Consortium.yaml | Adds a new curated community record (PE degradation division of labor; GTDB grounding where available). |
| kb/communities/Electrosynthetic_Consortia_Shewanella_Clostridium_Acetobacterium_Acetate.yaml | Adds a new curated community record (defined MES consortia; multiple IET modes). |
| kb/communities/Electrostimulated_Mixotrophic_VFA_Producing_Enrichment_Consortium.yaml | Adds a new curated community record (electrostimulated enrichment; GTDB reclassification captured). |
| kb/communities/Ecoli_Bifidobacterium_bifidum_Infant_gut_HMO_Mutualism_Coculture.yaml | Adds a new curated community record (infant-gut mutualistic cross-feeding pair). |
| kb/communities/Dual_Bacillus_coagulans_Pseudomonas_putida_Lactic_Acid_Coculture.yaml | Adds a new curated community record (sequential inoculation + reinforcement member). |
| kb/communities/DIETsimp_Lignocellulose_to_Methane_DIET_Consortia.yaml | Adds a new curated community record (DIET-based simplified methanogenic consortia). |
| kb/communities/Dehalococcoides_mccartyi_CWV2_Dechlorinating_Consortium.yaml | Adds a new curated community record (dechlorination consortium; syntrophic H2 supply concept). |
| kb/communities/Crucian_Carp_Gut_Disease_Resistance_SynCom.yaml | Adds a new curated community record (fish gut SynCom; pathogen suppression). |
| kb/communities/Corynebacterium_glutamicum_Shewanella_oneidensis_Succinic_Acid_Coculture.yaml | Adds a new curated community record (engineered co-culture for succinate). |
| kb/communities/Composting_SynCom_Lignocellulose_Degradation_Humus.yaml | Adds a new curated community record (SynCom-driven enrichment of native degraders). |
| kb/communities/Butyrivibrio_Selenomonas_Ruminococcus_Lignocellulolytic_Rumen_Consortium.yaml | Adds a new curated community record (rumen consortium; model-predicted compatibility validated in vitro). |
| kb/communities/BSFL_Gut_SynCom_Bacillus_Lactobacillus_Issatchenkia.yaml | Adds a new curated community record (gnotobiotic BSFL gut SynCom). |
| kb/communities/ANME_SRB_Marine_Methane_Seep_Consortium.yaml | Updates an existing record with a cross-reference curation event for consolidation tracking. |
| kb/communities/ANME_SRB_Anaerobic_Methanotrophic_Syntrophic_Consortia.yaml | Adds a new curated record (ANME/SRB syntrophy; redox conduction / DIET mechanism focus). |
| justfile | Adds just recipes for scout-communities and ground-taxa-gtdb. |
| .claude/skills/scout-communities/skill.md | Adds skill documentation for Europe PMC scouting workflow and outputs. |
| .claude/skills/ground-taxa-gtdb/skill.md | Adds skill documentation for rank-aware GTDB grounding using the local kg-microbe mapping. |
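The "loop-until-dry" behaviour summarised for `scripts/scout_loop.py` in the table above can be pictured with the sketch below. It is an assumption-laden illustration: `search` is a hypothetical injected callable standing in for a Europe PMC query (taking a query angle and a round number), leads are assumed to carry an `id` key, and `max_rounds` is an invented safety cap.

```python
def scout_until_dry(angles, search, max_rounds=10):
    """Cycle through query angles until a full round yields no unseen leads."""
    seen = set()   # global dedup across all angles and rounds
    leads = []     # consolidated output, in discovery order
    for round_no in range(1, max_rounds + 1):
        new_this_round = 0
        for angle in angles:
            for lead in search(angle, round_no):
                if lead["id"] in seen:
                    continue
                seen.add(lead["id"])
                leads.append(lead)
                new_this_round += 1
        if new_this_round == 0:  # the sweep has run dry
            break
    return leads


# Stand-in for the real search: canned results keyed by (angle, round).
PAGES = {
    ("syntrophy", 1): [{"id": "a"}, {"id": "b"}],
    ("syncom", 1): [{"id": "b"}, {"id": "c"}],
    ("syntrophy", 2): [{"id": "c"}],
}


def fake_search(angle, round_no):
    return PAGES.get((angle, round_no), [])


print([lead["id"] for lead in scout_until_dry(["syntrophy", "syncom"], fake_search)])
# ['a', 'b', 'c']
```

Passing the search function in as a parameter keeps the stopping rule testable without network access.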
| "crossfeeding", | ||
| "syntrophy", | ||
| "syntrophic", | ||
| "mutualis", |
| description: >- | ||
| Canonical GTDB CURIE (spaces in the taxon name become underscores), | ||
| e.g. GTDB:s__Bacillus_velezensis or GTDB:g__Agrobacterium. | ||
| pattern: "^GTDB:[cdfgops]__.+" |
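For illustration, the CURIE convention and the `pattern` in this hunk can be exercised with Python's `re` module. The helper name `to_gtdb_curie` is hypothetical, and this sketch does not claim to reproduce how LinkML itself applies the pattern during validation.

```python
import re

# Pattern copied from the schema hunk above.
GTDB_ID_PATTERN = re.compile(r"^GTDB:[cdfgops]__.+")


def to_gtdb_curie(rank_prefix, taxon_name):
    """Build a GTDB CURIE; spaces in the taxon name become underscores."""
    return f"GTDB:{rank_prefix}__{taxon_name.replace(' ', '_')}"


curie = to_gtdb_curie("s", "Bacillus velezensis")
print(curie)                               # GTDB:s__Bacillus_velezensis
print(bool(GTDB_ID_PATTERN.match(curie)))  # True
print(bool(GTDB_ID_PATTERN.match("GTDB:x__Foo")))  # False
```

The character class accepts the seven GTDB rank prefixes (`d`, `p`, `c`, `o`, `f`, `g`, `s`), and `.+` rejects a CURIE with an empty taxon name.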
Comment on lines +85 to +86
| snippet: 'Paraclostridium (p Pseudomonas (p A. hydrophila (p < 0.05) by activating intestinal | ||
| immune responses and reinforcing the gut barrier.' |
Comment on lines +105 to +106
| snippet: 'Paraclostridium (p Pseudomonas (p A. hydrophila (p < 0.05) by activating intestinal | ||
| immune responses and reinforcing the gut barrier.' |
Summary
Adds a literature-scouting → curation pipeline, GTDB grounding of taxa (from the local kg-microbe NCBI↔GTDB mapping), a computational-provenance schema structure, and 29 new evidence-backed community records produced by that pipeline.
Schema (regenerated datamodel)
- `ComputationalProvenance` on `EvidenceItem`: queryable tool/version/model/medium provenance for model-derived (COMPUTATIONAL) evidence (+ `ComputationalPredictionTypeEnum`).
- `GtdbClassification` on `TaxonDescriptor`: `gtdb_id`/`taxon`/`lineage`/`ncbi_source_id`/`majority_fraction`/`is_reclassified`/`mapping_source`. `gtdb_id` is a pattern-checked string, not an OAK-bound term, so id↔label validation ignores it.

Skills & scripts
- `scout-communities` (`scripts/scout_communities.py`, `scout_loop.py`): Europe PMC discovery of newly published communities, deduped vs `kb/communities/`, community-signal scoring, loop-until-dry sweep.
- `ground-taxa-gtdb` (`scripts/gtdb_ground.py`): rank-aware GTDB grounding. Species → `s__` (id then name fallback); genus/family/order → `g__`/`f__`/`o__` by genome-weighted majority (≥50%), else AMBIGUOUS. `--apply` writes blocks via add-only text edits. Sources `data/raw/NCBI2GTDB.tsv.gz` from a local kg-microbe checkout.

Records: 29 new (`CommunityMech:000272`, `000274`–`000302`)
From a scout sweep: 40 score≥4 leads → triaged to 30 curatable communities → minus 1 exact-paper dup → plus 3 leads whose DOIs were recovered via CrossRef. Each grounds taxa in NCBITaxon (OAK) + GTDB + ENVO with verbatim evidence snippets and abstract-supported interactions. GTDB reclassifications captured (e.g. A. deltae → A. leguminum, Enterococcus → Enterococcus_B, C. aceticum → Clostridium_W aceticum). Spans bioremediation, syntrophy/electrosynthesis, lignocellulose, and gut/diet consortia.
- `000273` intentionally skipped (dup).
- `000134` ↔ `000276` cross-referenced via `CROSS_REFERENCE` curation events as consolidation candidates.

Validation
`validate-all` (all records), `validate-terms-all`, per-record `validate` + `validate-terms`, and the 227-test suite all pass. No duplicate papers across the KB or within the batch.

Known follow-ups
- `000274` and `000285` are thin (members named only as functional groups); their source papers are closed-access (Unpaywall confirms no OA), so enrichment needs full-text access.
- … `mapping_source`.

🤖 Generated with Claude Code