
paper(medcat): MedCAT 2 paper related scripts and documentation #526

Merged: mart-r merged 123 commits into main from medcat-v2-paper-and-faster-linker-w-faster-gcv on Jun 9, 2026
@mart-r (Collaborator) commented Jun 3, 2026

No description provided.

mart-r added 30 commits November 7, 2025 11:09

@alhendrickson (Collaborator) left a comment

Minor one from me: could you add a README.md to the medcat-v2/paper folder? A super high-level one that basically says to run pip install somewhere, then get the required data following the instructions in each data folder, then run ./run_all_at_once.sh or whatever.

Just something that explains the /paper folder from a dev perspective with this repo open; I get that the paper itself will explain a lot.

@tomolopolis (Member) commented

looks good for paper result reproduction.
Can you confirm:

  • none of this is included in the install of the lib?
  • should any of these script runs be run in a GHA build?

@mart-r (Collaborator, Author) commented Jun 9, 2026

  • none of this is included in the install of the lib?

Good point! As of now, it would have been. But I've added a MANIFEST.in to avoid that.

  • should any of these script runs be run in a GHA build?

That isn't really possible. Neither the models this would need (full models, embedding linker model) nor the data it uses (MIMIC raw data, linking datasets) are available there, and even if they were, the entire thing might take longer than the 6h allocated to job runners.

@tomolopolis (Member) left a comment

yeah agreed - worth adding to the README which version of the lib this was last tested with.
lgtm otherwise

@mart-r mart-r merged commit b8c07a0 into main Jun 9, 2026
23 checks passed
@mart-r mart-r deleted the medcat-v2-paper-and-faster-linker-w-faster-gcv branch June 9, 2026 14:45
tomolopolis pushed a commit that referenced this pull request Jun 10, 2026
* Add initial script to get individual results

* Add script to get overall results (startup, warm, cold)

* Fix default args

* Add master script for getting load speed for multiple models

* Add v1 and v2 (and my local setup) specific scripts for getting load speeds

* Avoid unknown run types

* Add option to specify number of repeats when doing all load experiments for a single model

* Move to a timeit based approach

* Add output folder

* Add automatic json output

* Fix type of save json argument

* Always save results to a file when doing in bulk

* Allow overwriting output prefix if/when required

* Separated speed scripts from (future) performance ones

* Move a bunch of code to a separate module

* Allow for a more general error handling when running subprocesses

* Add a few overarching scripts to run all the speed scripts at once

* Centralise combining of experiments

* Only produce results for run types that are required

* Add uncommitted changes from last commit

* Add modules to get inference speed

* Fix serialisation issue

* Add overall inference speed getter

* Add setup-specific scripts for inference speed

* Allow scripts to actually run

* Fix a small issue (running load speed instead of inference speed)

* Fix some argument issues in bash scripts

* Make names file-name safe

* Change divider type between scripts

* Fix a bash script logic issue

* Read output from last line

* Fix issue with erroneous newlines

* Improve output for getting time from stdout

* Add some more output when doing inference speed

* Fix some comment

* Some whitespace changes

* Fix typo

* Remove unneeded output

* Fix script running for specific version

* Remove unused empty method

* Add script to get unsupervised training speed as well

* Add script to summarise output

* Add script to combine all unsupervised training output for a particular model

* Add scripts to get unsupervised speed overall

* Add folder for inference and unsupervised training output

* Removed empty / old files

* Improve / fix profiling

* Move version specified to common module

* Reset subnames after model load if v2

* Fix typo

* Add subname reset when doing unsupervised training speed

* Add some minor comments

* Add initial regression performance script

* some linting / whitespace fixes

* some further linting / whitespace fixes

* Add out/performance folder

* Add script to get all of regression

* Add conversion script for MDACE

* Add mapping from ICD to Snomed

* Add conversion for distemist dataset

* Add a new stats methodology for multi-optioned datasets

* Minor updates to new stats method

* Update stats to allow project processing with project filters

* Add v1 implementation for missing stuff (hopefully)

* Fix minor import path issues

* Fix problematic dunder call method

* Fix typo in name

* Add performance script for model and dataset(s)

* Remove commented code

* Allow filtering before disamb (optionally)

* Add README for MDACE dataset

* Add README for distemist dataset

* Add conversion script - from linking challenge to trainer export

* Add README for linking challenge data prep

* Add README for COMETA dataset

* Add cometa dataset conversion script

* Add medmentions conversion scripts

* Remove some unneeded code

* Add MedMentions dataset README

* Keep unsupervised data folder

* Add script to get all performance

* Fix performance script

* CU-869b9h7y6: Add faster linker that only links to primary names

* CU-869b9h7y6: Remove debug output

* CU-869b9h7y6: Add proper filtering as well as usage of single-possible CUI options

* CU-869b9h7y6: Add a simple test for the new linker

* Add a few scripts to show possible variance in performance and throughput

* Update script to include embedding linker in it

* Add embedding linker stuff to script

* Start moving towards a better format for variance getting (get 1 output in CSV)

* Remove some echo /debug output

* Add dataset name to output

* Add header to output

* Add run time to output

* Add 1 time embedding linker conversion script

* Some whitespace changes

* Make last line of conversion be the model path

* Convert embedding model once

* Try to redo filtering for embedding linker

* Try to redo filtering for embedding linker (attempt no 2)

* Try to redo filtering for embedding linker (attempt no 3)

* Add a throughput script

* Add throughput to variance calculations

* Revert "Add a throughput script"

This reverts commit c8d80ad.

* Update model paths

* Run performance against 2023 models again

* Add a script to run everything at once

* CU-869cw9zmj: Use faster way to calculate unit vector

* CU-869cw9zmj: Speed up context vector obtaining

* Add variance plotting script

* Revert changes to matutils

* Add Readme for paper

* Add MANIFEST.in to avoid distributing paper specifics

* Add note on which version of the lib this was used on in paper README

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
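
Several of the commits above concern timing: the move to a timeit based approach, the configurable number of repeats, and the startup/warm/cold split. As a rough illustration only, and not the code from this PR, repeated timing with the standard library could look like the following. `time_callable`, `fake_load`, and the summary keys are hypothetical names chosen for this sketch.

```python
import statistics
import timeit


def time_callable(func, repeats=5):
    """Time `func` `repeats` times, with one call per measurement."""
    timings = timeit.repeat(func, number=1, repeat=repeats)
    return {
        "repeats": repeats,
        "min_s": min(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }


def fake_load():
    """Stand-in workload; a real script would load a model pack here."""
    return sum(range(10_000))


if __name__ == "__main__":
    # Prints a small summary dict; actual timings depend on the machine.
    print(time_callable(fake_load, repeats=3))
```

Reporting the minimum alongside the mean is a common choice because the minimum is the measurement least affected by other activity on the machine, while the spread between minimum and maximum hints at run-to-run variance.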