paper(medcat): MedCAT 2 paper related scripts and documentation by mart-r · Pull Request #526 · CogStack/cogstack-nlp

mart-r · 2026-06-03T17:44:24Z

No description provided.

alhendrickson

Minor one from me, could you make a README.md in the medcat-v2/paper folder? Super high level one that basically says run pip install somewhere, then get the required data following each data folder, then ./run_all_at_once.sh or whatever.

Just something that explains /paper folder from a dev perspective with this repo open, I get the paper itself will explain a lot.

tomolopolis · 2026-06-09T09:53:27Z

looks good for paper result reproduction.
Can you confirm:

- none of this is included in the install of the lib?
- should any of these script runs be run in a GHA build?

mart-r · 2026-06-09T10:17:13Z

> none of this is included in the install of the lib?

Good point! As of now, it would have been. But I've added a MANIFEST.in to avoid that.

> should any of these script runs be run in a GHA build?

That isn't really possible. Neither the models this works with (full models, embedding linker model) nor the data it uses (MIMIC raw data, linking datasets) are available to a CI run. Even if they were, the entire thing might take longer than the 6h allocated to job runners.

tomolopolis

yeah agreed - worth adding in the README what version of the lib that this was last tested with.
lgtm otherwise

* Add initial script to get individiual results
* Add script to get overall results (startup, warm, cold)
* Fix default args
* Add master script for getting load speed for multiple models
* Add v1 and v2 (and my localy setup) specific scripts for getting load speeds
* Avoid unknown run types
* Add option to specify number of repeats when doing all load experiments for a single model
* Move to a timeit based approach
* Add output folder
* Add automatic json output
* Fix type of save json argument
* Always save results to a file when doing in bulk
* Allow overwriting output prefix if/when required
* Separated speed scripts from (future) performance ones
* Move a bunch of code to a separate module
* Allow for a more general error handling when running subprocesses
* Add a few overarching scripts to run all the speed scripts at once
* Centralise combining of experiments
* Only produce results for run types that are required
* Add uncommitted changes from last commit
* Add modules to get inference speed
* Fix serialisation issue
* Add overall inference speed getter
* Add setup-specific scripts for inference speed
* Allow scripts to actually run
* Fix a small issue (running load speed instead of inference speed)
* Fix some argument issues in bash scripts
* Make names file-name safe
* Change divider type between scripts
* Fix a bash script logic issue
* Read output from last line
* Fix issue with errorenoushly newlines
* Improve output for getting time from stdout
* Add some more output when doing inference speed
* Fix some comment
* Some whitespace changes
* Fix typo
* Remove unneeded output
* Fix script running for specific version
* Remove unused empty method
* Add script to get unsupervised training speed as well
* Add script to summarise output
* Add script to combine all unsuperivsed training output for a particula model
* Add scripts to get unsupervised speed overall
* Add folder for inference and unsupervised training output
* Removed empty / old files
* Improve / fix profiling
* Move version specified to common module
* Reset subanmes after model load if v2
* Fix typo
* Add subname reset when doing unsupervised training speed
* Add some minor comments
* Add initial regression performance script
* some linting / whitespace fixes
* some further linting / whitespace fixes
* Add out/performance folder
* Add script to get all of regression
* Add conversion script for MDACE
* Add mapping from ICD to Snomed
* Add conversion for distemist dataset
* Add a new stats methodology for multi-optioned datasets
* Minor updates to new stats method
* Update stats to allow projct processing with project filters
* Add v1 implementation for missing stuff (hopefully)
* Fix minor import path issues
* Fix problematic dunder call method
* Fix typo in name
* Add performance script for model and dataset(s)
* Remove commented code
* Allow filtering before disamb (optionally)
* Add README for MDACE dataset
* Add README for distemist dataset
* Add conversion script - from linking challenge to trainer export
* Add README for linking challenge data prep
* Add README for COMETA dataset
* Add cometa dataset conversion script
* Add medmentions conversion scripts
* Remove some unneeded code
* Add MedMentions dataset README
* Keep unsupervised data folder
* Add script to get all performance
* Fix performance script
* CU-869b9h7y6: Add faster linker that only links to primary names
* CU-869b9h7y6: Remove debug output
* CU-869b9h7y6: Add proper filtering as well as usage of single-possible CUI options
* CU-869b9h7y6: Add a simple test for the new linker
* Add a few scripts to show possible variance in performance and throughput
* Update script to include embedding linker in it
* Add embedding linker stuff to script
* Start moving towards a better format for variance getting (get 1 output in CSV)
* Remove some echo /debug output
* Add dataset name to output
* Add header to output
* Add run time to output
* Add 1 time embedding linker conversion script
* Some whitespace changes
* Make last line of conversion be the model path
* Convert embedding model once
* Try to redo filtering for embedding linker
* Try to redo filtering for embedding linker (attempt no 2)
* Try to redo filtering for embedding linker (attempt no 3)
* Add a throughput script
* Add throughput to variance calculations
* Revert "Add a throughput script" This reverts commit c8d80ad.
* Update model paths
* Run performance against 2023 models again
* Add a script to run everything at once
* CU-869cw9zmj: Use faster way to calculate unit vector
* CU-869cw9zmj: Speed up context vector obtaining
* Add variance plotting script
* Revert changes to matutils
* Add Readme for paper
* Add MANIFEST.in to avoid distributing paper specifics
* Add note on which version of the lib this was used on in paper README

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

mart-r added 30 commits November 7, 2025 11:09

Add initial script to get individiual results

fc0e9c7

Add script to get overall results (startup, warm, cold)

f138663

Fix default args

fe04f49

Add master script for getting load speed for multiple models

7ce5e8d

Add v1 and v2 (and my localy setup) specific scripts for getting load speeds

06a513f

Avoid unknown run types

a519fcc

Add option to specify number of repeats when doing all load experiments for a single model

a0df0e6

Move to a timeit based approach

98a49fd

Add output folder

e294df4

Add automatic json output

074e7a1

Fix type of save json argument

a80b780

Always save results to a file when doing in bulk

2b05b70

Allow overwriting output prefix if/when required

39a47b5

Separated speed scripts from (future) performance ones

3ac3d0d

Move a bunch of code to a separate module

2652fd5

Allow for a more general error handling when running subprocesses

d70054d

Add a few overarching scripts to run all the speed scripts at once

7f2ff90

Centralise combining of experiments

45f03a6

Only produce results for run types that are required

9d2ee46

Add uncommitted changes from last commit

6c0164f

Add modules to get inference speed

2390d2b

Fix serialisation issue

fc7b065

Add overall inference speed getter

21ecc3d

Add setup-specific scripts for inference speed

48cf5f3

Allow scripts to actually run

a9bd75f

Fix a small issue (running load speed instead of inference speed)

b5283a2

Fix some argument issues in bash scripts

340f4fe

Make names file-name safe

b96a879

Change divider type between scripts

546a7d8

Fix a bash script logic issue

25a1fbc

mart-r and others added 20 commits November 30, 2025 09:09

Convert embedding model once

1d63dae

Merge branch 'main' into medcat-v2-paper-and-faster-linker

279e59e

Try to redo filtering for embedding linker

040af18

Try to redo filtering for embedding linker (attempt no 2)

540caaa

Try to redo filtering for embedding linker (attempt no 3)

d336bfe

Merge branch 'main' into medcat-v2-paper-and-faster-linker

a0fc0d8

Merge branch 'main' into medcat-v2-paper-and-faster-linker

8f48e50

Add a throughput script

c8d80ad

Add throughput to variance calculations

56c6cfb

Revert "Add a throughput script"

93daa30

This reverts commit c8d80ad.

Merge branch 'main' into medcat-v2-paper-and-faster-linker

2908e69

Merge branch 'main' into medcat-v2-paper-and-faster-linker

3e3c018

Update model paths

7663949

Run performance against 2023 models again

fcfd725

Add a script to run everything at once

ae74fce

CU-869cw9zmj: Use faster way to calculate unit vector

1db6f6a

CU-869cw9zmj: Speed up context vector obtaining

887a180

Merge branch 'main' into medcat-v2-paper-and-faster-linker-w-faster-gcv

de3c25e

Add variance plotting script

7972292

Revert changes to matutils

d9a74fb

alhendrickson reviewed Jun 8, 2026

Add Readme for paper

d2f572e

Add MANIFEST.in to avoid distributing paper specifics

8dc03bc

tomolopolis approved these changes Jun 9, 2026

Add note on which version of the lib this was used on in paper README

7d546cb

mart-r merged commit b8c07a0 into main Jun 9, 2026
23 checks passed

mart-r deleted the medcat-v2-paper-and-faster-linker-w-faster-gcv branch June 9, 2026 14:45

mart-r merged 123 commits into main from medcat-v2-paper-and-faster-linker-w-faster-gcv
