Enhanced Join Statistics by isaacrobinson2000 · Pull Request #13 · TravisWheelerLab/aurora

isaacrobinson2000 · 2026-06-10T04:24:19Z

Adds enhanced join scoring to aurora, based directly on the annotation data received.

The new technique is a two-pass algorithm.
In the first pass, for every family, we identify all joinable and non-joinable alignments in the same family, using basic thresholds. Once we've done so, we estimate the following distributions per family:

$$\large{ P(C \mid J_f),\quad P(C \mid \neg\ J_f),\quad P(U \mid J_f),\quad P(U \mid \neg\ J_f),\quad P(D \mid J_f),\quad P(D \mid \neg\ J_f), \quad P(J_f),\quad P(\neg\ J_f) }$$

Where:

$J_f$ is the set of joined alignments for a given consensus sequence, or family.
$C$ is the set of consensus distances.
$U$ is the set of unexplained bases between alignments (the same as target distance, but excluding any alignments in between).
$D$ is the set of absolute divergence differences (differences of kimura80 values) between alignments.
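
For reference, the divergence feature $D$ can be sketched as follows. This is a minimal illustration using the standard Kimura two-parameter (K80) distance formula, not aurora's actual code; the function names and the choice of inputs (transition count, transversion count, number of aligned sites) are assumptions made for this example.

```python
import math


def kimura80(transitions: int, transversions: int, aligned_sites: int) -> float:
    """Kimura 2-parameter (K80) distance from substitution counts."""
    if aligned_sites <= 0:
        raise ValueError("aligned_sites must be positive")
    p = transitions / aligned_sites    # proportion of transition differences
    q = transversions / aligned_sites  # proportion of transversion differences
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        # The formula is undefined here (saturated divergence); returning
        # infinity is an arbitrary choice made for this sketch.
        return math.inf
    return -0.5 * math.log(a * math.sqrt(b))


def divergence_difference(div_a: float, div_b: float) -> float:
    """Absolute difference of two K80 divergences (one sample of D)."""
    return abs(div_a - div_b)


# Hypothetical usage: two alignments from the same family.
d = divergence_difference(kimura80(10, 5, 100), kimura80(12, 6, 100))
```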

Note that we don't know which alignments are true joins, but we can use 'joinable' alignments (those that pass basic checks for joining) as an approximation.

Empirically (from inspecting examples), the above distributions are fit well by the following families of distributions:

$P(U | J)$ is exponentially distributed.
$P(U | \neg J)$ is exponentially distributed, with a higher average value.
$P(D | J)$ follows a half-normal distribution; equivalently, the signed (non-absolute) differences are normally distributed with mean 0.
$P(D | \neg J)$ is also half-normally distributed.
$P(C | J)$ is Fréchet distributed with a mode close to 0. Fitting a Fréchet directly with naive methods leads to poor results, so an asymmetric Laplace with the mode fixed at 0 is fit instead, which gives good results.
$P(C | \neg J)$ is the only distribution that does not follow any specific family. Specifically, every family has a different distribution. Some follow a Laplace, some a bump-shaped distribution, others follow more uniform-like distributions, others are bimodal, etc. Currently, a Laplace is fit to all families, but it would be nice to revisit this to find a more general distribution that can better represent the variety of distributions seen.
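
To make the families above concrete, here is a minimal sketch of fitting each one by closed-form maximum likelihood. It is an assumption that maximum likelihood is used at all: the commit log mentions quantile estimators, so the fitting in aurora may well differ. The class and function names, the `EPS` floor, and the two-rate parametrization of the asymmetric Laplace are all choices made for this example.

```python
import math
from dataclasses import dataclass

EPS = 1e-12  # floor to avoid division by zero on degenerate samples


@dataclass
class Exponential:
    rate: float

    def log_pdf(self, x: float) -> float:
        return math.log(self.rate) - self.rate * x if x >= 0 else -math.inf


@dataclass
class HalfNormal:
    sigma: float

    def log_pdf(self, x: float) -> float:
        if x < 0:
            return -math.inf
        return (0.5 * math.log(2.0 / math.pi) - math.log(self.sigma)
                - x * x / (2.0 * self.sigma ** 2))


@dataclass
class AsymmetricLaplace0:
    """Two-sided exponential with its mode fixed at 0."""
    left_rate: float   # decay rate for x < 0
    right_rate: float  # decay rate for x >= 0

    def log_pdf(self, x: float) -> float:
        a, b = self.left_rate, self.right_rate
        log_norm = math.log(a * b / (a + b))
        return log_norm - b * x if x >= 0 else log_norm + a * x


def fit_exponential(samples):
    # MLE: rate is the reciprocal of the sample mean.
    mean = sum(samples) / len(samples)
    return Exponential(rate=1.0 / max(mean, EPS))


def fit_half_normal(samples):
    # MLE: sigma^2 is the mean of the squared samples.
    mean_sq = sum(x * x for x in samples) / len(samples)
    return HalfNormal(sigma=math.sqrt(max(mean_sq, EPS)))


def fit_asymmetric_laplace0(samples):
    # MLE with the mode fixed at 0: both rates follow from the per-sample
    # average mass on the positive side and on the negative side.
    n = len(samples)
    s_pos = max(sum(x for x in samples if x > 0) / n, EPS)
    s_neg = max(sum(-x for x in samples if x < 0) / n, EPS)
    r, s = math.sqrt(s_neg), math.sqrt(s_pos)
    return AsymmetricLaplace0(left_rate=1.0 / (r * (r + s)),
                              right_rate=1.0 / (s * (r + s)))
```

Each fitted object exposes `log_pdf`, so the per-feature likelihoods can be summed in log space when scoring a candidate join.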

Once the first pass is done, we run a second pass over each region where we score possible joins based on the estimated distributions we have. Specifically, given a pair of segments from a family, with $u$ unexplained bases, a consensus distance of $c$, and an absolute divergence difference of $d$:

$$ \large{ P(J_f \mid C = c \cap U = u \cap D = d) \propto P(C = c \mid J_f) \cdot P(U = u \mid J_f) \cdot P(D = d \mid J_f) \cdot P(J_f) } $$

and...

$$ \large{ P(\neg\ J_f \mid C = c \cap U = u \cap D = d) \propto P(C = c \mid\neg\ J_f) \cdot P(U = u \mid\neg\ J_f) \cdot P(D = d \mid\neg\ J_f) \cdot P(\neg\ J_f) } $$

Both follow from Bayes' theorem, under the assumption that the conditional distributions are independent given the join state (probably not strictly true, but close enough, and it keeps the computation tractable). Normalizing the right-hand sides above by their sum (the two cases are exhaustive) gives us the actual probabilities. Once we have the probability of a join, we need to convert it to a score. To do so, we currently use the expected value of the score for the join in aurora, computed as below:

$$\large{ S_{J_f} = \mathbb{E}[S] = S_l \cdot p_{J_f} + S_j \cdot (1 - p_{J_f}) }$$

Where:

$S_{J_f}$ is the transition score for the join.
$S_l$ is the transition score for remaining on the same sequence (query loop score).
$S_j$ is the transition score for jumping to another sequence (query jump score).
$p_{J_f}$ is the probability of the join, or $P(J_f \mid C = c \cap U = u \cap D = d)$
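
The second-pass scoring above can be sketched as follows. This is an illustrative sketch under stated assumptions, not aurora's implementation: the function names are hypothetical, the per-feature log-likelihoods are assumed to come from whatever distributions were fit in the first pass, and working in log space is a choice made here for numerical stability.

```python
import math


def join_probability(log_lik_join, log_lik_not_join, prior_join):
    """Posterior P(J_f | C=c, U=u, D=d) under the independence assumption.

    log_lik_join:     log P(c|J), log P(u|J), log P(d|J)
    log_lik_not_join: the same three terms conditioned on not-J
    prior_join:       P(J_f), strictly between 0 and 1
    """
    if not 0.0 < prior_join < 1.0:
        raise ValueError("prior_join must be strictly between 0 and 1")
    log_j = sum(log_lik_join) + math.log(prior_join)
    log_n = sum(log_lik_not_join) + math.log(1.0 - prior_join)
    m = max(log_j, log_n)
    if m == -math.inf:
        # Both likelihoods are zero; falling back to the prior is an
        # arbitrary choice made for this sketch.
        return prior_join
    # Normalize the two unnormalized posteriors by their sum.
    e_j, e_n = math.exp(log_j - m), math.exp(log_n - m)
    return e_j / (e_j + e_n)


def expected_join_score(p_join, loop_score, jump_score):
    """Expected transition score: S_l * p + S_j * (1 - p)."""
    return loop_score * p_join + jump_score * (1.0 - p_join)


# Hypothetical usage with made-up likelihood values and scores.
p = join_probability(
    [math.log(0.4), math.log(0.3), math.log(0.5)],
    [math.log(0.1), math.log(0.2), math.log(0.3)],
    prior_join=0.2,
)
score = expected_join_score(p, loop_score=-1.0, jump_score=-9.0)
```

Because the score is a convex combination, it always lies between the jump score and the loop score, approaching the loop score as the join probability approaches 1.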

Once joins have been scored, we can use the scores as usual during history tracing. And that's it!

Other changes in this version of aurora include:

Ability to highlight inversions in the aurora visualization.
Enhanced index and table pages in the visualization output (pagination on the table page drastically reduces lag).
Family statistics table page includes new violin plots for showing the distributions discussed above.
Enhanced/more informative error messages for file parsing code.
Slightly better default parameters for runs.
A bunch of important bugfixes.


isaacrobinson2000 added 30 commits April 8, 2026 15:32

Fix warnings, bug in visual...

6b2ee0b

Fix index page html.

67a1100

Further index page fixes.

c2bf764

Gather basic global trace statistics.

32e8d23

New statistics module, incorperating into linking...

e62c280

Fix errors.

decf232

remove unused imports

2a39a85

New target scoring scheme is done, also added unexplained base removal.

6ce0e30

Remove unexplained gaps.

cf319e3

Fix reading of bed files.

4aab08d

Add divergence calculation for segments, to use in linking. Adjust max consensus distance to match what is used in repeat masker

03c24b4

move consensus distance out into it's own function.

8ba72eb

Work on stat extraction from aurora.

75a319a

Work on parameter analysis for joins.

4ce5845

Further exploration of join distributions.

b3fc5e0

Join estimation WIP

0d2bba6

WIP Half t distribution.

0601c8b

Init impl of HalfT, needs testing...

597a260

Start adding tests for distributions.

daa1953

P2 Estimator wip for doing quantile estimators.

a0236ac

Init p2 est done..

aba5104

First refactor with new scoring done...

88c0a9f

Target distance and divergence bayesian scoring working...

a8e923c

Very clear bug in history backtrace.

Bug fix: properly adjust history when join is made.

88dabb8

Tested est for frechet, stinks so temp disabled.

8a3e268

WIP on quantile estimator.

c065af2

New quantile estimator updates done.

9c261d3

More work on quantile estimation.

958b3da

Quantile est working, need to fix merging.

44fde42

Final touches on quantile estimator...

117fcfc

isaacrobinson2000 added 9 commits May 27, 2026 18:08

Better fretchet fitting, need to test..

1a3b165

Test gumbel...

2f7c9f5

Normalize by consensus.

52dc778

New version working...

bce495e

Overhaul table viz to work better with very large runs.

22654e4

Clean up implementation...

5e40974

Enhance statistics displayed on family page.

06923ff

Parameter tuning...

d678818

Better error handling.

9175b6a

isaacrobinson2000 assigned jackroddy Jun 10, 2026

isaacrobinson2000 wants to merge 39 commits into dev from update_to_latest
