Enhanced Join Statistics #13

Open
isaacrobinson2000 wants to merge 39 commits into dev from update_to_latest

@isaacrobinson2000 isaacrobinson2000 commented Jun 10, 2026

Adds enhanced join scoring to aurora, based directly on the annotation data received.

The new technique involves a 2-pass algorithm.
In the first pass, for every family, we identify all joinable and non-joinable alignments in the same family, using basic thresholds. Once we've done so, we estimate the following distributions per family:

$$\large{ P(C \mid J_f),\quad P(C \mid \neg\ J_f),\quad P(U \mid J_f),\quad P(U \mid \neg\ J_f),\quad P(D \mid J_f),\quad P(D \mid \neg\ J_f), \quad P(J_f),\quad P(\neg\ J_f) }$$

Where:

  • $J_f$ is the set of joined alignments for a given consensus sequence, or family.
  • $C$ is the set of consensus distances.
  • $U$ is the set of unexplained bases between alignments (the same as target distance, but excluding alignments in between).
  • $D$ is the set of absolute divergence differences (differences of kimura80 values) between alignments.

Note that we don't know which alignments are true joins, but we can use 'joinable' alignments (those that pass basic checks for joining) as an approximation.
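
The first-pass split can be sketched roughly as below. This is a minimal illustration and not aurora's actual code: the `Pair` record, the threshold names and values, and the use of the absolute consensus distance are all assumptions made for the example. It labels each candidate pair as joinable or not and uses the joinable fraction as a stand-in for $P(J_f)$.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    """Hypothetical record for one candidate pair of alignments in a family."""
    consensus_dist: float  # C: distance between the pair in consensus coordinates
    unexplained: float     # U: unexplained bases between the pair
    div_diff: float        # D: absolute kimura80 divergence difference

# Illustrative thresholds only; the real "basic checks" are not specified here.
MAX_CONSENSUS_DIST = 50
MAX_UNEXPLAINED = 10_000

def is_joinable(pair, max_c=MAX_CONSENSUS_DIST, max_u=MAX_UNEXPLAINED):
    """Approximate 'true join' with a simple threshold test (assumption)."""
    return abs(pair.consensus_dist) <= max_c and pair.unexplained <= max_u

def split_and_prior(pairs):
    """Split one family's pairs and estimate P(J_f) as the joinable fraction."""
    joinable = [p for p in pairs if is_joinable(p)]
    not_joinable = [p for p in pairs if not is_joinable(p)]
    # Fall back to an uninformative prior when the family has no pairs.
    prior_join = len(joinable) / len(pairs) if pairs else 0.5
    return joinable, not_joinable, prior_join
```

The two lists returned here would then feed the per-family distribution fits, one fit per variable and per class.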

It's been found experimentally (looking at examples) that the above distributions are fit by the following families of distributions:

  • $P(U | J)$ is exponentially distributed.
  • $P(U | \neg J)$ is exponentially distributed, with a higher average value.
  • $P(D | J)$ follows a half-normal distribution; equivalently, the signed (non-absolute) differences are normally distributed with mean 0.
  • $P(D | \neg J)$ is also half-normally distributed.
  • $P(C | J)$ is Fréchet distributed with a mode close to 0. Fitting a Fréchet directly with naive methods leads to poor results, so an asymmetric Laplace with the mode fixed at 0 is fit instead, which gives good results.
  • $P(C | \neg J)$ is the only distribution that does not follow any specific family: its shape differs from one repeat family to the next. Some follow a Laplace, some a bump-shaped distribution, others a more uniform-like distribution, others are bimodal, etc. Currently, a Laplace is fit to all families, but it would be nice to revisit this to find a more general distribution that can better represent the variety of distributions seen.
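
For reference, the fits listed above all have closed-form maximum-likelihood estimates, sketched below with the standard library only. This is an illustrative sketch, not the PR's implementation: the function names, the `EPS` floor for degenerate samples, and the particular two-scale parameterization of the asymmetric Laplace (a left scale and a right scale around a mode fixed at 0) are assumptions.

```python
import math
from statistics import median

EPS = 1e-9  # assumed floor so degenerate samples never yield a zero scale

def fit_exponential(xs):
    """MLE rate of an exponential: 1 / sample mean (for U)."""
    return 1.0 / max(sum(xs) / len(xs), EPS)

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def fit_half_normal(xs):
    """MLE sigma of a half-normal: sqrt(mean(x^2)) (for D)."""
    return max(math.sqrt(sum(x * x for x in xs) / len(xs)), EPS)

def half_normal_pdf(x, sigma):
    if x < 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) / sigma * math.exp(-x * x / (2.0 * sigma * sigma))

def fit_asym_laplace_mode0(xs):
    """MLE (left, right) scales of an asymmetric Laplace with mode fixed at 0.

    Density: exp(-|x| / scale) / (left + right), with scale = left for x < 0
    and scale = right for x >= 0 (for C given a join).
    """
    n = len(xs)
    s_neg = sum(-x for x in xs if x < 0)   # total mass left of the mode
    s_pos = sum(x for x in xs if x >= 0)   # total mass right of the mode
    k = (math.sqrt(s_neg) + math.sqrt(s_pos)) / n
    return max(k * math.sqrt(s_neg), EPS), max(k * math.sqrt(s_pos), EPS)

def asym_laplace_pdf(x, left, right):
    scale = left if x < 0 else right
    return math.exp(-abs(x) / scale) / (left + right)

def fit_laplace(xs):
    """MLE of a Laplace: median location, mean absolute deviation scale."""
    loc = median(xs)
    scale = max(sum(abs(x - loc) for x in xs) / len(xs), EPS)
    return loc, scale

def laplace_pdf(x, loc, scale):
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)
```

When the sample is symmetric around 0, the asymmetric Laplace fit reduces to the ordinary Laplace estimate (both scales equal the mean absolute value), which is a quick sanity check on the formulas.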

Once the first pass is done, we run a second pass over each region where we score possible joins based on the estimated distributions we have. Specifically, given a pair of segments from a family, with $u$ unexplained bases, a consensus distance of $c$, and an absolute divergence difference of $d$:

$$ \large{ P(J_f \mid C = c \cap U = u \cap D = d) \propto P(C = c \mid J_f) \cdot P(U = u \mid J_f) \cdot P(D = d \mid J_f) \cdot P(J_f) } $$

and...

$$ \large{ P(\neg\ J_f \mid C = c \cap U = u \cap D = d) \propto P(C = c \mid\neg\ J_f) \cdot P(U = u \mid\neg\ J_f) \cdot P(D = d \mid\neg\ J_f) \cdot P(\neg\ J_f) } $$

Both expressions follow from Bayes' theorem, under the assumption that $C$, $U$, and $D$ are conditionally independent given the join status (probably not strictly true, but close enough, and it keeps the computation tractable). Since $J_f$ and $\neg\ J_f$ cover the full range of possibilities, normalizing each right-hand side by the sum of the two gives the actual probabilities. Once we have the probability of a pair being a join, we need to convert it to a score. To do so, we currently use the expected value of the transition score for the join in aurora, computed as below:

$$\large{ S_{J_f} = \mathbb{E}[S] = S_l \cdot p_{J_f} + S_j \cdot (1 - p_{J_f}) }$$

Where:

  • $S_{J_f}$ is the transition score for the join.
  • $S_l$ is the transition score for remaining on the same sequence (query loop score).
  • $S_j$ is the transition score for jumping to another sequence (query jump score).
  • $p_{J_f}$ is the probability of the join, or $P(J_f \mid C = c \cap U = u \cap D = d)$
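
The second-pass scoring described above can be sketched as follows. This is a hedged illustration: the function names and the tuple-of-likelihoods interface are assumptions, and falling back to the prior when both products underflow to zero is a choice made for this example, not something the PR specifies.

```python
import math

def join_posterior(lik_join, lik_not_join, prior_join):
    """Return P(J_f | C=c, U=u, D=d) from per-variable likelihoods.

    lik_join and lik_not_join are iterables holding the densities of c, u,
    and d under J_f and under not-J_f; prior_join is P(J_f).
    """
    w_join = math.prod(lik_join) * prior_join
    w_not = math.prod(lik_not_join) * (1.0 - prior_join)
    total = w_join + w_not
    if total == 0.0:
        # Both products vanished; fall back to the prior (assumption).
        return prior_join
    return w_join / total

def join_score(p_join, loop_score, jump_score):
    """Expected transition score: S_l * p + S_j * (1 - p)."""
    return loop_score * p_join + jump_score * (1.0 - p_join)

# Example with made-up numbers: a likely join ends up near the loop score.
p = join_posterior((0.2, 0.5, 0.1), (0.1, 0.1, 0.1), prior_join=0.5)
score = join_score(p, loop_score=0.0, jump_score=-8.0)
```

For very small densities, summing log-likelihoods and normalizing in log space would be more robust than multiplying raw values, at the cost of a slightly longer function.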

Once joins have been scored, we can just use the scores normally during history tracing. And that's it!

Other changes in this version of aurora include:

  • Ability to highlight inversions in aurora visualization
  • Enhanced index and table pages in the visualization output (pagination for the table page reduces lag drastically).
  • Family statistics table page includes new violin plots for showing the distributions discussed above.
  • Enhanced/more informative error messages for file parsing code.
  • Slightly better default parameters for runs.
  • A bunch of important bugfixes.

to use in linking. Adjust max consensus distance to match what is used
in repeat masker

2 participants