Enhanced Join Statistics #13
Open
isaacrobinson2000 wants to merge 39 commits into
to use in linking. Adjust max consensus distance to match what is used in repeat masker
Very clear bug in history backtrace.
family page.
Adds enhanced join scoring to aurora, based directly on the annotation data received.
The new technique involves a 2-pass algorithm.
In the first pass, for every family, we identify all joinable and not joinable alignments in the same family, using basic thresholds. Once we've done so, we estimate the following distributions per-family:
Where:
Notice, we don't know which alignments are true joins, but we can use 'joinable' alignments (those that pass basic checks for joining) as an approximation.
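As a rough illustration of the first pass, the sketch below splits candidate pairs into 'joinable' and 'not joinable' groups per family and collects the three observed quantities for each group. Everything named here is an assumption made for illustration: the `CandidatePair` record, the two threshold constants and their values, and the choice of checks are hypothetical and are not aurora's actual code. Fitting the distributions to the collected samples is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CandidatePair:
    """Hypothetical record for two same-family alignments that might be joined."""
    family: str
    unexplained_bases: int          # u
    consensus_distance: int         # c
    divergence_difference: float    # d, absolute difference


# Placeholder thresholds standing in for the "basic checks" (not aurora's values).
MAX_CONSENSUS_DISTANCE = 100
MAX_DIVERGENCE_DIFFERENCE = 10.0


def is_joinable(pair: CandidatePair) -> bool:
    """Proxy label: a pair is 'joinable' if it passes the basic thresholds."""
    return (abs(pair.consensus_distance) <= MAX_CONSENSUS_DISTANCE
            and pair.divergence_difference <= MAX_DIVERGENCE_DIFFERENCE)


def first_pass(pairs):
    """Collect (u, c, d) samples per family, split by the joinable proxy label."""
    samples = defaultdict(lambda: {"joinable": [], "not_joinable": []})
    for pair in pairs:
        label = "joinable" if is_joinable(pair) else "not_joinable"
        samples[pair.family][label].append(
            (pair.unexplained_bases,
             pair.consensus_distance,
             pair.divergence_difference)
        )
    return samples
```

The per-family sample lists are what a fitting step would consume to estimate the join and non-join distributions.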
It's been found experimentally (looking at examples) that the above distributions are fit by the following families of distributions:
Once the first pass is done, we run a second pass over each region where we score possible joins based on the estimated distributions we have. Specifically, given a pair of segments from a family, with $u$ unexplained bases, a consensus distance of $c$, and an absolute divergence difference of $d$:
and...
These are both derived from Bayes' theorem, and assume the conditioned distributions can be treated as independent (probably not true, but close enough, and it also keeps the computation tractable). Normalizing the right-hand sides above by their sum (since they cover the full range of possibilities) gives us the actual probabilities. Once we have the probability of it being a join, we need to convert it to a score. To do so, we currently use the expected value of the score for the join in aurora, computed as below:
Where:
Once joins have been scored, we can just use the scores normally during history tracing. And that's it!
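The Bayes normalization step in the second pass can be sketched as follows. This is a minimal sketch, not aurora's implementation: the function name, the use of plain callables for the fitted per-family densities, the explicit `prior_join` parameter, and the fallback when both terms are zero are all assumptions. It covers only the join probability; the conversion from probability to an expected score is not shown.

```python
from typing import Callable, Tuple

# A fitted density or mass function for one quantity (hypothetical type alias).
Density = Callable[[float], float]


def join_probability(
    u: float,
    c: float,
    d: float,
    join_densities: Tuple[Density, Density, Density],
    non_join_densities: Tuple[Density, Density, Density],
    prior_join: float,
) -> float:
    """Posterior probability that a pair is a join, treating u, c and d
    as conditionally independent given the join / non-join label."""
    join_u, join_c, join_d = join_densities
    non_u, non_c, non_d = non_join_densities

    # Unnormalized Bayes terms: prior times the product of the likelihoods.
    join_term = prior_join * join_u(u) * join_c(c) * join_d(d)
    non_term = (1.0 - prior_join) * non_u(u) * non_c(c) * non_d(d)

    total = join_term + non_term
    if total == 0.0:
        # Neither model assigns any mass; falling back to the prior is an assumption.
        return prior_join
    # Dividing by the sum normalizes the two terms into probabilities.
    return join_term / total
```

Because the two terms cover every possibility (join or not a join), dividing by their sum is enough to recover the probability without computing the evidence term separately.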
Other changes to this version of aurora include: