Question

Finding mutational burden in TCGA dataset

3.1 years ago

Jamie Watson ▴ 10

I have a list of SNVs from a TCGA dataset and I want to estimate the mutation burden of variants mapped to the 5'UTR. For each SNV, I take a local window of 10 kb: 5 kb upstream and 5 kb downstream of the variant. Within this window I randomly move the variant to a new position, excluding the original one, while preserving the trinucleotide context, i.e. the new position must have the same reference base and the same bases immediately to its left and right. That is one random iteration, and I do 100 such iterations per variant.

For example, if there is a C>G mutation on chromosome 10 at position 20000, I build a window with 5000 bases upstream and 5000 bases downstream of it. In that window I pick a position where the reference base is C and the flanking bases match those at the original position. Repeating this 100 times gives me 100 shuffled positions for that variant.
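To make the shuffling step concrete, here is a minimal sketch of it. It is an illustration only, not the code referred to in this post. It assumes the reference sequence of the local window is already available as a plain string and that the variant is given as a 0-based index into that string; the function name `sample_matched_positions` is hypothetical. As a simplification, it matches the trinucleotide on the given strand only and ignores reverse-complement contexts.

```python
import random


def sample_matched_positions(window_seq, orig_idx, n_iter=100, seed=None):
    """Draw n_iter positions in the window with the same trinucleotide context.

    window_seq : reference sequence of the local window (assumed input)
    orig_idx   : 0-based index of the variant inside window_seq
    Returns a list of 0-based indices, one per iteration (drawn with
    replacement), or an empty list if no other matching position exists.
    """
    rng = random.Random(seed)
    seq = window_seq.upper()
    if not 1 <= orig_idx <= len(seq) - 2:
        raise ValueError("variant needs one flanking base on each side")

    # Trinucleotide context: left neighbour, reference base, right neighbour.
    context = seq[orig_idx - 1:orig_idx + 2]

    # Every other position in the window that carries the same context.
    candidates = [
        i for i in range(1, len(seq) - 1)
        if i != orig_idx and seq[i - 1:i + 2] == context
    ]
    if not candidates:
        return []

    # One independent random draw per iteration.
    return [rng.choice(candidates) for _ in range(n_iter)]
```

For a real variant, `window_seq` would be the 10,001 bases centred on the variant, so `orig_idx` would be 5000; adding the window start coordinate converts the returned indices back to genomic positions.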

I already have code that does the steps described above.

Now, in order to calculate a p-value, I want to find the fraction of variants that fall at the original positions compared with the shuffled positions in the local window. How can I achieve that? Any insights would be appreciated.
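One common way to turn shuffles like these into a p-value is an empirical (permutation) p-value: count how many variants land in the 5'UTR at their original positions, count the same thing for each shuffled iteration, and ask how often a shuffled count is at least as large as the observed one. The sketch below assumes that this is the comparison intended here. The function names and the representation of 5'UTR regions as `(start, end)` tuples with inclusive coordinates are hypothetical.

```python
def count_in_regions(positions, regions):
    """Count positions that fall inside any (start, end) interval, inclusive."""
    return sum(
        1 for pos in positions
        if any(start <= pos <= end for start, end in regions)
    )


def empirical_p_value(observed_count, shuffled_counts):
    """One-sided empirical p-value for enrichment.

    observed_count  : number of original variants inside the regions
    shuffled_counts : one count per random iteration (e.g. 100 values)
    The +1 in numerator and denominator keeps the estimate above zero
    when no shuffled count reaches the observed one.
    """
    n_iter = len(shuffled_counts)
    if n_iter == 0:
        raise ValueError("need at least one shuffled iteration")
    n_extreme = sum(1 for c in shuffled_counts if c >= observed_count)
    return (n_extreme + 1) / (n_iter + 1)
```

With 100 iterations the smallest p-value this can report is 1/101, so more iterations would be needed to resolve smaller values.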

genome python UTR tcga cancer • 461 views
