Question

How do you remove coverage outliers in sequencing count data?

0

Entering edit mode

8.0 years ago

biohack92 ▴ 170

I've recently looked at methylation coverage (bisulfite seq data) in IGV, and there are obvious coverage outliers which I'm interpreting as mismappings of repetitive sequences. How do you identify these outliers from methylation calls/count data (I don't have BAM/SAM files) and remove them?

bisulfite sequencing qc R quality control • 2.9k views

ADD COMMENT • link updated 8.0 years ago by dariober 14k • written 8.0 years ago by biohack92 ▴ 170

score 3 · Answer 1 · 2016-04-14

3

Entering edit mode

8.0 years ago

dariober 14k

The quantile function in R is handy in these cases. To discard datapoints in the top 1% (and keep the bottom 99%) do:

xv<- rnbinom(10000, 1, 0.1)     ## Test data
qqcut<- quantile(xv, p= 0.99)   ## Cut off to use in R or awk
xvcut<- xv[xv < qqcut]          ## Keep only points in lower 99%

ADD COMMENT • link 8.0 years ago by dariober 14k

score 1 · Answer 2 · 2016-04-13

1

Entering edit mode

8.0 years ago

Devon Ryan 104k

Load the coverage into R, plot the distribution, and then use awk to filter the files given some reasonable threshold given the coverage distribution.

ADD COMMENT • link 8.0 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks @Devon Ryan. I followed your advice and this is the plot I created. X-axis shows the # of reads/counts and Y-axis is the frequency of each count. Is there a way to determine which threshold is 'reasonable'?

ADD REPLY • link 8.0 years ago by biohack92 ▴ 170

3

Entering edit mode

You could make a qq-plot, but it's generally OK to just eye-ball things. From the distribution, it looks like a threshold around 200 would be reasonable.

ADD REPLY • link 8.0 years ago by Devon Ryan 104k