Question

deciding --max-depth for RNAseq variant calling

0

Entering edit mode

7.1 years ago

prasundutta87 ▴ 660

Hi,

I have data from RNAseq experiment with a coverage of 100 million reads. Samtools mpileup has an option of --max-depth which means 'At a position, read maximally INT reads per input file'. Its default value is 250 and can be increased upto 1000000 (samtools v1.2). There is a limit/cap/default in this to limit memory leak. How do we decide this value for a variant calling experiment?

I am aware that a combindation of 8 identical reads that cover the location of a variant will produce a strongly supported variant call. So is using default fine or based on what information should I decide its value? Furthermore, is there an easy way to visulaize/check if the coverage is consistent across the genome?

RNA-Seq alignment sequence • 2.7k views

ADD COMMENT • link 7.1 years ago by prasundutta87 ▴ 660

0

Entering edit mode

I have data from RNAseq experiment with a coverage of 100 million reads.

Due to very uneven coverage of the 'genome' by RNA-seq this 100M is more commonly referred to as 'number of reads' rather than coverage.

Furthermore, is there an easy way to visulaize/check if the coverage is consistent across the genome?

You are using RNA-seq, so per definition the coverage is not consistent across the genome. If that's something you wonder about I guess you need to read more about RNA-seq before continuing this analysis.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you for your answers. I am aware that coverage won't be consistent because genes expressing higher will have higher stack of reads aligned to them. I wanted to ask how high should I choose the threshold to capture the highly expressed genes as well without landing to any technical issue. But I got my answer. Thank you very much.

ADD REPLY • link 7.1 years ago by prasundutta87 ▴ 660

1

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 7.1 years ago by GenoMax 142k

0

Entering edit mode

Another comment would be why are you using RNASeq to do variant calling? DNA reads are generally used.

ADD REPLY • link 7.1 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I am trying to analyse allele specific expression.I have DNA sequence as well..will check for concordance/discordance between the data I will be getting from both variant calls.

ADD REPLY • link 7.1 years ago by prasundutta87 ▴ 660

0

Entering edit mode

There are specialized tools for allele specific expression, I would suggest to give those a spin.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2017-04-03

How do we decide this value for a variant calling experiment?

The best way to decide threshold in bioinformatics, as far as I know, is always to plot the distribution of the values theirselves. In this case I assume you have a Bam file ready, where you selected the reliable reads (idk, like proper pairs or whatever filtering criterion your project requires). You can extract coverage from this file using many scripts that you will find in the net, one that comes to my mind is Bam2wig. There are many. Then, you can check the distribution of the coverage with a plot, and decide your upper and lower threshold for variant calling.

is there an easy way to visulaize/check if the coverage is consistent across the genome?

EDIT: I associate myself with @Wouter in his answer. Due to gene expression, you shouldn't have consistency in coverage.