Question

Calculate Per Exon/Per Gene Coverage

10

Entering edit mode

11.2 years ago

thecuriousbiologist ▴ 550

Hi,

I have a list of exons (in BED format) and I wish to find the coverage per exon from a BAM file. I was looking at Bedtools coverageBED but I don't think that gives me the per exon coverage.

Are there any suggestions on how I should go about calculating the coverage of each exon?

Also, eventually, I have to also calculate per gene coverage. How do I go about calculating the per gene coverage from the per exon coverage obtained above ? Do I simply take the average of the coverage across all exons of the gene?

coverage gene exon • 26k views

ADD COMMENT • link updated 11 months ago by anii • 0 • written 11.2 years ago by thecuriousbiologist ▴ 550

0

Entering edit mode

What type of data are these? RNA-seq? Exome capture? Genomic DNA? And what are you going to do with the coverage numbers?

ADD REPLY • link 11.2 years ago by Sean Davis 26k

Devon Ryan · Answer 1 · 2013-01-24

12

Entering edit mode

11.2 years ago

Irsan ★ 7.8k

bedtools coverage -abam sample.bam -b exons.bed -counts

should give you the amount of reads that in you alignment for each region in your bed file. So if each region in your bed is an exon than you have coverage numbers for each exon. If you want to have it per gene than download a bed-file with RefSeq genes from UCSC table browser and do

bedtools coverage -abam sample.bam -b genes.bed -counts

You can get the genes.bed file by going to UCSC table browser, then

clade = mammal
genome = human
assembly = hg19
group = gene and gene prediction tracks
track = RefSeq genes
table = kgXref
output format = select fields from primary and selected tables
output file = genes.bed

press "get output"

select

geneSymbol
knownGene

press "allow selection from checked tables"

select

geneSymbol
chrom
txStart
txEnd

and finally press "get output"

ADD COMMENT • link updated 2.3 years ago by Devon Ryan 104k • written 11.2 years ago by Irsan ★ 7.8k

0

Entering edit mode

Hello Irsan, I followed you answer and got the counts for each feature but I have multiple row for each feature and I don't know if I have to take average of all or just the one with the highest count. Can you guide me what I should do?? my header of count file is like:

1       67092164        67231852        C1orf141        12
1       67092175        67127261        C1orf141        2
1       67092175        67127261        C1orf141        2
1       67092396        67127261        C1orf141        2
1       201283451       201332993       PKP1    141
1       201283451       201332993       PKP1    141
1       201283511       201330288       PKP1    72
1       201283702       201328836       PKP1    69

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

0

Entering edit mode

How did you resolve this issue? I am also stuck in this problem. Could you please help me to solve this issue?

ADD REPLY • link 11 months ago by anii • 0

0

Entering edit mode

@Irsan Can you help me here..?

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

0

Entering edit mode

If you need average coverage to all regions just pipe your output to simply awk script like -

awk  '{sum+=$5}END{Print "Average coverage is:", sum/NR}'

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 7.6 years ago by Paul ★ 1.5k

Ram · Answer 2 · 2013-01-24

5

Entering edit mode

11.2 years ago

Pierre Lindenbaum 161k

I wrote that tool ~~https://code.google.com/p/variationtoolkit/wiki/BedDepth~~ https://github.com/lindenb/jvarkit/wiki/BamStats04 it takes a bed and a set of BAMs and output a few statistics about the regions of interest:

-f (bam-file) add this bam file. Can be called multiple times
-m (min-qual uint32) (optional) min SAM record Quality.
-M (max-qual uint32) (optional) max SAM record Quality.
-D (min-depth) (optional) min depth. Can be called multiple times
-d do NOT discard duplicates

$ echo -e "seq2\t100\t200\nseq1\t0\t10\nseq2\t1500\t1550" |\
    beddepth -D 5 -D 100 -D 200 -f  sorted.bam -f ex1.bam |\
    verticalize

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 11.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hello,

I compiled your package and tried it:

java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_ORFs-all.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_ORFs-all.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Jun 07 21:27:02 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-07 21:27:02     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-07 21:27:02     BamStats04
java.lang.IllegalArgumentException: Invalid reference index -1
        at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
        at htsjdk.samtools.SAMFileReader.query(SAMFileReader.java:397)
        at htsjdk.samtools.SAMFileReader.queryOverlapping(SAMFileReader.java:418)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:93)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058027008

Any ideas on how to fix this?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

the chromosomes defined in your BED and in GDR-18.sort.grp.md.bam are not the same.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Alright, makes sense. I fixed that, new problem now:

java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_conc-allORFs.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Mon Jun 09 17:09:06 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_conc-allORFs.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon Jun 09 17:09:06 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-09 17:09:06     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-09 17:09:08     BamStats04
java.lang.IllegalStateException: Inappropriate call if not paired read
        at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:654)
        at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:662)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:100)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Mon Jun 09 17:09:08 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2058027008

Now I am not sure how to fix this, but what happend is that since some of my PE reads were overlapping, I merged them. Then I mapped (bwa) the SE and PE reads and merged the 2 bam files into one. Is that why there is an error?

Thanks.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

You're right. I've fixed the code to handle non-paired-ends, thanks:

https://github.com/lindenb/jvarkit/commit/72c4f302902e44df15f20a1d181cb551ef9dba37#diff-8d505bb25f8d3b768022df93d1527778

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Pierre Lindenbaum 161k

score 0 · Answer 3 · 2013-01-23

0

Entering edit mode

11.2 years ago

Ashutosh Pandey 12k

I use GATK DepthofCoverage for calculating coverage. You can provide multiple files (one for exons, one for genes) and it will calculate the coverage for you. Gene coverage could be different from average across all the exons. Reads may be aligned to Introns and you may miss them if you only include exons. Link GATK: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have

ADD COMMENT • link 11.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Be careful when using GATK DepthofCoverage. It has filters with no documentation, which filter out some of your reads without any explained mechanism. Sometimes those filtered-out reads may be important, but the GATK has no option to deactivate the filters.

ADD REPLY • link 9.5 years ago by earthworm • 0

score 0 · Answer 4 · 2013-01-24

0

Entering edit mode

11.2 years ago

biorepine ★ 1.5k

It is best to calculate normalized expression for each exon. This script would do the job. Good luck!

ADD COMMENT • link 11.2 years ago by biorepine ★ 1.5k