Just need to learn to do one thing: counting the reads that map to the human mitochondrial genome
1
0
Entering edit mode
8.6 years ago

This is my scenario: I am doing whole genome amplification from human cells, and then doing NGS with an Illumina platform. I get bam files from the machine. I am trying to find out how many times the mitochondrial DNA is represented in each sample. I know that mitochondrial DNA gets amplified during my whole genome amplification (I confirmed with the manufacturer of the kit). I've gotten to the point where I downloaded IGV and can map the reads to the human mitochondrial genome. I can see the mapped reads. But I cannot figure out how to get a specific value, i.e. a count of the specific number of mapped reads in each sample. The idea is to be able to tell if one sample had more mitochondrial DNA than another sample.

I am working on a Windows 7 64-bit computer. (But could have access to a mac as well)

Thanks so much any help is greatly appreciated!

next-gen-sequencing alignment • 5.0k views
ADD COMMENT
1
Entering edit mode

can map the reads to the human mitochondrial genome

Hope you are mapping reads to other parts of the genome as well? Meaning: whole (human?) genome. Because if all what you got are BAM files with reads mapped to just 16kb MT, the number of reads mapped will not tell you much about number of MT per cell. For that you will need a number of reads mapped to all chromosomes and to MT.

Also: Which mapper?

ADD REPLY
0
Entering edit mode
Agreed. You want to do copy number analyses and it helps a lot when you can relate the mitochodrial counts to the other chromosomes. For example you could calculate the log-ratio of chrM VS rest of the genome and use that as your test statistic
ADD REPLY
0
Entering edit mode

Yes, the Illumina-sponsored software I am using gives me an absolute number of reads mapped to autosomes. So I was going to divide the number of reads to mitochondrial DNA by the number I get for autosomal mapped reads. The Mapper is called Bluefuse. I can map all my reads to the human genome, but I don't have the option to map to the mitochondrial genome. That would have been easiest.

ADD REPLY
0
Entering edit mode

I don't have the option to map to the mitochondrial genome

Because that would be pointless (unless your question is: did I got mitochondrial DNA in my sample?). The input FASTQ files differ in the total number of reads and in their quality. The two numbers: MT_mapped_in_A vs MT_mapped_in_B (assuming that taking 16kb out of 3bilons bases will not screw up the mappings). So you can have A tissue rich with mitochondria but with low DNA yield and really crappy sequencing vs B tissue having lower number of MT per cell, but on which the sun was shining the day(s) it was prepared and sequenced. So you got MT_mapped_in_B > MT_mapped_in_A. Not so great, no?

ADD REPLY
0
Entering edit mode

I am trying to understand the point you are making. Would it be ok to do the comparison if I use BAM files only? For example SampleA_Reads in MT/SampleA_Reads in Human Genome VERSUS SampleB_Reads in MT/SampleB_Reads in Human Genome. If the first is a larger number than the second, I can assume there is more mtDNA in the first sample, correct?

ADD REPLY
0
Entering edit mode

Yes, you are right, this will give you the idea about the numbers of MT/cell in samples. If you want to get exact (well..) numbers, you will have to take into account the size of nuclear genome (2x3000Mb or rather 2x what is the non-MT contigs length in your BAM header SQ lines) vs MT (16kb). It is a bit more complicated (look for Genomic alignment and mappability), but better lets solve one issue at the time.

MT sequence may have different names depending which genome version you are using. Are you able to load any of your BAM files (preferably the smallest) in IGV on your Windows, and check how chromosomes/MT sequences are named? Meaning i.e: chr1 vs 1, and chrMT vs MT.

ADD REPLY
0
Entering edit mode

Probably booting your PC using LiveCD with Ubuntu/Lubuntu Linux, installing samtools using Synaptic (or apt-get) and processing your BAM files i.e. located on some USB drive will be hopefully the fastest and easiest route. Just check if BAMs you got out of Bluefuse have been sorted and indexed (files with extension .bai). If not, give us the total number of BAMs and their average size. Because using some social skills and getting access to some Linux server/workstation makes more sense than fighting with terabyte of results on i.e 5 years old laptop for days on end.

In any case, google/check once on Linux with samtools installed:

samtools view -H your_A.bam | less
samtools idxstats your_A.bam | less

(it will give you the text output for all contigs/chromosomes).

You will need a bit of command-line scripting for processing say more than 10 BAMs and few installed on Linux utilities (grep, awk) to grab the numbers from the samtools output.

ADD REPLY
0
Entering edit mode

I checked, and from Bluefuse I three files for each sample: .fastq, .bam and .bam.bai

Form your comment above, I gather that using the .bam.bai would be best

ADD REPLY
0
Entering edit mode

You need both. Usage: samtools idxstats <in.bam>

Samtools will look for in.bam.bai See: http://www.htslib.org/doc/samtools.html

ADD REPLY
3
Entering edit mode
8.6 years ago

Given a sorted BED file called extents.bed containing the genomic bounds of chromosomes for a genomic build (here is an example for hg19), you can use BEDOPS bedmap --count with --chrom chrM to get the number of reads over the mitochrondrial DNA:

$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsA.bam) > answerA.txt

You could repeat this for each sample BAM file and compare answers:

$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsB.bam) > answerB.txt
$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsC.bam) > answerC.txt
...

Once you have all the answer-files, you can sort them by counts into a final table:

$ awk '{ print FILENAME"\t"$1; }' answer*.txt | sort -n -k2 -r - > finalAnswer.txt

Doing work on the command line is pretty powerful and if you are a beginner, I recommend taking an hour or two to learn some of the basics here. There are very few turnkey applications that solve problems, but a practically infinite number of ways to solve things with the command line.

ADD COMMENT
0
Entering edit mode

Thanks Alex, but being a newbie this sounds a little technical. What program do I need to download to do what you are suggesting?

ADD REPLY
1
Entering edit mode

The jump you are trying to make, from a newbie to counting reads is a very big one. If you really want this fast, and insist on sticking to Windows, a short term solution is to get a Geneious software trial, import the reads and map them to human mtDNA. The trial would give you 31 days to do it.

ADD REPLY
0
Entering edit mode

The toolkit is called BEDOPS. It contains tools called bedops and bedmap that do efficient set and map operations on BED-formatted files. Among other features, it also offers convert2bed, which converts files in other genomic formats (like BAM) to the BED format, so that set and map operations can be done on those files. You would need to run it on the command line, under a Linux or Mac OS X installation. It does not run on Windows. The site I linked to has an overview of the toolkit, along with documentation for each tool.

ADD REPLY

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6