Question

Is there a way to run samtools sort, index, and view on multiple files at once?

0

Entering edit mode

7.8 years ago

ely.j.fish • 0

I have a large number of .bam files, all sorted and indexed using samtools, so that the folder contains files in .bam, .sorted.bam, and .sorted.bam.bai formats. I need to extract some data about number of reads, both total and mitochondrial. Basically, I need to run the following commands:

samtools view -c XYZ.bam

samtools view -c XYZ.bam chrM

samtools view -c -q 30 XYZ.bam

samtools view -c -q 30 XYZ.bam chrM

Where chrM is the mitochondrial chromosome. My question is: is there a script that will allow me to run these commands on all the files at once, and export the resulting read numbers into a libreoffice spreadsheet? Also, is there a way to sort and index a lot of bam files with a single command? I'm totally new to programming, so I apologize if these questions are either insanely easy or ridiculously complicated. I would just really prefer not to have to run these commands and copy-paste 300-odd times, which is what I've been doing thus far.

next-gen sequencing sequence samtools • 3.0k views

ADD COMMENT • link updated 7.8 years ago by Zev.Kronenberg 12k • written 7.8 years ago by ely.j.fish • 0

score 2 · Answer 1 · 2016-06-29

To answer your question directly, you can put your commands in a shell script which loops over the files like so:

for f in /path/to/bams/*.bam
do
 echo "Processing $f"
 samtools view -c $f 
 samtools view -c $f chrM
 samtools view -c -q 30 $f 
 samtools view -c -q 30 $f chrM
done

However, this method has 4 downsides:

You're not working in parallel - you'll only use 1 cpu to do all the processing. This can be fixed with a more complicated shell script though.
You read over every file 4 times to calculate what could be gained in a single pass, so it's going to be 4x slower than ideal.
The results you'll get out won't be formatted nicely for a libreoffice table.
If there's another BAM file on your hard drive tomorrow, you'll have to copy/paste some stuff, which is always something to best avoid if you can.

Alternatively you can use a script i wrote called SeQC, but you'll have to use/learn a bit of SQL to get the results you want in a nice table. First download it from here: https://github.com/JohnLonginotto/SeQC

Then run the command: python /path/to/SeQC.js.py --analysis MAPQ RNAME --input /path/to/bam1.bam /path/to/bam2.bam /path/to/all/bams/*.bam --output /path/to/output/stats.sql

You can also pop in a --cpu X in the above if you like to run over however many CPUs you have on your system. This will calculate the reads per chromosome per MAPQ, so will look something like:

MAPQ        RNAME       counts
----------  ----------  ----------
3           chr16       704290
1           chr4        134562
255         chr9        505645
255         chr19       347598
1           chr17       188383
3           chr1        239878
...

Then you can sum the reads for chrM, or sum the reads for chrM where the MAPQ > 30, or whatever you want, and it will be near-instantaneous after the sql table has been made. If you decide to go down that route, i'll help you with the SQL.

score 2 · Answer 2 · 2016-06-29

2

Entering edit mode

7.8 years ago

Zev.Kronenberg 12k

Snakemake

https://bitbucket.org/snakemake/snakemake/wiki/Home

ADD COMMENT • link 7.8 years ago by Zev.Kronenberg 12k