Is there a way to run samtools sort, index, and view on multiple files at once?
2
0
Entering edit mode
7.8 years ago
ely.j.fish • 0

I have a large number of .bam files, all sorted and indexed using samtools, so that the folder contains files in .bam, .sorted.bam, and .sorted.bam.bai formats. I need to extract some data about number of reads, both total and mitochondrial. Basically, I need to run the following commands:

samtools view -c XYZ.bam

samtools view -c XYZ.bam chrM

samtools view -c -q 30 XYZ.bam

samtools view -c -q 30 XYZ.bam chrM

Where chrM is the mitochondrial chromosome. My question is: is there a script that will allow me to run these commands on all the files at once, and export the resulting read numbers into a libreoffice spreadsheet? Also, is there a way to sort and index a lot of bam files with a single command? I'm totally new to programming, so I apologize if these questions are either insanely easy or ridiculously complicated. I would just really prefer not to have to run these commands and copy-paste 300-odd times, which is what I've been doing thus far.

next-gen sequencing sequence samtools • 3.0k views
ADD COMMENT
2
Entering edit mode
7.8 years ago
John 13k

To answer your question directly, you can put your commands in a shell script which loops over the files like so:

for f in /path/to/bams/*.bam
do
 echo "Processing $f"
 samtools view -c $f 
 samtools view -c $f chrM
 samtools view -c -q 30 $f 
 samtools view -c -q 30 $f chrM
done

However, this method has 4 downsides:

  1. You're not working in parallel - you'll only use 1 cpu to do all the processing. This can be fixed with a more complicated shell script though.
  2. You read over every file 4 times to calculate what could be gained in a single pass, so it's going to be 4x slower than ideal.
  3. The results you'll get out won't be formatted nicely for a libreoffice table.
  4. If there's another BAM file on your hard drive tomorrow, you'll have to copy/paste some stuff, which is always something to best avoid if you can.

Alternatively you can use a script i wrote called SeQC, but you'll have to use/learn a bit of SQL to get the results you want in a nice table. First download it from here: https://github.com/JohnLonginotto/SeQC

Then run the command: python /path/to/SeQC.js.py --analysis MAPQ RNAME --input /path/to/bam1.bam /path/to/bam2.bam /path/to/all/bams/*.bam --output /path/to/output/stats.sql

You can also pop in a --cpu X in the above if you like to run over however many CPUs you have on your system. This will calculate the reads per chromosome per MAPQ, so will look something like:

MAPQ        RNAME       counts
----------  ----------  ----------
3           chr16       704290
1           chr4        134562
255         chr9        505645
255         chr19       347598
1           chr17       188383
3           chr1        239878
...

Then you can sum the reads for chrM, or sum the reads for chrM where the MAPQ > 30, or whatever you want, and it will be near-instantaneous after the sql table has been made. If you decide to go down that route, i'll help you with the SQL.

ADD COMMENT
2
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1951 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6