Biostar Beta. Not for public use.
Insert Size For Illumina Gaiix Paired-End Library From Sam/Bam File
0
Entering edit mode
12 months ago
bioinfo • 700
New Zealand

From the fastq data (read 1 and read 2) from illumina GAIIx platform ( paired-end library), I created the Sam and bam file using BWA. I got the statistics of number of uniquely-paired reads and total reads mapped to my reference genome. I was wondering how to get a complete list of insert size across the data set from my Sam or bam file? It could be just for uniquely paired reads or all mapped reads. I don't want to write a long script for that. Is there any easy commands from samtools or other options that I can use?

ADD COMMENTlink
0
Entering edit mode

BWA's command line output already gives info about the mean and the SD of insert size while its running.

ADD REPLYlink
0
Entering edit mode

What command are you talking about? Tha samtools flagstat usually gives a basic statistics of the mapping like uniquely-paired, total reads mapped, singletons etc. I'm not sure which step you are mentioning. Can you explain me a bit more?

ADD REPLYlink
0
Entering edit mode

when you ran bwa it displayed the insert size info - but if you did not save that then that information is probably lost now

ADD REPLYlink
0
Entering edit mode
4.0 years ago
matted 7.0k
Boston, United States

The Picard function CollectInsertSizeMetrics will do that for you, including text output and making a histogram PDF.

ADD COMMENTlink
0
Entering edit mode

I can't remember why now, but I wasn't satisfied with the mean and SD that was given by picard tools. It resulted in huge reduction of the % of mapped reads. I wrote my own script to compute the inner distance mean and sd, seems to work fine.

ADD REPLYlink
0
Entering edit mode

I tried with the Picard function "CollectInsertMetrics"

my commands:

java -Xmx2g -jar ~/bin/picard-tools-1.70/CollectInsertSizeMetrics.jar H=GMMhist.txt DEVIATIONS=10 I=GMM.bam O=GMMinsertsize.txt R=ref.fasta VALIDATIONSTRINGENCY=LENIENT

The programme took the input as below:

[Mon Aug 20 10:00:00 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM _FILE=GMM_ hist.pdf DEVIATIONS=10.0 INPUT=SOD _pal_ BWA _GMM.PE.sorted.bam.sorted_ cleaned _GMM.bam OUTPUT=GMM_ insert _size.txt REFERENCE_ SEQUENCE=Sodalis _ref.fasta VALIDATION_ STRINGENCY=LENIENT MINIMUM _PCT=0.05 METRIC_ ACCUMULATION _LEVEL=[ALL_ READS] ASSUME _SORTED=true STOP_ AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION _LEVEL=5 MAX_ RECORDS _IN_ RAM=500000 CREATE _INDEX=false CREATE_ MD5 _FILE=false [Mon Aug 20 10:00:00 BST 2012] Executing as acdguest2@euler01.black on Linux 2.6.38.4 amd64; Java HotSpot(TM) 64-Bit Server VM 1.6.0_ 22-b04; Picard version: 1.70(1215)

Error message:

WARNING 2012-08-20 09:43:49 SinglePassSamProgram File reports sort order 'queryname', assuming it's coordinate sorted anyway.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data.

WARNING 2012-08-20 09:43:49 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0

[Mon Aug 20 09:43:49 BST 2012] net.sf.picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=506396672

I didn't really understand: "All data categories were discarded because they contained < 0.05 of the total aligned paired data."...what does it mean? how can I solve the problem?

ADD REPLYlink
0
Entering edit mode

What output does samtools flagstat give for your bam file?

ADD REPLYlink
0
Entering edit mode

Here is my flagstat ouput

6699458 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

5036221 + 0 mapped (75.17%:nan%)

6699458 + 0 paired in sequencing

3349729 + 0 read1

3349729 + 0 read2

4807142 + 0 properly paired (71.75%:nan%)

4942444 + 0 with itself and mate mapped

93777 + 0 singletons (1.40%:nan%)

22334 + 0 with mate mapped to a different chr

20960 + 0 with mate mapped to a different chr (mapQ>=5)

* My reference genome (1 chromosome, 4 plasmids and 1 phage)

ADD REPLYlink
0
Entering edit mode

Hm... so the error message is coming from the MINIMUM_PCT argument. The docs say: "When generating the histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1) Default value: 0.05." So it means Picard doesn't think many of your reads fit these categories (which they should, hopefully). What does Picard ValidateSamFile say about your bam? What did you use to make it? My only guess now is maybe the mate pair information isn't written in the bam in the way that Picard expects it.

ADD REPLYlink
0
Entering edit mode

The bam file I used initially was created by BWA and then cleaned up by GATK (realigned by " RealignerTargetCreator" and "IndelRealigner") and sorted by Samtools. Now I just checked whether there is any incompatibility between the bam file and the picard tools. So, I used the bam file created by samtools before cleaning up step with GATK. Now, I got some validation error below.

.................... ..................... I picked up a few lines of error meaage

Ignoring SAM validation error: ERROR: Record 5130866, Read name HWUSI-EAS1643R:29:FC:6:15:5339:15720, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130880, Read name HWUSI-EAS1643R:29:FC:6:25:18867:7265, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130881, Read name HWUSI-EAS1643R:29:FC:6:31:6733:1961, MAPQ should be 0 for unmapped read. Ignoring SAM validation error: ERROR: Record 5130916, Read name HWUSI-EAS1643R:29:FC:6:98:1162:2696, MAPQ should be 0 for unmapped read.

Ignoring SAM validation error: ERROR: Record 5130942, Read name HWUSI-EAS1643R:29:FC:6:48:19322:1813, MAPQ should be 0 for unmapped read.

................................................ ..........................................

ADD REPLYlink
0
Entering edit mode

Those errors are expected (see the FAQ here). My only remaining guess is that somehow the BAM sort order matters... but I don't see why that would be the case. In my scripts I only run CollectInsertSizeMetrics on coordinate-sorted BAMs, passing ASSUME_SORTED=true. Might be something to try. Sorry I can't be more helpful.

ADD REPLYlink
1
Entering edit mode

I also faced the same issue. Without the bam file sorted, it produces the error. After sorted, it works well.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1