Biostar Beta. Not for public use.
samtools - Finding Average Fragment size for pair?
0
Entering edit mode
2.9 years ago
oars • 150
@oars41179

I'm trying to find the average fragment size for reads in my .bam file and want to also exclude read pairs with a fragment size of greater than 1kb (Note: not read length, but fragment size for the pair.) I put together a command line script but the output is all wrong; this is what i tried:

enter code here
$samtools view -h SRR1611183.bam chr22 | awk 'length($10) > 1000 || $1 ~ /^@/' | samtools view -h - > bar2.bam  This basically gives me that same output as the idxstat command: samtools idxstats SRR1611183.bam chr22  These are good for finding the sequence length but I'm seeking the fragment size. Finally, I found a command line script that looked like a winner, but the output is all zeros? samtools view -h SRR1611183.bam chr22 | awk -F'\t' '{if((/^@/) || (length($10)>1000)){print 0}}' | samtools stats | grep '^SN' | cut -f 2-  And here is the output (something is off): raw total sequences: 0 filtered sequences: 0 sequences: 0 is sorted: 1 1st fragments: 0 last fragments: 0 reads mapped: 0 reads mapped and paired: 0 # paired-end technology bit set + both mates mapped reads unmapped: 0 reads properly paired: 0 # proper-pair bit set reads paired: 0 # paired-end technology bit set reads duplicated: 0 # PCR or optical duplicate bit set reads MQ0: 0 # mapped and MQ=0 reads QC failed: 0 non-primary alignments: 0 total length: 0 # ignores clipping bases mapped: 0 # ignores clipping bases mapped (cigar): 0 # more accurate bases trimmed: 0 bases duplicated: 0 mismatches: 0 # from NM fields error rate: 0.000000e+00 # mismatches / bases mapped (cigar) average length: 0 maximum length: 30 average quality: 0.0 insert size average: 0.0 insert size standard deviation: 0.0 inward oriented pairs: 0 outward oriented pairs: 0 pairs with other orientation: 0 pairs on different chromosomes: 0  Anyone see where I'm off? samtools • 1.1k views ADD COMMENTlink 3 Entering edit mode If you do a length on field 10, you're measuring the read length, which will never be longer than 1000 unless you're working with long read technologies and therefore will never show up in your output. If you want fragment size, you need to use field \$9 or calculate it yourself from the read position information, but that means you'd need to process the two lines of each pair together somehow (e.g. by sorting by read name and then comparing names when calculating the fragment size).

1
Entering edit mode
2.9 years ago
michael.ante ♦ 3.3k
@michael.ante14739

Do you have RNA-Seq or DNA-Seq data?

With DNA, you may just use bedtool's bamtobed with the bedpe option and compute the length.

For RNA-Seq you can have a look at RSEQC's inner_distance.py. Which you could adapt to your needs.

Cheers, Michael

0
Entering edit mode

whole-exome sequencing.

0
Entering edit mode

I'm not familiar with exome sequencing. Did you have splices in your aligned reads, or is the whole gene including introns covered?

With splicings have a look at RSeQC; without splicings have a look at bedtools

0
Entering edit mode

If I understand correctly, you target exon sequences, but might pull out a bit of attached intron sequence, since the whole DNA is sheared and only then captured. If I don't understand correctly, please ignore my rambling.