bam file filter by TLEN and output as fastq
3
1
Entering edit mode
9.6 years ago

I have a bam file that I want to filter by the TLEN (9th) column, then output the reads that pass filter in fastq format. I have tried combinations of awk and samtools bam2fq without much luck.

Anybody?

bam fastq samtools • 6.4k views
ADD COMMENT
0
Entering edit mode

What cut-off do you want to set for TLEN?

ADD REPLY
4
Entering edit mode
9.6 years ago

You could combine the suggestions so far into a script like so (the BAM file should be name sorted):

samtools view -h name_sorted.bam  | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq read1.fq
samtools view -h name_sorted.bam  | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq2 read2.fq

The filter.awk file is

# SIZE is the minimum template lenght
BEGIN { FS="\t"; SIZE=100; S2=SIZE*SIZE }

# Write the headers 
/^@/ { print $0; next }

# Print only long entries
{ if ($9*$9 > S2) print $0}
ADD COMMENT
0
Entering edit mode

This is pretty neat. The approach I was using would have involved multiple steps. it was something like first separate _1 and _2 reads

samtools view input.bam | awk '{if(and($2,0x40)){print > "file_1.sam"} else {print > "file_2.sam"};}'

then separating forward and reverse strand reads, then reverse complementing them, then filtering reads based on TLEN and merging everything back into one.

ADD REPLY
1
Entering edit mode
9.6 years ago
komal.rathi ★ 4.1k

Filtering on Template length for paired-end reads (Your bam file should be name sorted)

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_1.fastq

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_2.fastq

Filtering on Template length for single-end reads

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk '{print "@"$1"\n"$10"\n+\n"$11}' > ./sample.fastq

Reference here

ADD COMMENT
0
Entering edit mode

Can the TLEN_CUTOFF be a range? For example, between 100-500?

ADD REPLY
0
Entering edit mode

I don't think you can do that. I'll check if there is a way around this.

ADD REPLY
0
Entering edit mode

14134125465346445 I have updated my answer to accommodate a range-based cutoff on read length.

ADD REPLY
0
Entering edit mode

I would like to filter by TLEN, which if my understanding is correct, is the Template length, not actually the read length, is that right?

ADD REPLY
0
Entering edit mode

14134125465346445 Yeah you are right. I got confused between read length & template length. Maybe you should try out Ashutosh Pandey's answer.

ADD REPLY
0
Entering edit mode

14134125465346445 I have updated my answer in order to address the correct problem.

ADD REPLY
0
Entering edit mode
9.6 years ago

You can try something like this. In order to print both reads for a read-pair you will have to use 'OR' operator as shown. I have used a threshold of 100 (highlighted in line 2) for TLEN.

samtools view input.bam | awk 'BEGIN {FS="\t"} {if ($9 > 100 || $9 < -100) print "@" $1 "\n" $10 "\n+\n" $11}' > input.fq
                                                   ^_____________________^
ADD COMMENT
0
Entering edit mode

this won't be identical to the original fastq file though since the SEQ field in a SAM file is displayed reverse complemented when the read aligns on the reverse strand (everything is on the forward strand).

ADD REPLY
0
Entering edit mode

Yup I totally forgot about that. I will have to either use FLAG info or polarity of TLEN (if concordant reads to be considered only) to check for the reads aligned on the reverse strand and print reverse complement seq for them. I will update my answer later.

ADD REPLY

Login before adding your answer.

Traffic: 1514 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6