Filtering reads on the basis of percentage of ambiguous (N) characters
5.8 years ago
Varun Gupta ★ 1.3k

Hi,

I want to filter my FASTQ reads based on ambiguous characters, basically N's. If more than 5% of the bases in a read are N's, I want to discard that read; otherwise I want to keep it. It would be really helpful if someone knows of a tool that already does this.

Thanks

fastq RNA-Seq filter • 2.3k views
5.8 years ago

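Using paste + awk (any character other than A, C, G or T is counted as ambiguous; reads with more than 5% ambiguous characters are dropped):

# paste joins each 4-line FASTQ record into one tab-separated line;
# gsub() returns the number of non-ACGT characters it removed from the sequence
gunzip -c input.fq.gz |\
 paste - - - - |\
awk -F '\t' '{S=$2;L=length(S);N=gsub(/[^ATGCatgc]/,"",S); if(L>0 && N/L <= 0.05) print $0;}' |\
tr "\t" "\n"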
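For reference, here is a minimal sketch of the same paste + awk idea wrapped in a reusable shell function. It keeps a read when its N count divided by its length is at most a threshold. The function name `filter_n` and its threshold argument are hypothetical, introduced only for illustration. The sketch also assumes plain 4-line FASTQ records (no wrapped sequence lines) and counts only `N`/`n` as ambiguous, which is narrower than "anything that is not A, C, G or T".

```shell
# Hypothetical helper: read FASTQ on stdin, write kept reads to stdout.
# $1 is the maximum allowed fraction of Ns per read (assumed default: 0.05).
filter_n() {
  paste - - - - |
  awk -F '\t' -v max="${1:-0.05}" '{
    seq = $2
    len = length(seq)
    n = gsub(/[Nn]/, "", seq)   # gsub returns how many Ns were removed
    if (len > 0 && n / len <= max) print
  }' |
  tr '\t' '\n'
}

# Tiny demonstration with two 20 bp reads:
# read1 has 0 Ns (0%), read2 has 2 Ns (10%), so only read1 is printed.
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTNCGTACGTNCGTACGT' '+' 'IIIIIIIIIIIIIIIIIIII' |
  filter_n 0.05
```

For a gzipped file this could be used as `gunzip -c input.fq.gz | filter_n 0.05 | gzip > filtered.fq.gz`, where `filtered.fq.gz` is just a placeholder output name. Note that for paired-end data each mate file would be filtered independently, so the pairs may fall out of sync.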