Question

Filtering reads on the basis of percentage of ambiguous (N) characters

0

5.8 years ago

Varun Gupta ★ 1.3k

Hi,

I want to filter my FASTQ reads by their proportion of ambiguous characters, basically N's. If more than 5% of the bases in a read are N's, I want to discard that read; if fewer than 5% are, I want to keep it. It would be really helpful if someone knows of a tool that already does this.

Thanks

fastq RNA-Seq filter • 2.3k views

updated 5.8 years ago by Pierre Lindenbaum 161k • written 5.8 years ago by Varun Gupta ★ 1.3k

score 1 · Answer 1 · 2018-06-21

1

5.8 years ago

Pierre Lindenbaum 161k

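Using paste + awk. paste joins each 4-line FASTQ record into one tab-separated line, awk keeps the record only if the fraction of non-ACGT characters in the sequence is at most 5%, and tr restores the 4-line layout:

gunzip -c input.fq.gz |\
 paste - - - - |\
awk -F '\t' '{S=$2;L=length(S);gsub(/[^ATGCatgc]/,"",S);L2=length(S); if(L>0 && (L-L2)/L <= 0.05) print $0;}' |\
tr "\t" "\n"

Here L is the read length and L2 the number of unambiguous bases, so (L-L2)/L is the ambiguous fraction. Any character other than A, C, G or T counts as ambiguous, not only N.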
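
For anyone who wants to check the threshold behaviour on a small input first, here is a minimal sketch of the same paste + awk idea wrapped in a shell function. The function name filter_ambiguous, its optional threshold argument and the two toy reads are made up for illustration; they are not part of any existing tool. It assumes standard 4-line FASTQ records with no tabs in the header lines.

```shell
# Hypothetical helper: keep FASTQ records whose fraction of non-ACGT
# characters is at most MAX (first argument, default 0.05).
# Reads 4-line FASTQ records on stdin, writes the kept records to stdout.
filter_ambiguous() {
  paste - - - - |
  awk -F '\t' -v max="${1:-0.05}" '{
      seq = $2
      total = length(seq)
      called = gsub(/[ACGTacgt]/, "", seq)   # gsub returns the number of replacements
      if (total > 0 && (total - called) / total <= max) print
  }' |
  tr '\t' '\n'
}

# Toy input: read1 has 1 N in 20 bases (5%, kept),
# read2 has 2 N in 20 bases (10%, dropped).
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGTACGN' '+' 'IIIIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGTACGTACGTACNN' '+' 'IIIIIIIIIIIIIIIIIIII' |
filter_ambiguous 0.05
```

On real data it would be used as gunzip -c input.fq.gz | filter_ambiguous 0.05. A read sitting exactly at the threshold is kept here; change <= to < in the awk condition if such reads should be discarded instead.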