Question

Filter a large fasta or fastq by a query sequence, with parameterized fuzzy matching

0

5.5 years ago

ovon ▴ 20

I'm looking for a practical option for filtering a large fasta or fastq file. I'm dealing with a MiSeq read set where there was a problem with our Index reads, but we can separate the sequences by primer type. So far I have tried this approach:

1. Use fuzzy matching (agrep) to query the entire sequence set, allowing 2 mismatches, and save the matching sequences in a file.
2. Figure out which read names correspond to the matching sequences identified in step 1 (grep -f).
3. Filter the original fastq with that list of read names.
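
For the last step above, a short Python sketch may be quicker than a pattern-list grep over the whole file: it loads the read names into a set and streams the fastq once. This is only an illustration. It assumes plain four-line fastq records and that the read name is the header text up to the first whitespace; `read_fastq` and `filter_by_names` are hypothetical names, not part of any existing tool.

```python
def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a four-line-per-record fastq."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def filter_by_names(fastq_handle, names, out_handle):
    """Write records whose read name is in `names`; return how many were kept."""
    wanted = set(names)  # set membership is a constant-time lookup on average
    kept = 0
    for header, seq, plus, qual in read_fastq(fastq_handle):
        # Assumption: read name = header without '@', up to the first whitespace.
        parts = header[1:].split(None, 1)
        name = parts[0] if parts else ""
        if name in wanted:
            out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
            kept += 1
    return kept
```

It could be called with ordinary file handles, for example `filter_by_names(open("reads.fastq"), names, open("subset.fastq", "w"))`, where both file names are placeholders.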

This works fine for subsets of the sequencing run, but if I want to do the entire run, it will take ages (many days, or perhaps weeks, depending on the size of the sequence set). This isn't practical.

I'm looking for an existing tool (or even a series of bash commands) that can take a query sequence (my primer) and filter the entire fasta or fastq based on a fuzzy match where I can set the number of allowed mismatches. It should be able to handle a full MiSeq run worth of reads (in my case, 5-10GB in fastq format). It doesn't need to work for fastq, since I can always filter my fastq files using the read names in a hypothetical fasta output.
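
To make the requirement concrete, here is a minimal Python sketch of the kind of filter described above: one pass over the fastq, keeping a record if any window of the read matches the query within a set number of mismatches. It is a sketch under stated assumptions, not an existing tool: it counts substitutions only (no insertions or deletions), searches the forward strand only, assumes four-line fastq records, and is not tuned for speed on a full run. The names `hamming_within`, `has_fuzzy_match`, and `filter_fastq` are hypothetical.

```python
def hamming_within(a, b, max_mismatches):
    """Return True if equal-length strings a and b differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def has_fuzzy_match(seq, query, max_mismatches):
    """Slide the query along seq and test every full-length window."""
    m = len(query)
    for start in range(len(seq) - m + 1):
        if hamming_within(seq[start:start + m], query, max_mismatches):
            return True
    return False


def filter_fastq(in_handle, out_handle, query, max_mismatches=2):
    """Copy matching four-line records to out_handle; return how many were kept."""
    query = query.upper()
    kept = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        # Assumption: line 2 of each record is the sequence (unwrapped fastq).
        if has_fuzzy_match(record[1].strip().upper(), query, max_mismatches):
            out_handle.writelines(record)
            kept += 1
    return kept
```

Because it streams record by record, memory use stays flat regardless of file size; the cost is per-read scanning time in pure Python.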

It seems like something of this nature would exist already, but I'm having trouble finding anything that works for datasets of this size. The fuzzy search tool at http://www.bioinformatics.org/sms2/fuzzy_search_dna.html works on small sequence sets, but when I downloaded it and modified the scripts and HTML files so it could handle my inputs, it just crashed. I suspect the HTML frontend is the issue, and I don't have the expertise to modify the .js files beyond changing parameters. I am more familiar with awk, sed, and other bash commands, Perl, and Python.

Thanks in advance for any tips/answers!

fasta fastq filtering • 4.0k views

updated 5.5 years ago by finswimmer 16k • written 5.5 years ago by ovon ▴ 20

0

http://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html

5.5 years ago by cpad0112 21k

score 3 · Answer 1 · 2018-11-01

3

5.5 years ago

finswimmer 16k

Hello,

Have a look at seqkit grep.

fin swimmer

5.5 years ago by finswimmer 16k

0

Thank you very much. I tested this just now and it works very well; it is much faster than my agrep-based method. A very useful tool to know about.

5.5 years ago by ovon ▴ 20