Biostar Beta. Not for public use.
Extracting specific sequences from FASTQ using Seqtk
0
Entering edit mode
21 months ago
omer.k • 20

I have a large fastq file containing NGS sequences. Using prior tools I've been able to narrow down the names of some 2,500 reads which are of interest to me, in OutputFile.txt. In this file there are names of the reads ordered in the following patter:

1123    084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z

1124    5c621711-9233-40a3-a544-419fe5135d83 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2382 ch=320 start_time=2018-03-27T11:35:19Z

etc...

I'd like to use this OutputFile.txt to extract only the names of the reads and the sequences from the original fastq file into a new filtered fastq file

I have Seqtk installed on windows, running via Command Line. I believe I could use Seqtk grep to extract only those 2,500 reads. I'm having trouble though with the proper command and the flags.

Could you please advise?

SeqKit • 864 views
ADD COMMENTlink
1
Entering edit mode

I think BBMap do this job, with filterbyname.sh script

filterbyname.sh in=reads.fq out=filtered.fq names=names.txt include=t

From this thread

Edit :

Also possible with seqtk subseq :

seqtk subseq  in.fastq  list.txt  > out.fastq

From here

ADD REPLYlink
0
Entering edit mode

Thanks @Bastien. But I have tried it before and it gives an error:

[ERRO] fastx: invalid FASTA/Q format

Any idea why this happens?

ADD REPLYlink
0
Entering edit mode

Maybe it is related to nanopore data, space in the headers... I never process these kind of data. If BBmap and seqtk cannot manage those data I cannot do much more sadly.

You have the best ones on the case, see below , I will not say better :)

ADD REPLYlink
0
Entering edit mode

There is always Biopython

ADD REPLYlink
0
Entering edit mode

Is this nanopore data? I wonder if the tools are having problems with the spaces in the names of fastq headers. You could certainly try filterbyname.sh or otherwise change the spaces to _ temporarily and then use the tools mentioned by Bastien.

ADD REPLYlink
0
Entering edit mode

This is indeed nanopore data. Could these spaces in the header of the nanopore data be the reason for the error I'm getting (see my reply to Bastien's suggestion)?

ADD REPLYlink
0
Entering edit mode

While I did not spend a lot of time it appears that neither filterbyname.sh or seqtk subseq work as intended with nanopore fastq data. I have asked @Brian about BBMap.

Also checking with @Wouter deCoster to see if his nanofilt tool can be easily modified to do this task.

ADD REPLYlink
0
Entering edit mode

The 084b5b69-e819-426f-9ff3-fea4891af330 portion should be a unique identifier. Therefore you don't need the other parts of your txt file/read names for filtering. I'd assume filterbyname.sh should work if you only use those unique parts.

ADD REPLYlink
0
Entering edit mode

Thanks Wouter, I understand your tip. However, what would be a fast way to filter the file which contains all the reads of my interest (that is, the file which I'd liketo guide the filtering of the FASTQ file), and keep just the identifiers, i.e. 084b5b69-e819-426f-9ff3-fea4891af330?

ADD REPLYlink
0
Entering edit mode
4 months ago
genomax 68k
United States

Simplest thing to do is put the patterns you need (from that cut command) in a file.

$ cat name.txt 
bba1f3f9-6b93-48b5-8c99-9c0d3eafa232
caa54c6e-bab0-4d7a-acd6-cad8011bdd10
125a259c-1937-47c1-9b17-5d88d833784f

then use grep to pull sequences out.

$ grep -f name.txt -A 3 your.fq > seq_you_need.fq
ADD COMMENTlink
0
Entering edit mode

Hi genomax, How would you extract though only the 084b5b69-e819-426f-9ff3-fea4891af330 from a line such as 1123 084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z?

Or even from 084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z (I can remove rather easily just the first string which are the prefix-numbers)?

ADD REPLYlink
0
Entering edit mode

If records are space separated then cut -d ' ' -f2 your_file > id_you_need should pull out that second column. If records are tab separate then just cut -f2 your_file > id_you_need should work.

ADD REPLYlink
0
Entering edit mode

Thanks, I've read online and used exactly this command (cut) to process the lines. Now I have a file with only the unique identifiers, which will guide the filtering of the FASTQ file. Will update.

ADD REPLYlink
0
Entering edit mode

Filtering by grep was successful. I used $ grep -f name.txt -A 3 your.fq > seq_you_need.fq as suggested by genomax. Thanks.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3