Seqtk subseq: structure of file name.lst
0
1
Entering edit mode
6.0 years ago
ste.lu ▴ 80

Hi All,

If I'm not wrong with the command subseq of seqtk program I am able to fish out some reads with a specific sequence, is that right?

the cmd is:

seqtk subseq  in.fastq  list.txt  > out.fastq

In this case it is not clear the structure of the file list.txt, are the sequence I want to look for enough? I didn't find anything about this file in the documents of the seqtk.

Thank you

sequence next-gen • 13k views
ADD COMMENT
2
Entering edit mode

one id per line (without @).

Example fastq:

@SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
TGAATAGCTGGAGGAATGCAGACCTCTGNNNNNNNN
+
fff^ScffafcSdcfd^cYYcY]WeBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1
AAACCAAAGTCCAGATAAAGGCACGCCCNNNNNNNN
+
ffaaffaaf]RZK__ceYe[acfafBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4676:1152#0/1
TATCTGAGGCNAATCACACACTCTACTTNNNNNNNN
+
fggggfffafBcfffeedeegggcg`baBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:5350:1149#0/1
GCTCGCAGTGNGCCGATCAGGGCGTAGNNNNNNNNN
+
^gacgdfddfBeece]Y\\^fc]fcBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:5581:1148#0/1
TCCTCCTCTGNAAAATGAGTCACTGCANNNNNNNNN
+
gggggfdfffBffffffdffgfgagabBBBBBBBBB

Example list:

SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1

Output:

$ seqtk subseq test.fastq list.txt 
@SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
TGAATAGCTGGAGGAATGCAGACCTCTGNNNNNNNN
+
fff^ScffafcSdcfd^cYYcY]WeBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1
AAACCAAAGTCCAGATAAAGGCACGCCCNNNNNNNN
+
ffaaffaaf]RZK__ceYe[acfafBBBBBBBBBBB
ADD REPLY
0
Entering edit mode

Thanks for the answer!

Then this is not what I was looking for. Is there a way to look for a particular sequence into FASTQ files and fish out all the reads that have that sequence?

ADD REPLY
0
Entering edit mode

There are two ways to fish out reads of interest from a fastq: 1) by read ID 2) by Sequence/part of sequence

Command posted in OP pertains to fish out reads by name. If you want fish out sequence, you may want to furnish example input and expected output here.

ADD REPLY
0
Entering edit mode

Thanks for the explanation cpad0112.

Let's say I have this fastq file:

@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0

@M01712:427:000000000-BK9JB:1:1102:17384:1519 1:N:0:0
GATTCCTAGGCTCGAGGTTTTTCCTTTCTTTTTCCTTTTTTCTTTATAGCCCCCTTCTTTCCTCCCTTCCCTTCCCTTTTTCCTTTTCCCTCTTCTTCTTTTTCCACTTTCTTTCTTTTCTTTTCTTTTTTCTTTTTTTTTCTCCTCTTCTTCTCTTTTTCTTTTCTTTTTTTTTTTCTCTTTTTTTTTTTTTTTCTCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
+
+,,;CE9,,,66C++,+,,,,:6,9,,6<,CEE<C@,,,,86,,,,,,,9,6,6+:,:,,,69,:,,C,,4,9,,9,,:,,,,<,9,,9,B,,A,995@?9,,,-,,,,,,,,,,5,,,,,,,,,,,,85+3,,5,,++++,7,,3,,,,,,,,,7,,3,,3337,3,,33*15*11,7,2,,++**1**//1//+300020+2++3+/**:*/*/1:*/28***1/*2*//:*:3:)9)97)/7))1)171))1)/)9)1)2:/1

@M01712:427:000000000-BK9JB:1:1102:9753:1520 1:N:0:0
GAGTCCGAGTCTGCCTCGCTTTCATCTCTTTTTCCTGATCTCTTCAGATCACACTTCTTCTCTCCCTTCACCTCTCTATCTCTTCTTCCCTCTTCTTCTTTTTATCTCACCCCTTCTTTTTTTTTTTCTTTTCTTTTCTTCTTTCTCTTTTTTTTCTTTCTTTTTTTTTTCTCTTTTTTTTTTCTCTCTCCTTCTTCTTTTTTTTTCTCTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTCTTTTCTTT

and I want to have in another file (i.e. subset.fq) only the reads that contain:

'GTTCATAGCTGTTTC'

To make a practical example: I have a Chip-Seq and I want to have only the reads with a specific motif in another file.

Thank you

ADD REPLY
0
Entering edit mode

I am not sure if seqtk can subset by string (pattern). Instead, try following. Download / install seqkit from here: https://github.com/shenwei356/seqkit#installation

Input (since fastq sequences provided above 2 full fastq records and half fastq record without quality information, I trimmed it to 2 fastq records):

$ cat test.fastq 
@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0

@M01712:427:000000000-BK9JB:1:1102:17384:1519 1:N:0:0
GATTCCTAGGCTCGAGGTTTTTCCTTTCTTTTTCCTTTTTTCTTTATAGCCCCCTTCTTTCCTCCCTTCCCTTCCCTTTTTCCTTTTCCCTCTTCTTCTTTTTCCACTTTCTTTCTTTTCTTTTCTTTTTTCTTTTTTTTTCTCCTCTTCTTCTCTTTTTCTTTTCTTTTTTTTTTTCTCTTTTTTTTTTTTTTTCTCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
+
+,,;CE9,,,66C++,+,,,,:6,9,,6<,CEE<C@,,,,86,,,,,,,9,6,6+:,:,,,69,:,,C,,4,9,,9,,:,,,,<,9,,9,B,,A,995@?9,,,-,,,,,,,,,,5,,,,,,,,,,,,85+3,,5,,++++,7,,3,,,,,,,,,7,,3,,3337,3,,33*15*11,7,2,,++**1**//1//+300020+2++3+/**:*/*/1:*/28***1/*2*//:*:3:)9)97)/7))1)171))1)/)9)1)2:/1

output:

$ seqkit grep  -sdip  GTTCATAGCTGTTTC test.fastq 
@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0
ADD REPLY
0
Entering edit mode

Thanks cpad, it's working!! Actually I want them in another file so I am using:

seqkit grep  -sdip  GTTCATAGCTGTTTC test.fastq > output.fastq

Do you agree?

Thanks a lot again!

ADD REPLY
0
Entering edit mode

correct. But consider following suggestions:

  1. If you want exact motif GTTCATAGCTGTTTC in capital/black letters, remove i (-sdp instead of -sdip ignoring degenerate bases and case specific). i means ignore case.
  2. You can use >. But there are easier ways to program. -o <output.fastq> will give you output.fastq. Better if you use -o out.fastq.gz, program will output gzipped fastq.
  3. If query motif doesn't contain degenerate bases, -sdip and -srip mean the same (as I understand). If your query contains degenerate bases, be aware of the differences between -r and -d options. Explained with example below.

In options -sdip (-d) , d means IUPAC degenerate mode and r means regex. -sdip match degenerate bases in query with those from target sequence. However this doesn't work in reverse.

For eg if query sequence has degenerate base "S", program will match with matching DNA bases "G" or "C" (as per IUPAC DNA code) in the target sequence. Motif GTTCATAGCTGTTTS matches both GTTCATAGCTGTTTG and GTTCATAGCTGTTTC in target. In your case, query sequence doesn't have any degenerate bases in the motif.

Following is an example: Motif is GTTTTCSCC and target is GTTTTCGCC.

$ seqkit grep -srp  GTTTTCSCC goodHg19.fastq

No results

$ seqkit grep -sdp  GTTTTCSCC goodHg19.fastq

result with same reference fastq because S in query matches with G or C:

@SOLEXA-1GA-2_0048_FC6242L:2:1:14720:1151#0/1
GTTTTCGCCANTTTTGCTCAGTACTTCTNNNNNNNN
+
ggcggdfdfdBeeec`ceeegcfgg^aBBBBBBBBB
ADD REPLY
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. code_formatting

ADD REPLY
0
Entering edit mode

Thanks Ram for reformatting the question!

ADD REPLY

Login before adding your answer.

Traffic: 2522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6