Biostar Beta. Not for public use.
Question: Seqtk subseq: structure of file name.lst
0
Entering edit mode

Hi All,

If I'm not wrong with the command subseq of seqtk program I am able to fish out some reads with a specific sequence, is that right?

the cmd is:

seqtk subseq  in.fastq  list.txt  > out.fastq

In this case it is not clear the structure of the file list.txt, are the sequence I want to look for enough? I didn't find anything about this file in the documents of the seqtk.

Thank you

ADD COMMENTlink 22 months ago ste.lu • 40 • updated 22 months ago RamRS 21k
Entering edit mode
1

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. code_formatting

ADD REPLYlink 22 months ago
RamRS
21k
Entering edit mode
0

Thanks Ram for reformatting the question!

ADD REPLYlink 22 months ago
ste.lu
• 40
Entering edit mode
1

one id per line (without @).

Example fastq:

@SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
TGAATAGCTGGAGGAATGCAGACCTCTGNNNNNNNN
+
fff^ScffafcSdcfd^cYYcY]WeBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1
AAACCAAAGTCCAGATAAAGGCACGCCCNNNNNNNN
+
ffaaffaaf]RZK__ceYe[acfafBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4676:1152#0/1
TATCTGAGGCNAATCACACACTCTACTTNNNNNNNN
+
fggggfffafBcfffeedeegggcg`baBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:5350:1149#0/1
GCTCGCAGTGNGCCGATCAGGGCGTAGNNNNNNNNN
+
^gacgdfddfBeece]Y\\^fc]fcBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:5581:1148#0/1
TCCTCCTCTGNAAAATGAGTCACTGCANNNNNNNNN
+
gggggfdfffBffffffdffgfgagabBBBBBBBBB

Example list:

SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1

Output:

$ seqtk subseq test.fastq list.txt 
@SOLEXA-1GA-2_0048_FC6242L:2:1:4066:1152#0/1
TGAATAGCTGGAGGAATGCAGACCTCTGNNNNNNNN
+
fff^ScffafcSdcfd^cYYcY]WeBBBBBBBBBBB
@SOLEXA-1GA-2_0048_FC6242L:2:1:4469:1153#0/1
AAACCAAAGTCCAGATAAAGGCACGCCCNNNNNNNN
+
ffaaffaaf]RZK__ceYe[acfafBBBBBBBBBBB
ADD REPLYlink 22 months ago
cpad0112
11k
Entering edit mode
0

Thanks for the answer!

Then this is not what I was looking for. Is there a way to look for a particular sequence into FASTQ files and fish out all the reads that have that sequence?

ADD REPLYlink 22 months ago
ste.lu
• 40
Entering edit mode
0

There are two ways to fish out reads of interest from a fastq: 1) by read ID 2) by Sequence/part of sequence

Command posted in OP pertains to fish out reads by name. If you want fish out sequence, you may want to furnish example input and expected output here.

ADD REPLYlink 22 months ago
cpad0112
11k
Entering edit mode
0

Thanks for the explanation cpad0112.

Let's say I have this fastq file:

@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0

@M01712:427:000000000-BK9JB:1:1102:17384:1519 1:N:0:0
GATTCCTAGGCTCGAGGTTTTTCCTTTCTTTTTCCTTTTTTCTTTATAGCCCCCTTCTTTCCTCCCTTCCCTTCCCTTTTTCCTTTTCCCTCTTCTTCTTTTTCCACTTTCTTTCTTTTCTTTTCTTTTTTCTTTTTTTTTCTCCTCTTCTTCTCTTTTTCTTTTCTTTTTTTTTTTCTCTTTTTTTTTTTTTTTCTCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
+
+,,;CE9,,,66C++,+,,,,:6,9,,6<,CEE<C@,,,,86,,,,,,,9,6,6+:,:,,,69,:,,C,,4,9,,9,,:,,,,<,9,,9,B,,A,995@?9,,,-,,,,,,,,,,5,,,,,,,,,,,,85+3,,5,,++++,7,,3,,,,,,,,,7,,3,,3337,3,,33*15*11,7,2,,++**1**//1//+300020+2++3+/**:*/*/1:*/28***1/*2*//:*:3:)9)97)/7))1)171))1)/)9)1)2:/1

@M01712:427:000000000-BK9JB:1:1102:9753:1520 1:N:0:0
GAGTCCGAGTCTGCCTCGCTTTCATCTCTTTTTCCTGATCTCTTCAGATCACACTTCTTCTCTCCCTTCACCTCTCTATCTCTTCTTCCCTCTTCTTCTTTTTATCTCACCCCTTCTTTTTTTTTTTCTTTTCTTTTCTTCTTTCTCTTTTTTTTCTTTCTTTTTTTTTTCTCTTTTTTTTTTCTCTCTCCTTCTTCTTTTTTTTTCTCTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTCTTTTCTTT

and I want to have in another file (i.e. subset.fq) only the reads that contain:

'GTTCATAGCTGTTTC'

To make a practical example: I have a Chip-Seq and I want to have only the reads with a specific motif in another file.

Thank you

ADD REPLYlink 22 months ago
ste.lu
• 40
Entering edit mode
0

I am not sure if seqtk can subset by string (pattern). Instead, try following. Download / install seqkit from here: https://github.com/shenwei356/seqkit#installation

Input (since fastq sequences provided above 2 full fastq records and half fastq record without quality information, I trimmed it to 2 fastq records):

$ cat test.fastq 
@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0

@M01712:427:000000000-BK9JB:1:1102:17384:1519 1:N:0:0
GATTCCTAGGCTCGAGGTTTTTCCTTTCTTTTTCCTTTTTTCTTTATAGCCCCCTTCTTTCCTCCCTTCCCTTCCCTTTTTCCTTTTCCCTCTTCTTCTTTTTCCACTTTCTTTCTTTTCTTTTCTTTTTTCTTTTTTTTTCTCCTCTTCTTCTCTTTTTCTTTTCTTTTTTTTTTTCTCTTTTTTTTTTTTTTTCTCTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
+
+,,;CE9,,,66C++,+,,,,:6,9,,6<,CEE<C@,,,,86,,,,,,,9,6,6+:,:,,,69,:,,C,,4,9,,9,,:,,,,<,9,,9,B,,A,995@?9,,,-,,,,,,,,,,5,,,,,,,,,,,,85+3,,5,,++++,7,,3,,,,,,,,,7,,3,,3337,3,,33*15*11,7,2,,++**1**//1//+300020+2++3+/**:*/*/1:*/28***1/*2*//:*:3:)9)97)/7))1)171))1)/)9)1)2:/1

output:

$ seqkit grep  -sdip  GTTCATAGCTGTTTC test.fastq 
@M01712:427:000000000-BK9JB:1:1102:16657:1483 1:N:0:0
GGATGTTGCTTTCTCCCGCTTCTTGCTGTTCATAGCTGTTTCCTTCTATCGCTTGTTCACCCGTCTTTACTCCCGTCACCTCACTCTCTCTTCTGCCTTCTTCTCCTTCCTAAAAAAAAACTTCTCTTTTCTTTTCTTTTTTTTTTTTTTTTTTCTTTTCTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
7+,C,CE,6EAFEE,;,+6@,,,9,,:,,:6,<,,,C,:CCC@C,,,,<9,,,,,,,6,6,:+88C,,,9<<4,+:,,4,:4,,,,5,:?,A,C,,,,:,?5<9-955,,,,,,,,+++++,,,,,,,,,,,,,3,37,,,+++****1****1,7,,,,2,72,,77,,5****/1**//*//*1/*1****/*//*/1/+303+0/2:*****12*/1/**1/1*//*1*2**)1)11)1///)/)))/1)./)/010/)//)0
ADD REPLYlink 22 months ago
cpad0112
11k
Entering edit mode
0

Thanks cpad, it's working!! Actually I want them in another file so I am using:

seqkit grep  -sdip  GTTCATAGCTGTTTC test.fastq > output.fastq

Do you agree?

Thanks a lot again!

ADD REPLYlink 22 months ago
ste.lu
• 40
Entering edit mode
0

correct. But consider following suggestions:

  1. If you want exact motif GTTCATAGCTGTTTC in capital/black letters, remove i (-sdp instead of -sdip ignoring degenerate bases and case specific). i means ignore case.
  2. You can use >. But there are easier ways to program. -o <output.fastq> will give you output.fastq. Better if you use -o out.fastq.gz, program will output gzipped fastq.
  3. If query motif doesn't contain degenerate bases, -sdip and -srip mean the same (as I understand). If your query contains degenerate bases, be aware of the differences between -r and -d options. Explained with example below.

In options -sdip (-d) , d means IUPAC degenerate mode and r means regex. -sdip match degenerate bases in query with those from target sequence. However this doesn't work in reverse.

For eg if query sequence has degenerate base "S", program will match with matching DNA bases "G" or "C" (as per IUPAC DNA code) in the target sequence. Motif GTTCATAGCTGTTTS matches both GTTCATAGCTGTTTG and GTTCATAGCTGTTTC in target. In your case, query sequence doesn't have any degenerate bases in the motif.

Following is an example: Motif is GTTTTCSCC and target is GTTTTCGCC.

$ seqkit grep -srp  GTTTTCSCC goodHg19.fastq

No results

$ seqkit grep -sdp  GTTTTCSCC goodHg19.fastq

result with same reference fastq because S in query matches with G or C:

@SOLEXA-1GA-2_0048_FC6242L:2:1:14720:1151#0/1
GTTTTCGCCANTTTTGCTCAGTACTTCTNNNNNNNN
+
ggcggdfdfdBeeec`ceeegcfgg^aBBBBBBBBB
ADD REPLYlink 22 months ago
cpad0112
11k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0