Getting sequences from fastq file using Grep command
3
0
Entering edit mode
10 months ago
mohsamir2016 ▴ 30

I have been trying to get a sequence (e.g. GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA) from a fastq file (file.fastq) I have and output a fq file. I have tried the command:

grep -A 2 -B 1 'GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA' file.fastq | sed '/--/d' > output.fq

I got an output as 6- lines of reads that are different, while my target sequence is basically more in number. A part from that they might be just not there, Are the - A and - B is getting only the upstream and trailing sequences surroundng my target or do they get these along with the target sequences. I red the -- help page, but could not understand it. Thanks for any help.

Linux • 1.8k views
ADD COMMENT
0
Entering edit mode

use --no-group-separator instead of sed

ADD REPLY
0
Entering edit mode

Are you confident this sequence will only appear once?

ADD REPLY
0
Entering edit mode

Actually it is much longer I only gives example...

ADD REPLY
0
Entering edit mode

let's try another way. Show us the output of:

cat file.fastq | paste - - - - | grep -F GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA 
ADD REPLY
0
Entering edit mode

Could you let me know what paste is doing ?

Thanks

ADD REPLY
1
Entering edit mode

Converting to a single line

ADD REPLY
1
Entering edit mode
10 months ago
GenoMax 141k

Use bbduk.sh from BBMap suite in filter mode. grep has its place but a proper tool is foolproof.

bbduk.sh -Xmx2g in=your.fq outm=filtered.fq literal=GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA

Add rcomp=f if you don't want to find reverse complemented sequence.

ADD COMMENT
1
Entering edit mode
10 months ago

Try seqkit grep, which searches on both strands, you might need to add --only-positive-strand.

seqkit grep -s -p GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA file.fastq -o out.fq.gz
ADD COMMENT
0
Entering edit mode
10 months ago

If things aren't working the way you think they should, go simpler to troubleshoot. Do

grep 'GCGAGCCCCACATCGCCCCCCCGATTGTAATAAATAA' file.fastq  | wc -l

Do you get more than 6 lines?

(Is your fastq really unzipped?)

ADD COMMENT
0
Entering edit mode

It is actually unzipped and it gave me 6-lines when running your line

ADD REPLY
0
Entering edit mode

Then that exact sequence is only in there 6 times.

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6