How to extract sequences from multiple fastq files based on a certain sequence ?
3
0
Entering edit mode
10 months ago

Dear all,

I would like to know if it is possible to extract/retain sequences from multiple fastq files based on a certan input sequences, and get new fastq files containing only sequences sharing the input sequence I provided.

I have found this command line that does the job, but it does that for a single fastq file, while I would like to have a command line that does such function for all the fastq file I have in the folder.

zcat Root-R01_AGA_L001_R2_001.fastq.gz \
| paste - - - - \
| awk -v FS="\t" -v OFS="\n" '$2 ~ "TGGACTAC" {print $1, $2, $3, $4}' \
| gzip > Root-R01_AGA_L001_R2_001.filtered.fq.gz

Can someone help me in this?

Many thanks in advance

sequences fastq • 822 views
ADD COMMENT
3
Entering edit mode
10 months ago

use for loop or parallel:

$ find ./ -name "*.fq.gz"  \
    | parallel --dry-run 'seqkit grep -P -s -p AGAACATCTCTCGGAGCCACAGG {} -o {.}.filtered.gz'
seqkit grep -P -s -p AGAACATCTCTCGGAGCCACAGG ./G1_1.fq.gz -o ./G1_1.fq.filtered.gz
seqkit grep -P -s -p AGAACATCTCTCGGAGCCACAGG ./G1_2.fq.gz -o ./G1_2.fq.filtered.gz
ADD COMMENT
2
Entering edit mode
10 months ago
iraun 6.2k

You need a loop that goes through all the files in your folder. There are many ways to do it. One option is to go to the folder where you have your files, and run the following:

for f in *.fastq.gz; do
    name=${f%%.*} #Extract the name of the file without the extension
    zcat $f \
    | paste - - - - \
    | awk -v FS="\t" -v OFS="\n" '$2 ~ "TGGACTAC" {print $1, $2, $3, $4}' \
    | gzip > $name.filtered.fq.gz
done
ADD COMMENT
1
Entering edit mode
10 months ago
GenoMax 141k

To complete the set of tools I will mention using bbduk.sh in filter mode. It is part of BBTools package. I will also leave the loop aspect out since it is covered by others.

bbduk.sh -Xmx2g in=file.fq.gz outm=matched.fq.gz literal=sequence_you_want

You can turn off reverse complement lookup if you don't want that by using rcomp=f option.

ADD COMMENT

Login before adding your answer.

Traffic: 1474 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6