[solved] Read a file once and get next lines that match a list of pattern with awk ?
3
0
Entering edit mode
7.2 years ago
Franck8413 ▴ 20

Hi everyone,

I have problems with using awk, I don't get what I'm looking for, so I request your help. I have a big file which contains more than a billion lines. This file come from a sequecing and look like this

@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ
@K00114:439:HF27YBBXX:2:1101:28229:1209 1:N:0:NGACTCCT_NTGTAGAT
NAACAAATCAGTGTTCTGTTGTTTGTCAAAATTTTGAACAAGCCTTGCG
+
#AAAAJJJAFJ7FJJFFJJJFJJJJJAJJJJJJJJJJJFJAJJF<A7<F
....

So every four lines I have a new read. I would like to read this file once and test every four line if one of the barcode from a list match with the barcode in the line 1,5,9 ... My list of barcode is in a different file, which in this example can be NGAGGCTG, AAAACCCC, AAAATTTT etc ... If it match, I would like to save the read in a new file. Here, the expected output would be this, because NGAGGCTG is present in my list and in the line starting with the '@'.

@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ

I have to specify that my reads file is zipped, so I start by using gunzip -c read_filename or zcat. Note also, that the '1:N:0:NGAGGCTG_NTGTAGAT' is in the column 2 ($2). I tried many things, but don't know how to read the file just once and print only lines that match with my list of "pattern" and ignore read that doesn't match.

I tried something like this :

gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | head | grep -A3 NGACTCCT | sed -n '/NGACTCCT/ {N;N;N;p;}'

But I don't succeed to make a loop on the pattern to change NGACTCCT by all the barcode from my list, I tried also with the awk structure awk '$2 ~ /pattern/ {for(i=1; i<=4; i++) {getline; print}}' but I also failed.

Thanks for your help !

awk reads barcode sed grep • 2.2k views
ADD COMMENT
3
Entering edit mode
7.2 years ago
gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz  | awk 'NR%4==1 { ok=index($0,"NGAGGCTG")!=0;} {if(ok) print;}'
ADD COMMENT
0
Entering edit mode

Hi Pierre, thanks for the answer. In fact, I did it also with

gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | head | grep -A3 NGACTCCT | sed -n '/NGACTCCT/ {N;N;N;p;}'

But do you know how can I make a loop on this command to change the pattern "NGACTCCT" by others pattern store in a file, cause I have more than 50 barcodes to test ?

ADD REPLY
1
Entering edit mode
grep -A3 NGACTCCT

if your read SEQUENCE (not name) contains this DNA, you're going to mess your input.

cause I have more than 50 barcodes to test ?

use a loop like

for B in ATG GAT ATC
do
gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz  | awk -v B=$B 'NR%4==1 { ok=index($0,B)!=0;} {if(ok) print;}' | gzip > ${B}.fastq.gz
done
ADD REPLY
0
Entering edit mode

Thanks again, This is exactly what I try to obtain. Thanks a lot.

ADD REPLY
2
Entering edit mode
7.2 years ago

SeqKit works easily using seqkit grep command, searching sequences by pattern(s) of name or sequence motifs

Firstly, save barcodes in a plain text file (one seq per line), named `barcodes.txt'.

Then retrieve FASAQ records by sequence motif (barcode):

cat barcodes.txt | while read barcode <&0; do \
    seqkit grep -s -i -p $barcode read_1.fq.gz -o read_1.$barcode.fq.gz;

    # gzip or pigz would be faster
    # gzip -d -c read_1.fq.gz | seqkit grep -s -i -p $barcode  | gzip -c > read_1.$barcode.fq.gz;
done

You can also parallize this using GNU parallel or rush.

cat barcodes.txt | parallel 'seqkit grep -s -i -p {} read_1.fq.gz -o read_1.{}.fq.gz'

cat barcodes.txt | rush 'seqkit grep -s -i -p {} read_1.fq.gz -o read_1.{}.fq.gz'

BBmap can also do this, and it allows mismatch.

ADD COMMENT
1
Entering edit mode

Thanks Shenwei356, I didn't know about seqkit grep and it also produces the expected result. Thanks you also for the parallization code, easy to use and usefull for my project !

ADD REPLY
2
Entering edit mode
7.2 years ago
GenoMax 142k

Sounds to me like you are trying to demultiplex this data file.You can use demuxbyname.sh from BBMap suite to do that.

demuxbyname.sh in=r#.fq.gz out=out_%_#.fq.gz prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,... outu=filename

"names=" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=, in2=, out1=, out2= if you want custom naming.

ADD COMMENT

Login before adding your answer.

Traffic: 2138 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6