Question

[solved] Read a file once and get the lines following a match against a list of patterns with awk?

7.2 years ago

Franck8413 ▴ 20

Hi everyone,

I am having trouble with awk and cannot get the result I want, so I am asking for your help. I have a big file which contains more than a billion lines. This file comes from a sequencing run and looks like this:

@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ
@K00114:439:HF27YBBXX:2:1101:28229:1209 1:N:0:NGACTCCT_NTGTAGAT
NAACAAATCAGTGTTCTGTTGTTTGTCAAAATTTTGAACAAGCCTTGCG
+
#AAAAJJJAFJ7FJJFFJJJFJJJJJAJJJJJJJJJJJFJAJJF<A7<F
....

So every four lines form one read. I would like to read this file once and test, for each read, whether one of the barcodes from a list matches the barcode in the header line (lines 1, 5, 9, ...). My list of barcodes is in a separate file, which in this example could contain NGAGGCTG, AAAACCCC, AAAATTTT, etc. If it matches, I would like to save the read in a new file. Here the expected output would be the following, because NGAGGCTG is present both in my list and in the line starting with '@'.

@K00114:439:HF27YBBXX:2:1101:28209:1209 1:N:0:NGAGGCTG_NTGTAGAT
NGATGGAAGAGCCCAACAGTGAATAACATCAGTAGAGGAGGTCCTGTCT
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJ

I should specify that my reads file is gzipped, so I start with gunzip -c read_filename or zcat. Note also that '1:N:0:NGAGGCTG_NTGTAGAT' is in column 2 ($2). I tried many things, but I don't know how to read the file just once, print only the reads that match my list of patterns, and ignore the reads that don't match.

I tried something like this :

gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | head | grep -A3 NGACTCCT | sed -n '/NGACTCCT/ {N;N;N;p;}'

But I can't manage to loop over the patterns to replace NGACTCCT with each barcode from my list. I also tried the awk construct awk '$2 ~ /pattern/ {for(i=1; i<=4; i++) {getline; print}}', but that failed too.

Thanks for your help !

awk reads barcode sed grep • 2.2k views

score 3 · Accepted Answer · 2017-03-14

7.2 years ago

Pierre Lindenbaum 161k

gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz  | awk 'NR%4==1 { ok=index($0,"NGAGGCTG")!=0;} {if(ok) print;}'

Hi Pierre, thanks for the answer. In fact, I also managed it with

gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | head | grep -A3 NGACTCCT | sed -n '/NGACTCCT/ {N;N;N;p;}'

But do you know how I can loop over this command to replace the pattern "NGACTCCT" with other patterns stored in a file, because I have more than 50 barcodes to test?

ADD REPLY • link 7.2 years ago by Franck8413 ▴ 20

grep -A3 NGACTCCT

If a read's SEQUENCE (not its name) contains this DNA, grep will match that line too, and -A3 will then pull in lines belonging to the following record, so the output gets corrupted.

cause I have more than 50 barcodes to test ?

use a loop like

for B in ATG GAT ATC
do
gunzip -c FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | awk -v B="$B" 'NR%4==1 { ok=index($0,B)!=0;} {if(ok) print;}' | gzip > "${B}.fastq.gz"
done
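
The loop above decompresses and scans the FASTQ once per barcode. If reading the file only once matters, a single awk pass can load the whole barcode list first. This is only a sketch under assumptions not stated in the thread: the function name filter_by_barcodes and the file names are hypothetical, the list holds one barcode per line, and the barcode to test is the part of $2 after the last ':' and before the '_'.

```shell
# Hypothetical helper: $1 = barcode list (one per line), $2 = gzipped FASTQ.
# Prints every 4-line record whose header barcode is in the list.
filter_by_barcodes() {
    gunzip -c "$2" | awk -v list="$1" '
        BEGIN {
            # Load the barcode list into a lookup table before reading any read.
            while ((getline b < list) > 0)
                if (b != "") wanted[b] = 1
        }
        NR % 4 == 1 {
            # Header line, e.g. "@K00114:... 1:N:0:NGAGGCTG_NTGTAGAT"
            n = split($2, part, ":")     # part[n] = "NGAGGCTG_NTGTAGAT"
            split(part[n], idx, "_")     # idx[1]  = "NGAGGCTG" (assumed layout)
            ok = (idx[1] in wanted)      # 1 if listed, 0 otherwise
        }
        ok                               # print the line while the record matches
    '
}
```

Usage would look like: filter_by_barcodes barcodes.txt FCHF27YBBXX_L2_CHKPEI00001135_1.fq.gz | gzip > matched.fastq.gz. Because the test is an exact lookup on the header field, a barcode that happens to occur inside a read sequence is not matched.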

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 161k

Thanks again, this is exactly what I was trying to obtain. Thanks a lot.

ADD REPLY • link 7.2 years ago by Franck8413 ▴ 20

score 2 · Accepted Answer · 2017-03-14

SeqKit handles this easily with the seqkit grep command, which searches sequences by name pattern(s) or by sequence motif(s).

First, save the barcodes in a plain text file (one barcode per line) named `barcodes.txt`.

Then retrieve FASTQ records by sequence motif (barcode):

while read -r barcode; do
    # -s matches against the read sequence (not the header); -i ignores case
    seqkit grep -s -i -p "$barcode" read_1.fq.gz -o "read_1.$barcode.fq.gz"

    # decompressing with gzip or pigz would be faster
    # gzip -d -c read_1.fq.gz | seqkit grep -s -i -p "$barcode" | gzip -c > "read_1.$barcode.fq.gz"
done < barcodes.txt

You can also parallelize this using GNU parallel or rush.

cat barcodes.txt | parallel 'seqkit grep -s -i -p {} read_1.fq.gz -o read_1.{}.fq.gz'

cat barcodes.txt | rush 'seqkit grep -s -i -p {} read_1.fq.gz -o read_1.{}.fq.gz'

BBMap can also do this, and it allows mismatches.

score 2 · Accepted Answer · 2017-03-14

It sounds to me like you are trying to demultiplex this data file. You can use demuxbyname.sh from the BBMap suite to do that.

demuxbyname.sh in=r#.fq.gz out=out_%_#.fq.gz prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,... outu=filename

"names=" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=, in2=, out1=, out2= if you want custom naming.
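
A text file for "names=" needs the barcode strings exactly as they appear in the read headers, and those can be pulled straight from the FASTQ. A rough sketch, assuming headers shaped like the ones in the question (the barcode is the last ':'-separated part of field 2). The function name list_header_barcodes and the file names are hypothetical and not part of BBMap.

```shell
# Hypothetical helper: $1 = gzipped FASTQ.
# Prints each distinct header barcode string once, sorted.
list_header_barcodes() {
    gunzip -c "$1" | awk '
        NR % 4 == 1 {
            n = split($2, part, ":")   # e.g. 1:N:0:NGAGGCTG_NTGTAGAT
            if (n > 0) print part[n]   # keep the barcode part as written in the header
        }
    ' | sort -u
}
```

Usage would look like: list_header_barcodes read_1.fq.gz > names.txt. The result lists every barcode string observed, including ones with miscalled bases such as a leading N, so it would still need trimming to the expected barcodes.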