Question

Read count from a batch of fastq.gz files

0

Entering edit mode

6.7 years ago

Mitra • 0

Hi, How can we do Read count from a batch of fastq.gz files.

Obviously if we unzip it then it is easy by grep.

But what is the best way to do without unzipping.

I tried this for a batch of files:

for i in `ls  RIF*.fastq.gz | sort`; do zcat $i | echo $((`wc -l`/4)) ; done

Gives me result:

But When I unzip those same files and run grep with the identifier like

for i in `ls  *.fastq | sort`; do grep "@M03691" $i |wc -l; done

I get true result as

Which is clearly a mismatch. Can anybody please suggest me what i am missing out in my first try. Thanks, Mitra

next-gen sequencing • 6.5k views

ADD COMMENT • link updated 6.7 years ago by Devon Ryan 104k • written 6.7 years ago by Mitra • 0

1

Entering edit mode

If you have zcat then you must have zgrep. Use zgrep -c "^@Sequencer_ID" file.fq.gz to get your counts.

ADD REPLY • link 6.7 years ago by GenoMax 142k

0

Entering edit mode

Great works thanks..don't know why zgrep didn't come to my mind.

But still I really wonder why wc -l`/4 didn't work. Definitely all these fastq files have 4 lines.

ADD REPLY • link 6.7 years ago by Mitra • 0

score 1 · Answer 1 · 2017-09-13

1

Entering edit mode

6.7 years ago

Devon Ryan 104k

Apparently not all of your read names start with M03691.

ADD COMMENT • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

Which should not happen for a file for one sample (unless multiple sequencers are involved) :)

ADD REPLY • link 6.7 years ago by GenoMax 142k

0

Entering edit mode

So far my money is on that as being the most likely explanation.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

No ..thats not the fact ..and yes they all named as M03691 as much I can see.

ADD REPLY • link 6.7 years ago by Mitra • 0

0

Entering edit mode

Either they don't all start with that or you have cases where each entry isn't 4 lines. There are no other possibilities.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

If this dataset was trimmed on the sequencer then there may be some entries with no sequence. Just a guess.

ADD REPLY • link 6.7 years ago by GenoMax 142k

0

Entering edit mode

BTW, wc -l / 4 ends up truncating the division. It'd be interesting to not round and see if your files have line numbers that aren't multiples of 4.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

Also when I am checking with one file I get exact match..but somehow above command not working..

 grep "@M" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592
grep "@M0369" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592

This confirms that all the reads have same name.

Also with line count for single file matches.

cat RIFA3D26_S20_L001_R2_001.fastq |  echo $((`wc -l`/4))
145592

Just still leaves me in confusion why line count don't work within zip files.

ADD REPLY • link 6.7 years ago by Mitra • 0

0

Entering edit mode

Your question isn't why line counting doesn't work (it does), but rather why you get different numbers. That's a rather important distinction.

ADD REPLY • link 6.7 years ago by Devon Ryan 104k

0

Entering edit mode

Yes I described above two different ways of Read count from a batch of fastq.gz files ..out of which $((wc -l/4)) don't work (rather I must say not getting expected or correct result) when applied to zip files. So I assume this is something to do with the zip files ..and try to understand how the wc works in zip files. Thanks Mitra

ADD REPLY • link 6.7 years ago by Mitra • 0