Read count from a batch of fastq.gz files
1
0
Entering edit mode
6.6 years ago
Mitra • 0

Hi, How can we do Read count from a batch of fastq.gz files.

Obviously if we unzip it then it is easy by grep.

But what is the best way to do without unzipping.

I tried this for a batch of files:

for i in `ls  RIF*.fastq.gz | sort`; do zcat $i | echo $((`wc -l`/4)) ; done

Gives me result:

133408
133408
127328
127328
124862
124862
156815
156815
146333
146333
123459
123459
159771
159771
126039
126039
167112
167112
97320
97320

But When I unzip those same files and run grep with the identifier like

for i in `ls  *.fastq | sort`; do grep "@M03691" $i |wc -l; done

I get true result as

132637
132637
126435
126435
123616
123616
156347
156347
145592
145592
123059
123059
158654
158654
125626
125626
166377
166377
97159
97159

Which is clearly a mismatch. Can anybody please suggest me what i am missing out in my first try. Thanks, Mitra

next-gen sequencing • 6.4k views
ADD COMMENT
1
Entering edit mode

If you have zcat then you must have zgrep. Use zgrep -c "^@Sequencer_ID" file.fq.gz to get your counts.

ADD REPLY
0
Entering edit mode

Great works thanks..don't know why zgrep didn't come to my mind.

But still I really wonder why wc -l`/4 didn't work. Definitely all these fastq files have 4 lines.

ADD REPLY
1
Entering edit mode
6.6 years ago

Apparently not all of your read names start with M03691.

ADD COMMENT
0
Entering edit mode

Which should not happen for a file for one sample (unless multiple sequencers are involved) :)

ADD REPLY
0
Entering edit mode

So far my money is on that as being the most likely explanation.

ADD REPLY
0
Entering edit mode

No ..thats not the fact ..and yes they all named as M03691 as much I can see.

ADD REPLY
0
Entering edit mode

Either they don't all start with that or you have cases where each entry isn't 4 lines. There are no other possibilities.

ADD REPLY
0
Entering edit mode

If this dataset was trimmed on the sequencer then there may be some entries with no sequence. Just a guess.

ADD REPLY
0
Entering edit mode

BTW, wc -l / 4 ends up truncating the division. It'd be interesting to not round and see if your files have line numbers that aren't multiples of 4.

ADD REPLY
0
Entering edit mode

Also when I am checking with one file I get exact match..but somehow above command not working..

 grep "@M" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592
grep "@M0369" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592

This confirms that all the reads have same name.

Also with line count for single file matches.

cat RIFA3D26_S20_L001_R2_001.fastq |  echo $((`wc -l`/4))
145592

Just still leaves me in confusion why line count don't work within zip files.

ADD REPLY
0
Entering edit mode

Your question isn't why line counting doesn't work (it does), but rather why you get different numbers. That's a rather important distinction.

ADD REPLY
0
Entering edit mode

Yes I described above two different ways of Read count from a batch of fastq.gz files ..out of which $((wc -l/4)) don't work (rather I must say not getting expected or correct result) when applied to zip files. So I assume this is something to do with the zip files ..and try to understand how the wc works in zip files. Thanks Mitra

ADD REPLY

Login before adding your answer.

Traffic: 1907 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6