Kallisto read count processed does not tally
2
0
Entering edit mode
6.2 years ago
solyris83 ▴ 20

Hi,

I am running Kallisto quant with the following command

kallisto quant -i gencodeV27 -o sample1 --bias --rf-stranded --genomebam --gtf gencode.v27lift37.annotation.gtf --chromosomes chromSize.txt sample1_1.fq sample1_2.fq

And one particular summary line below caught my attention which says (correct me if I am wrong) that there is a total of 71,414,042 paired reads in my fastq files

[quant] processed 71,414,042 reads, 65,645,163 reads pseudoaligned

The processed reads number does not tally against the reads count I get from trim_galore. I am using trimmed fastq files for this testing and the summary reported number from trim_galore after processing is found below, which means that I have only 65,853,559 (= 66089021 - 235462) reads in my input fastq file for kallisto.

RUN STATISTICS FOR INPUT FILE: sample1.fastq.gz
=============================================
66089021 sequences processed in total

Total number of sequences analysed for the sequence pair length validation: 66089021

Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 235462 (0.36%)

Any comment on my interpretation above would be much appreciated. Or TLDR, why is Kallisto processing more reads than what is in the input fastq?

Kallisto trim_galore • 3.6k views
ADD COMMENT
0
Entering edit mode

The nomenclature is often imprecise.

One of your tools seems to report pairs, the other reads. A pair contains two reads.

ADD REPLY
0
Entering edit mode

Try checking the read counts in your file(s) more directly.

grep -e "^@" sample1_1.fq | wc -l (prone to error)

 awk '{s++}END{print s/4}' sample1_1.fq

or for the zipped:

zcat sample1.fastq.gz | awk '{s++}END{print s/4}'

In all my runs, kallisto gives the read count in single-end mode and the pair count in paired-end mode.

ADD REPLY
1
Entering edit mode

Isn't it dangerous to check the read counts in FASTQ with "^@", given than quality lines may also start with "@" ?

ADD REPLY
0
Entering edit mode

Noted. I fixed it to recommend a less error-prone way. Main point is, though, that the questioner needs to verify the files more directly, because I can't replicate his discrepancy.

ADD REPLY
1
Entering edit mode
6.2 years ago
solyris83 ▴ 20

Hi, as recommended by mmfansler, I did a check on the raw fastq files with the shell statement, and as erwan rightly pointed out a lot more "reads" are counted which are basically sequence quality with first base being @.

The read counting from awk '{s++}END{print s/4}' sample1_1.fq tallies to the result from kallisto now.

This is quite surprising as the sequencing service provider actually provided a reads number generated for each of my sample and this read count tallies exactly with the trim galore number before trimming. And this reported number (~60 million reads pair) from service provider and trim galore is actually smaller than the counted number of lines from the fastq file (~70million reads for both fastq in the pair).

Maybe I need to dig a bit deeper into what filter has been applied before the number is reported from the service provider and maybe trim galore.

Thanks a lot for the help!

ADD COMMENT
0
Entering edit mode

I'm wondering if maybe some of your reads have line breaks in them. Technically, the FASTQ spec includes this possibility (like FASTA), and this could cause simple line counting to also be an inaccurate measure. However, it's conventional (and recommended) to have no line breaks in the seqs and quals in FASTQ.

How long are your reads? Have you also tried counting lines in the original file, pre-trimming?

I'm also curious to know how far off the "^@" approach was. Was it counting even higher than the 71M?

ADD REPLY
0
Entering edit mode
6.2 years ago
solyris83 ▴ 20

Hi mmfansler,

Reads are 100bp, I tried counting on the original reads below

Read 1 match on "^@" : 75,681,793 Read 1 count lines divide by 4 : 71,552,293

Read 2 match on "^@" : 77,581,579 Read 2 count lines divide by 4 : 71,552,293

ADD COMMENT

Login before adding your answer.

Traffic: 1928 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6