Question: Number of reads in different alignments
Entering edit mode

Dear all,

Just to confirm: it is normal to have a different number of reads when aligning the same reads against two different genomes?

I think the answer is YES because the aligner generates different hits (reads) according to the reference genome. Also, the number of reads of the starting fastq file is different from that of the SAM file, thus SAM does not report a number that is tethered to the starting file.

For instance, I have:

x=$(zcat <file>_1.fq.gz | wc -l)
echo $L
1 190 389 447 # same number on the paired file
samtools flagstat <align_first_reference>.sam |  grep 'total' | cut -d+ -f1
2 383 795 990
samtools flagstat <align_second_reference>.sam |  grep 'total' | cut -d+ -f1
2 381 177 452

Thank you.

Entering edit mode

You also need to account for secondary alignments that are bound to be different as well.

ADD REPLYlink 17 months ago
Entering edit mode

If they're different genomes, of course. The sequence won't be exactly the same so it's to be expected that different numbers will map.

ADD COMMENTlink 17 months ago jrj.healey 12k
Entering edit mode

Also, filled gaps in newer versions of the reference may cause reads to become multimappers and potentially trigger the aligner to flag them as unmapped, given they align to too many regions. HISAT2 has such an issue if I remember correctly, producing notably different alignments when using hg19 vs hg38. What kind of references are you using?

ADD REPLYlink 17 months ago

Login before adding your answer.

Powered by the version 1.8