Question

Different % alignment in Log Final & sum of ReadsPerGene

5.0 years ago

andrebolerbarros • 0

Hi everyone,

I'm using STAR to align my reads against the Mus musculus genome. After alignment, I use Log.final.out to assess the quality of the alignment (% uniquely mapped, etc.). Since I want to import gene counts into DESeq2, I run STAR with the --quantMode GeneCounts option.

However, after removing the first 4 lines of each ReadsPerGene file and selecting the column that matches the strandedness of the data, the total sum of counts does not match the number of aligned reads reported in Log.final.out.
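For reference, the calculation described above can be sketched in Python roughly as follows. This assumes the usual ReadsPerGene.out.tab layout: tab-separated, four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) followed by one row per gene, with columns for unstranded, strand-1 and strand-2 counts. The function name and the toy data are made up for illustration.

```python
import io

# The four summary rows STAR writes at the top of ReadsPerGene.out.tab.
SUMMARY_ROWS = ("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous")

def summarize_reads_per_gene(handle, column=1):
    """Return (summary counts, total of per-gene counts) for one count column.

    column: 1 = unstranded, 2 = strand 1, 3 = strand 2
    (index into the tab-separated fields; field 0 is the gene ID).
    """
    summary = {}
    gene_total = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, count = fields[0], int(fields[column])
        if name in SUMMARY_ROWS:
            summary[name] = count
        else:
            gene_total += count
    return summary, gene_total

# Made-up toy data, not from a real run.
toy = (
    "N_unmapped\t10\t10\t10\n"
    "N_multimapping\t5\t5\t5\n"
    "N_noFeature\t20\t60\t25\n"
    "N_ambiguous\t2\t0\t1\n"
    "geneA\t40\t5\t38\n"
    "geneB\t23\t20\t21\n"
)
summary, gene_total = summarize_reads_per_gene(io.StringIO(toy), column=1)
print("summary rows:", summary)
print("reads assigned to genes:", gene_total)
print("column total:", sum(summary.values()) + gene_total)
```

With a real file, you would pass `open("ReadsPerGene.out.tab")` instead of the `io.StringIO` object and pick the column that matches your library's strandedness.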

Could someone help me understand the discrepancy?

Thanks!

STAR • 1.1k views

Please show us an example illustrating your point.

5.0 years ago by Ram 43k

score 1 · Answer 1 · 2019-04-29

5.0 years ago

swbarnes2 14k

I don't see a number of mapped reads in my Log.final.out, or how many reads mapped to genes. It will tell you how many reads mapped uniquely to the genome, but there is no guarantee that those reads can be unambiguously assigned to a gene.

This looks like the explanation. I have seen this too and can provide an example. In my case, these are single-end (3' TagSeq) reads mapped with the following command:

STAR \
 --runThreadN 18 \
 --genomeDir /genomes/MicOch1.0/ \
 --outSAMtype BAM SortedByCoordinate \
 --quantMode TranscriptomeSAM GeneCounts \
 --outFileNamePrefix ../STAR/sample_name. \
 --readFilesIn sample_name.filtered_reads.fastq

The Log.final.out file shows:

                 Number of input reads |    13520742
              Average input read length |   100
                            UNIQUE READS:
           Uniquely mapped reads number |   9255644
                Uniquely mapped reads % |   68.46%
                  Average mapped length |   91.78
               Number of splices: Total |   568878
    Number of splices: Annotated (sjdb) |   380791
               Number of splices: GT/AG |   486735
               Number of splices: GC/AG |   7516
               Number of splices: AT/AC |   512
       Number of splices: Non-canonical |   74115
              Mismatch rate per base, % |   1.47%
                 Deletion rate per base |   0.05%
                Deletion average length |   1.95
                Insertion rate per base |   0.04%
               Insertion average length |   1.71
                     MULTI-MAPPING READS:
Number of reads mapped to multiple loci |   1436994
     % of reads mapped to multiple loci |   10.63%
Number of reads mapped to too many loci |   127772
     % of reads mapped to too many loci |   0.95%
                          UNMAPPED READS:

The ReadsPerGene.out.tab file, meanwhile, begins with the following:

N_unmapped   2,828,104 
N_multimapping   1,436,994 
N_noFeature  6,254,708 
N_ambiguous  44,659

Summing the entire second column yields 13,520,742 reads, exactly the number of input reads. But after removing the first four rows, the number of reads assigned to genes is only 2,956,277. I think the huge discrepancy in my case is due to incomplete annotation of the 3' UTRs in this genome. I still haven't figured out how to deal with that problem, but I am confident that STAR is mapping the reads to the genome; it just can't tell which gene each read comes from.
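As a sanity check on the numbers above, here is a small Python sketch of the bookkeeping. It uses only the figures already quoted from Log.final.out and ReadsPerGene.out.tab; the variable names are mine. It relies on the observation that the count column sums to the number of input reads, so the gene-assigned reads are whatever remains after the four summary rows.

```python
# Figures copied from the Log.final.out and ReadsPerGene.out.tab excerpts above.
input_reads = 13_520_742
uniquely_mapped = 9_255_644

summary = {
    "N_unmapped": 2_828_104,
    "N_multimapping": 1_436_994,
    "N_noFeature": 6_254_708,
    "N_ambiguous": 44_659,
}

# Reads assigned to a gene = column total minus the four summary rows.
assigned = input_reads - sum(summary.values())

# Uniquely mapped reads should split into: no feature, ambiguous, assigned.
unique_accounted = summary["N_noFeature"] + summary["N_ambiguous"] + assigned

print("assigned to genes:", assigned)
print("accounts for all uniquely mapped reads:", unique_accounted == uniquely_mapped)
print("fraction of uniquely mapped reads assigned to a gene:",
      round(assigned / uniquely_mapped, 3))
```

With these numbers, N_noFeature + N_ambiguous + gene-assigned reads adds up to exactly the 9,255,644 uniquely mapped reads, so nothing is lost between the two files. Most uniquely mapped reads simply fall into N_noFeature instead of being counted for a gene.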

3.1 years ago by Ken • 0