Question

Long Read Length, yet STAR says many reads too short

7.1 years ago

dec986 ▴ 370

Hello,

I've aligned single-cell RNA seq to mm10 using STAR. I only get about 13% uniquely mapped reads, with 79% being too short.

I get the following output:

                             Started job on |   Mar 09 14:04:53
                         Started mapping on |   Mar 09 14:07:01
                                Finished on |   Mar 09 14:23:11
   Mapping speed, Million of reads per hour |   67.13

                      Number of input reads |   18088226
                  Average input read length |   47
                                UNIQUE READS:
               Uniquely mapped reads number |   2298713
                    Uniquely mapped reads % |   12.71%
                      Average mapped length |   44.12
                   Number of splices: Total |   54580
        Number of splices: Annotated (sjdb) |   0
                   Number of splices: GT/AG |   51443
                   Number of splices: GC/AG |   601
                   Number of splices: AT/AC |   27
           Number of splices: Non-canonical |   2509
                  Mismatch rate per base, % |   6.80%
                     Deletion rate per base |   0.02%
                    Deletion average length |   1.51
                    Insertion rate per base |   0.02%
                   Insertion average length |   1.40
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1405637
         % of reads mapped to multiple loci |   7.77%
    Number of reads mapped to too many loci |   95119
         % of reads mapped to too many loci |   0.53%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   78.96%
                 % of reads unmapped: other |   0.03%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

These two pieces of information appear to contradict each other:

1) 79% of reads are reported as unmapped because they are "too short"

2) The average input read length is 47 nucleotides.

I looked at the fastq file and there aren't many short reads. I don't understand what went wrong.

What explains the poor alignment?

STAR RNA-Seq single-cell alignment • 15k views

ADD COMMENT • link updated 7.1 years ago by datascientist28 ▴ 560 • written 7.1 years ago by dec986 ▴ 370

Did you check with fastQC how your read length distribution is?

ADD REPLY • link 7.1 years ago by Benn 8.3k

Hi b.nota,

![enter image description here][1]

This just looks like a zoomed-out view. Yes, some reads are too short, but they should only be a fraction of a percent.

[1]: https://postimg.org/image/5xdvullcz/

ADD REPLY • link 7.1 years ago by dec986 ▴ 370

You can change the minimum read length manually; hopefully this helps. See this previous post:

STAR Aligner minimum read-length

ADD REPLY • link 7.1 years ago by Benn 8.3k

score 8 · Answer 1 · 2017-03-21

I had a similar problem a couple of months ago. The category is poorly labeled: "too short" doesn't literally mean the reads are too short.

"too short" means that the best alignments STAR found were too short to pass the filters.

This is controlled by --outFilterScoreMinOverLread and --outFilterMatchNminOverLread, which are both set to 0.66 by default. That means ~2/3 of the total read length (sum of mates for paired-end reads) must be mapped.
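To make the threshold concrete, here is a minimal sketch of the check described above. It is a simplification, not STAR's actual code: the function name is made up for illustration, and STAR's internal rounding may differ slightly. It only shows that with 47-nt reads and the default of 0.66, an alignment needs a score and a matched-base count of roughly 31 or more to be kept.

```python
def passes_too_short_filters(score, n_matched, read_length,
                             score_min_over_lread=0.66,
                             match_nmin_over_lread=0.66):
    """Simplified model of the two read-length-normalized filters.

    Hypothetical helper for illustration only; STAR's exact
    rounding of the thresholds is not reproduced here.
    """
    min_score = score_min_over_lread * read_length
    min_matched = match_nmin_over_lread * read_length
    return score >= min_score and n_matched >= min_matched


read_length = 47  # average input read length from the log above

# Treat score and matched bases as equal to keep the example simple.
for aligned in (25, 31, 32, 44):
    kept = passes_too_short_filters(aligned, aligned, read_length)
    print(aligned, "kept" if kept else "reported as too short")
```

So a full-length 47-nt read whose best alignment only covers 25 bases lands in the "too short" category, even though the read itself is not short.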

So what your output means is that ~12.7% of the reads aligned uniquely, ~7.8% aligned but multimapped, and ~79% couldn't be aligned with the above parameters. You can try reducing these parameters to see how many more reads map. However, with an alignment rate like that, it looks like your data might just be contaminated :((
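If you want to try lowering the two parameters, something like the following sketch builds the command line. The index directory, FASTQ name, output prefix, thread count and the value 0.3 are all placeholders I made up, not recommendations; only the STAR option names themselves are real.

```python
import shlex


def star_command(genome_dir, fastq, prefix, min_frac=0.3, threads=4):
    """Build a STAR command with relaxed length-normalized filters.

    min_frac=0.3 is an arbitrary illustrative value (default is 0.66).
    """
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,
        "--outFileNamePrefix", prefix,
        "--outFilterScoreMinOverLread", str(min_frac),
        "--outFilterMatchNminOverLread", str(min_frac),
    ]


# Placeholder paths: substitute your own index and reads.
cmd = star_command("mm10_index", "cell.fastq", "relaxed_")
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to run it
```

Compare the Log.final.out of the relaxed run with the original one. If most of the rescued reads only map over a small part of their length, that points to a problem with the reads rather than with the filter settings.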

Here is a link to the convo I had with Alex, the developer of STAR. LINK