I am currently using STAR to map several HiSeq mRNA runs.
I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)
In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.
The unmapped bin that most of the reads fall into is "too short". Which parameter might I be mis-specifying? These are PE 101 Illumina reads, the quality is pretty good up until the ~85th base, and we have around 200M reads per sample.
If you can tell me what may be causing the "unmapped: too short" set of reads to be so large, I'd really appreciate it.
My command line looks like this:
# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]
/path/to/STAR --genomeDir /path/to/Fly/ --readFilesCommand 'zcat -fc' --readFilesIn $1 $2 --runThreadN 32 --genomeLoad LoadAndRemove --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 2 --outSAMstrandField None --outSAMmode Full --outSAMattributes Standard --outSAMunmapped None --outFilterType BySJout --outStd SAM | samtools view -b -o $3_STAR.bam -S -
And all Final Logs look something like this:
Started job on | May 20 22:49:23
Started mapping on | May 20 22:52:08
Finished on | May 21 05:18:10
Mapping speed, Million of reads per hour | 32.74
Number of input reads | 210640950
Average input read length | 202
Uniquely mapped reads number | 29841188
Uniquely mapped reads % | 14.17%
Average mapped length | 190.07
Number of splices: Total | 4405621
Number of splices: Annotated (sjdb) | 4123661
Number of splices: GT/AG | 4348429
Number of splices: GC/AG | 25290
Number of splices: AT/AC | 664
Number of splices: Non-canonical | 31238
Mismatch rate per base, % | 1.23%
Deletion rate per base | 0.03%
Deletion average length | 1.92
Insertion rate per base | 0.02%
Insertion average length | 2.41
Number of reads mapped to multiple loci | 36646260
% of reads mapped to multiple loci | 17.40%
Number of reads mapped to too many loci | 229494
% of reads mapped to too many loci | 0.11%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 58.84%
% of reads unmapped: other | 9.49%
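To make the split easier to see, here is a minimal sketch of how the percentages above can be tallied with awk. It assumes the percentage lines are copied from the log above into a temporary file (the variable names `log` and `summary` are only for illustration); with a real run, the same awk command could presumably be pointed at the Log.final.out file directly.

```shell
# Percentage lines copied from the final log above into a scratch file.
log=$(mktemp)
cat > "$log" <<'EOF'
Uniquely mapped reads % | 14.17%
% of reads mapped to multiple loci | 17.40%
% of reads mapped to too many loci | 0.11%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 58.84%
% of reads unmapped: other | 9.49%
EOF

# Split each line on "|", strip "%" and whitespace from the value,
# then add it to the "mapped" or "not mapped" total based on its label.
summary=$(awk -F'|' '
  { gsub(/[% \t]/, "", $2) }
  $1 ~ /Uniquely mapped reads %|mapped to multiple loci/ { mapped += $2 }
  $1 ~ /unmapped|too many loci/                          { not_mapped += $2 }
  END { printf "mapped=%.2f not_mapped=%.2f", mapped, not_mapped }
' "$log")

echo "$summary"
rm -f "$log"
```

For this log it prints `mapped=31.57 not_mapped=68.44`, so the "too short" bin alone (58.84%) accounts for most of the reads that are not reported.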
This is the command line I used to build the STAR index.
/path/to/STAR_2.3.0e/STAR --runMode genomeGenerate --genomeDir /path/to/Genomes/Fly/ --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa --runThreadN 16 --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf --sjdbGTFtagExonParentTranscript transcript_id --sjdbOverhang 100