Hi all,
I am a newbie working on RNA-Seq analysis. I have paired-end samples prepared with the Illumina RNA Exome and Illumina TruSeq library protocols. I first did quality control and did not see any adapter contamination. I trimmed the first 15 bases of each read and aligned with the STAR aligner. The samples are human, so I aligned against the hg38 reference genome. But the alignment rate is low: I got around 64% uniquely mapped reads. I am not sure whether this is rRNA contamination, considering that the two library protocols lack an rRNA depletion step. I tried to output the unmapped reads using --outReadsUnmapped Fastx. The output I get with this option is unmapped.out.mate1 and unmapped.out.mate2. I am not sure if these are sam, bam or fastq files. From the manual I see that you get either fasta or fastq files, but I just see unmapped.out.mate1 and unmapped.out.mate2. I am trying to run blast on these unmapped reads to see if anything matches. Could somebody help me with converting the unmapped.out.mate1 file to a fastq file?
Thanks in advance.
Best, Prat
Please don't do that. You are throwing away good data.
Does the genome used to make STAR index have rDNA repeat in it?
Have you done a
head -8 unmapped.out.mate1
? That file may already be fastq reads.
Hi Devon, here is my FastQC report. The input reads are 2 x 150 bp. I am unable to paste the FastQC image here. The mean quality value starts at 30 and goes to 40, and it is a straight line from there without any plateaus. I see a plateau from 30 to 40 for 10 bp, so I have head-cropped those 10 bases.
Don't head crop, leave it as is.
prathyushareddy87 : See How to add images to a Biostars post
Look at the first few lines of the unmapped files with
head
. Then you'll know if they're fastq or fasta or something else.
What was the overall alignment rate? Usually it's something like 98%, so there's no point in blasting the small percentage of junk that didn't align.
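The `head` check suggested above can also be scripted. The following is a minimal sketch, not part of STAR: `sniff_format` is a hypothetical helper name, and it assumes that a first line starting with `@` means FASTQ and one starting with `>` means FASTA.

```shell
# Guess the format of a reads file from the first character of its first line.
# Assumption: '@' suggests FASTQ, '>' suggests FASTA; anything else is "unknown".
sniff_format() {
    first_line=""
    # Read only the first line; an empty file leaves first_line empty.
    IFS= read -r first_line < "$1" || true
    case "$first_line" in
        "@"*) echo "fastq" ;;
        ">"*) echo "fasta" ;;
        *)    echo "unknown" ;;
    esac
}

# Example usage (file name taken from the thread):
# sniff_format unmapped.out.mate1
```

If the answer is `fastq`, no conversion should be needed: the file is plain text, so copying it to a name ending in `.fastq` is enough for tools that insist on the extension.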
I am pasting my STAR output here.
I was wondering whether the number of reads mapped to multiple loci and the % of reads unmapped: too short are high.
The multimapping rate is quite normal. The too-short rate seems high, though I expect that those are junk reads. We've had a few machines start spitting out low complexity sequence that will sort of align if you soft-clip it enough (but it's junk, so it's best not to). Have a look at a few reads and see what they look like. If you blast them and they turn out to be random sequencer junk then don't worry about it.
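For comparing these rates across samples, the relevant lines can be pulled out of STAR's Log.final.out. This is only a sketch: `star_rates` is a hypothetical helper name, and it assumes the usual `label | value` layout of that file.

```shell
# Print selected mapping-rate lines of a STAR Log.final.out as "label<TAB>value".
# Assumption: each line of interest has the form "<label> |<whitespace><value>".
star_rates() {
    awk -F'|' '
        /Uniquely mapped reads %/ ||
        /% of reads mapped to multiple loci/ ||
        /% of reads unmapped: too short/ ||
        /% of reads unmapped: other/ {
            label = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", label)   # trim whitespace around the label
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim whitespace around the value
            print label "\t" value
        }' "$1"
}

# Example usage:
# star_rates Log.final.out
```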
Thank you so much for your feedback and comments. I will try aligning my fastq files to the reference genome without head-cropping, to see whether that improves mapping and avoids losing good data. I tried the head command on the out.mate1 files and they are fastq files. I will run blast on them and see whether they are random sequencer junk.
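BLAST takes FASTA input, so a handful of the unmapped reads need converting first. A minimal sketch, assuming strict 4-line FASTQ records (no wrapped sequence lines); `fastq_head_to_fasta` and the output file name are hypothetical:

```shell
# Convert the first N records of a FASTQ file to FASTA on stdout.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
fastq_head_to_fasta() {
    fastq_file=$1
    n_records=${2:-100}   # default to the first 100 reads
    head -n "$((n_records * 4))" "$fastq_file" |
        awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
             NR % 4 == 2 { print }                     # keep the sequence line'
}

# Example usage (file name taken from the thread):
# fastq_head_to_fasta unmapped.out.mate1 100 > unmapped_mate1_first100.fasta
```

The first reads in a file are not a random sample, but a few dozen of them are usually enough to tell whether the unmapped reads look like rRNA or low-complexity junk.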
Not sure if you are going to be able to get a big improvement.
Reads that are not mapping are too short.
I am not sure how to deal with % of reads unmapped: too short | 16.49%. I am wondering if there is any parameter in the STAR aligner that I could use to reduce the % of unmapped reads.
There is such an option, but you should first see if it's worthwhile to map those.
I am just wondering whether being more liberal with the STAR parameters would improve my mapping. But I understand that the quantity or quality of the RNA used might be what is causing this. I have samples processed with different library protocol kits (Illumina RNA Exome, Lexogen QuantSeq 3-prime sequencing). The STAR alignment rate for Lexogen QuantSeq was very low, just 45%, with % of reads unmapped: other | 27%. The input read length is 75 bp for QuantSeq and 150 bp for Illumina. My FastQC results look good, but the alignment rate is very low.
Ah Lexogen, that explains things. You're not going to do much better, then: those libraries produce a fair amount of junk sequence.