Question

Uniquely mapped in Human

0

6.6 years ago

Omics data mining ▴ 260

Hello everyone,

I am working on human RNA-seq data and would like to know what percentage of uniquely mapped reads to expect for human samples. According to previous studies, the expected proportion of uniquely mapped reads in Arabidopsis is approximately 80% to 90%.

In my case, STAR reports a very low percentage of uniquely mapped reads (20-40%) for some samples, while other samples have 70-80% uniquely mapped reads in the BAM file.

Thank you in advance

RNA-Seq • 3.6k views

ADD COMMENT • link updated 6.6 years ago by KVC_bioinfo ▴ 590 • written 6.6 years ago by Omics data mining ▴ 260

0

samtools can be used to collect uniquely mapped reads; see samtools --help.
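
As a rough illustration of what "collecting uniquely mapped reads" means, here is a minimal Python sketch that counts them from SAM text (for example, the output of `samtools view sample.bam`). It assumes the aligner writes the standard `NH:i:` tag (number of reported alignments), which STAR does by default; the function name and the example records are made up for illustration.

```python
def count_unique_reads(sam_lines):
    """Count distinct read names that have a mapped alignment tagged NH:i:1.

    Assumption: the aligner emits the optional NH tag; reads without it
    are not counted. For paired-end data both mates share one read name,
    so a pair is counted once.
    """
    unique_names = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        if "NH:i:1" in fields[11:]:
            unique_names.add(fields[0])
    return len(unique_names)


# Hypothetical records: r1 is unique, r2 maps to two loci, r3 is unmapped.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(count_unique_reads(example))
```

With STAR's default settings, uniquely mapped reads also get MAPQ 255, so `samtools view -c -q 255 sample.bam` should give a comparable count of alignment records (mates counted separately for paired-end data).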

ADD REPLY • link 6.6 years ago by qudrat ▴ 100

0

I have already used it. What I want to know is the percentage of uniquely mapped reads needed for downstream analysis in human. I think samples with only 20-30% uniquely mapped reads should be excluded from downstream work. Here are the mapping statistics for one of the samples.

                     Number of input reads |    36520015
                  Average input read length |   202
                                UNIQUE READS:
               Uniquely mapped reads number |   2175352
                    Uniquely mapped reads % |   5.96%
                      Average mapped length |   197.64
                   Number of splices: Total |   294366
        Number of splices: Annotated (sjdb) |   0
                   Number of splices: GT/AG |   284698
                   Number of splices: GC/AG |   1873
                   Number of splices: AT/AC |   154
           Number of splices: Non-canonical |   7641
                  Mismatch rate per base, % |   0.48%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.36
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.40
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   33005642
         % of reads mapped to multiple loci |   90.38%
    Number of reads mapped to too many loci |   457542
         % of reads mapped to too many loci |   1.25%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   2.40%
                 % of reads unmapped: other |   0.02%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

Only about 6% of the reads in this sample are uniquely mapped. Should I use this kind of data for downstream analysis or discard it?

ADD REPLY • link 6.6 years ago by Omics data mining ▴ 260

1

5% is VERY low for 2x100bp reads.

I would normally expect a minimum of 70%, and I guess we average above 90% with STAR.

If you look at your STAR output, you'll see that your problem is that 90% of reads map to multiple loci.

If this is standard RNA-seq, my first guess would be that you have failed to adequately deplete rRNA before library construction.
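
The reading of the STAR summary above can be automated across samples. The following is a minimal sketch, assuming each sample has a STAR `Log.final.out` file in the `key | value` layout shown in the question; the function names are illustrative, and the 70% cutoff is simply the rule of thumb from this reply, not a STAR setting.

```python
def parse_star_log(text):
    """Turn 'key | value' lines of a STAR final log into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def summarize(stats, min_unique_pct=70.0):
    """Report unique and multi-mapping percentages and flag low samples."""
    def pct(key):
        return float(stats[key].rstrip("%"))

    unique = pct("Uniquely mapped reads %")
    multi = (pct("% of reads mapped to multiple loci")
             + pct("% of reads mapped to too many loci"))
    return {
        "unique_pct": unique,
        "multi_pct": multi,
        "low_unique": unique < min_unique_pct,
    }


# Excerpt of the log posted in this thread.
log_text = """
                    Uniquely mapped reads % |   5.96%
         % of reads mapped to multiple loci |   90.38%
         % of reads mapped to too many loci |   1.25%
"""
print(summarize(parse_star_log(log_text)))
```

A sample flagged this way with a very high multi-mapping percentage is the pattern that suggests rRNA carry-over, although the log alone cannot confirm the cause.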

ADD REPLY • link 6.6 years ago by i.sudbery 19k

0

Dear sudbery,

Thanks for the analytical reply. I agree with your points. For a few samples the uniquely mapped reads are > 70%, while for others they are < 5%. Is there any way to fix this, or should I exclude those samples?

ADD REPLY • link 6.6 years ago by Omics data mining ▴ 260

0

See the answer below. I guess those samples with <5% are almost certainly a dead loss; however, you could look at the counts of reads that map to genes. I'd guess you'd want at least around 10M to use a sample.
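
One way to apply that check is to sum the per-gene counts for each sample and compare the total with the rough 10M figure. The sketch below assumes a tab-separated count table with the gene ID in the first column and no header line, and it skips summary rows whose ID starts with `N_` (as in STAR's `ReadsPerGene.out.tab`) or `__` (as in htseq-count output). The function name, column choice, and gene IDs are illustrative assumptions.

```python
MIN_GENE_READS = 10_000_000  # rough threshold suggested in this reply


def total_gene_counts(lines, count_column=1):
    """Sum the counts in one column of a tab-separated gene count table."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= count_column:
            continue  # blank or malformed line
        if fields[0].startswith(("N_", "__")):
            continue  # summary rows, not genes
        total += int(fields[count_column])
    return total


# Hypothetical table: two summary rows followed by two genes.
example = [
    "N_unmapped\t100\t100\t100",
    "N_multimapping\t5\t5\t5",
    "GENE_A\t7\t3\t4",
    "GENE_B\t10\t10\t0",
]
total = total_gene_counts(example)
print(total, total >= MIN_GENE_READS)
```

For the sample posted above, the 2,175,352 uniquely mapped reads already put an upper bound on the gene-level total well below 10M, whichever counting tool is used.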

ADD REPLY • link 6.6 years ago by i.sudbery 19k

0

Which software did you use? What type of downstream analysis do you intend to do?

ADD REPLY • link 6.6 years ago by qudrat ▴ 100

0

I used STAR to map the RNA-seq data against the human reference genome. The mapped reads will be used for an expression study.

ADD REPLY • link 6.6 years ago by Omics data mining ▴ 260

0

Check the quality of your reads with FastQC. If there are rRNA reads, remove them before mapping. Is this 2x101 bp paired-end data or 202 bp single-end reads?

ADD REPLY • link 6.6 years ago by Chirag Nepal ★ 2.4k

0

A few questions:

What did the QC data look like? Did you perform any pre-processing?
Could you post the STAR command used?

ADD REPLY • link 6.6 years ago by KVC_bioinfo ▴ 590

score 0 · Answer 1 · 2017-09-28

Summarizing the discussion above: your samples are likely contaminated with ribosomal RNA. You can quickly check for it with BBDuk; see this older post.

I think you are in bad shape for differential expression analysis, for two reasons: 1) samples with a low mapping rate will have insufficient coverage on the features of interest (namely, the genes), resulting in low counts and low statistical power; and 2) in my experience, samples with high rRNA contamination usually show other batch effects of unknown cause.

You may try to proceed with the analysis with DESeq2 / edgeR and examine the PCA plot to check for rRNA batch effects, but there is no way around the low power arising from low counts.