Question

How do you find the percentage of reads contaminated by rRNA?

8.7 years ago

espop23 ▴ 60

I have a file of the intersecting regions of TSS and rRNA. How do I find the percentage of reads contaminated by rRNA?

RNA R rRNA • 4.5k views

updated 19 months ago by Ram 43k • written 8.7 years ago by espop23 ▴ 60

Ram · Answer 1 · 2015-08-28

Hi,

You can:

1. Download the ncRNA FASTA sequences and the gene set annotation file (GTF). The examples below use the Ensembl GRCh38 files.
2. Extract the rRNA sequences from the FASTA file.
3. Map your reads to these rRNA sequences and count how many reads align.

In more detail:

The fasta file will look like:

head Homo_sapiens.GRCh38.ncrna.fa
>ENST00000629478 proj_ncrna:known chromosome:GRCh38:CHR_HG1832_PATCH:210374154:210374267:-1 gene:ENSG00000281499 gene_biotype:snRNA transcript_biotype:snRNA
ACACTGGTTTCTCTTCAGATCGAATAAATCTTTCGCCTTTTACTAAAGATTTCCGTGGAG
AGAAACAAATCAGTTATAAGCTAATTTTTTGTAAGCCTTGCCCTGGGGAGGCAG
>ENST00000516494 proj_ncrna:known chromosome:GRCh38:CHR_HG2128_PATCH:67546651:67546754:1 gene:ENSG00000252303 gene_biotype:snRNA transcript_biotype:snRNA
GTGCTCACTTTGGCAACATACATACTAAAATTGGACGGATACAGACATAAACATGGCCCC
TGCACAAGGATGACATGCAAATTCATGAAGCATTCCATATTTTT

And the GTF file:

head rRNA_Homo_sapiens.GRCh38.81.gtf 
1    ensembl    gene    9437669    9437778    .    -    .    gene_id "ENSG00000252956"; gene_version "1"; gene_name "RNA5SP40"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    13623184    13623284    .    -    .    gene_id "ENSG00000222952"; gene_version "1"; gene_name "RNA5SP41"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    34112949    34113063    .    +    .    gene_id "ENSG00000201148"; gene_version "1"; gene_name "RNA5SP42"; gene_source "ensembl"; gene_biotype "rRNA";
1    ensembl    gene    37264677    37264786    .    -    .    gene_id "ENSG00000252368"; gene_version "1"; gene_name "RNA5SP43"; gene_source "ensembl"; gene_biotype "rRNA";
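
As a rough sketch of the parsing step, the rRNA genes can be pulled out of the GTF by filtering on the gene_biotype attribute. The function name rrna_genes_from_gtf and the biotypes parameter are made up for this illustration, and it assumes a tab-delimited GTF with Ensembl-style attributes as shown above. Ensembl annotates mitochondrial rRNA under a separate biotype (Mt_rRNA), so you may want to pass that in as well.

```python
import re


def rrna_genes_from_gtf(lines, biotypes=("rRNA",)):
    """Return {gene_id: gene_name} for GTF 'gene' rows with a wanted biotype.

    Hypothetical helper: assumes tab-delimited GTF lines whose ninth column
    holds attributes written as key "value"; pairs.
    """
    genes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level rows only
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_biotype") in biotypes and "gene_id" in attrs:
            gene_id = attrs["gene_id"]
            # Fall back to the ID when the row carries no gene_name.
            genes[gene_id] = attrs.get("gene_name", gene_id)
    return genes
```

Calling it as rrna_genes_from_gtf(open("Homo_sapiens.GRCh38.81.gtf")) would give a dictionary you can reuse when writing the FASTA headers.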

You can parse these files and build an rRNA reference FASTA, which will look like:

>RNA5SP40|ENSG00000252956
GTCTATGGCCATTGCACCCTGAACGTGCCAGATCTTGTCTCATCTTGGAAGCTAAGCAGGGTTGGGCTTGGAGGGGAGGAGGGTGAACCTCAGTTCAGGTTACTTAGCCT
>RNA5SP41|ENSG00000222952
GCCTACGGCCATACCATTCTGGATGCGTCTCAGAAGCTAAGCAGGGTCAGACCTGGCTGGTACTTGGATGGGAGTATATCAGCCACTGGGTGCTGTGGTGC
>RNA5SP42|ENSG00000201148
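
One possible way to build such a file, as a sketch only: read the ncRNA FASTA, keep the records whose gene: field is in a set of rRNA gene IDs, and rewrite the header as name|gene_id. The names parse_fasta and rrna_reference are invented here, and gene_names is assumed to be a dict of gene ID to gene name (for example taken from the gene_id and gene_name attributes in the GTF). Note that a gene with several transcripts would produce several records with the same header.

```python
import re


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def rrna_reference(fasta_lines, gene_names):
    """Return FASTA text with only the records whose gene ID is in gene_names.

    Hypothetical helper: assumes Ensembl-style headers containing 'gene:<ID>'.
    """
    out = []
    for header, seq in parse_fasta(fasta_lines):
        match = re.search(r"gene:(\S+)", header)
        if not match:
            continue
        gene_id = match.group(1).split(".")[0]  # drop a version suffix if present
        if gene_id in gene_names:
            out.append(f">{gene_names[gene_id]}|{gene_id}\n{seq}\n")
    return "".join(out)
```

The returned text can be written to a file and indexed with whichever aligner you plan to use.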

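Then map your reads to this rRNA reference. The percentage of rRNA contamination is the number of reads that map to it divided by the total number of reads, multiplied by 100.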
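
To get the actual number once the reads are mapped to the rRNA reference, one option is to count records in the SAM output. This is a minimal sketch, assuming the aligner writes unmapped reads to the SAM file too (so the total is complete) and that each mate of a pair is counted as one read. The function name rrna_percentage is made up for the example.

```python
def rrna_percentage(sam_lines):
    """Percentage of reads that have a primary alignment in the SAM input.

    Hypothetical helper: uses the SAM FLAG column (0x4 = unmapped,
    0x100 = secondary, 0x800 = supplementary).
    """
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary/supplementary so each read counts once
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0
```

For a BAM file you would first convert to SAM text (or read the flags with a BAM library) and then apply the same counting.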

Another way is to map your reads to the whole genome, then use the GTF file to annotate the alignments and find the percentage of reads that fall within rRNA genes. Both approaches work fine, with only minor differences in the result.
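
The genome-based route boils down to an overlap count between aligned reads and the rRNA gene coordinates. The sketch below shows the idea in plain Python; the function names and the input shapes (reads as (chrom, start, end) tuples, intervals as a dict of chromosome to (start, end) lists, both 1-based inclusive like the GTF) are assumptions for illustration. In practice a dedicated tool such as featureCounts, htseq-count or bedtools would do this step far more efficiently.

```python
def overlaps_rrna(chrom, start, end, rrna_intervals):
    """True if the read interval overlaps any rRNA gene on the same chromosome."""
    return any(s <= end and start <= e for s, e in rrna_intervals.get(chrom, []))


def percent_rrna_reads(reads, rrna_intervals):
    """Percentage of the given aligned reads that overlap an rRNA gene.

    Hypothetical helper: the denominator is the number of reads passed in,
    so pass all mapped reads to get "percent of mapped reads".
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(overlaps_rrna(c, s, e, rrna_intervals) for c, s, e in reads)
    return 100.0 * hits / len(reads)
```

This simple version ignores strand and counts a read once even if it overlaps several rRNA genes.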

I hope this helps.