Question

Microrna Pipeline

7

Entering edit mode

10.1 years ago

xiaojuhu13 ▴ 150

I have two miRNA-seq dataset, already removing adapters appeared in raw data. What I want to do is to detect the differential expression between two samples(no replicates, I know it is not a good experimental design). The pipeline: 1)remove contaminations like mRNA, tRNA, snoRNA and so on; 2) select reads length between 18-26; 3) mapping to mature miRNA downloaded from miRBasev20 using bwa; 4) detect the differential expression between two samples using DESeq. However, I cannot do the first step, is there any mRNA, tRNA, snoRNA database downloaded online and is this step operated in linux or windows? And is this pipeline reasonable or not, do you have any more suggestions?

mirna • 11k views

ADD COMMENT • link updated 6.9 years ago by k.kathirvel93 ▴ 300 • written 10.1 years ago by xiaojuhu13 ▴ 150

0

Entering edit mode

you can use: miRNAkey - A software pipeline for the analysis of microRNA Deep Sequencing data

ADD REPLY • link 10.1 years ago by Rm 8.3k

0

Entering edit mode

You can download a lot of annotations using the UCSC table browser.

ADD REPLY • link 10.1 years ago by David Langenberger 11k

score 9 · Answer 1 · 2014-03-05

9

Entering edit mode

10.1 years ago

David Langenberger 11k

What I recommend to do:

Clip adapter sequences
Map the reads against a complete reference genome and not only against miRBase microRNAs.
Quantify microRNA expression using e.g. bedtools intersect or featureCounts
Use DESeq to find differentially expressed microRNAs

I don't think you have to filter out reads of length between 18-26. The length fraction you get after clipping should be in that range. The microRNA-Seq protocol should result in this length anyway.

Also reads mapping to rRNAs, mRNAs, etc. should be no problem. They map to these regions... no problem for your analysis.

Important: Allow multiple hits of each read when mapping! there are multiple copies of some microRNAs in the genome and if you do not allow it, the results might be misleading, or wrong.

ADD COMMENT • link 10.1 years ago by David Langenberger 11k

0

Entering edit mode

Thanks for so detailed suggestion, and I'm sorry for reply later because of time lag. Take one sample example,

I don't think you have to filter out reads of length between 18-26 In my case, after removingadapter(TGGAATTCTCGGGTGCCAAGGAACTC), Do FastQC check, there is a peak between 32-33 : mainly compose of GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT; TGCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC; GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT. so should I just let it alone, and do the mapping step?

Use DESeq to find differentially expressed microRNAs this step, which .gtf file should I select for the reads count? This species total .gtf file including mRNA, tRNA, miRNA annotation, or just gtf file including miRNA only downloaded from miRBase?

ADD REPLY • link 10.1 years ago by xiaojuhu13 ▴ 150

0

Entering edit mode

First of all, you should find a peak somewhere around 24nt. I do not know the sequencing protocol, you use, but normally, when doing microRNA-Seq, molecules of length 18-30 (more or less) are filtered and sequenced. If you have a different protocol and a bigger length region (e.g. < 100) was sequenced, I would also filter for read of length between 18-30. Otherwise you might count microRNA precursors (~70nt).

You don't have to filter out reads overlapping with annotated mRNAs, or ncRNAs. They just map to their locus of origin and give you information. There are known microRNAs overlapping with snoRNAs or tRNAs. Read for example [1] or [2]. So, when deleting these reads, you will loose information. Also reads mapping to mRNAs... you might loose mirtrons (microRNAs in introns). I highly recommend to just map all reads. When filtering data, or mapping to just a subset (miRBase precursors), you will always loose information. Especially, if you want to predict new microRNAs. ;)

Your question about what .gtf file to use. What I would do:

Do some read counting fist. How many reads overlap with mRNAs? How many reads overlap with different kinds of ncRNAs (rRNAs, tRNAs, snoRNAs, etc.)? How many reads overlap with microRNAs? And so on.... That gives you a feeling for your data. Does it make sense, what you see?
If you want to find differentially expressed miRBase microRNAs, then you should count these reads only for DESeq. Take the miRBase file.
If you want to predict new microRNAs and do a differential expression analysis with these, then predict microRNAs and take this file for your DESeq analysis (or combine known with predicted).

ADD REPLY • link 10.1 years ago by David Langenberger 11k

0

Entering edit mode

you should find a peak somewhere around 24nt--- yes, there are two peaks, one is around 22bp, another is 32-33bp. Most 32-33bp peak (I listed), I blast against blastn, are mRNA.

In my case, for microRNA data. I have Hiseq2000, single-end 100bp sequencing.

Do some read counting fist.--I am a newbie for bioinformatic, I really want to do this step, but I do not know how to do it. I know how to collapse reads (cluster same reads and have their counts) using bwa_sam_converter.pl, but how to know reads are tRNA, snoRNA et. Using blast tools or some other softwares. I am sorry, can you list what do you do for this step?

I used mirdeep2 to get the known and novel microRNA, but the output result(.csv), there are only known miRNA counts, then how to combine known and novel microRNA together? which other softwares can do this step?

ADD REPLY • link 10.1 years ago by xiaojuhu13 ▴ 150

0

Entering edit mode

To count reads overlapping with mRNAs, try something like

bedtools intersect -abam sorted_bam_file.bam -b list_of_mRNA_exons.bed (get these from UCSC Table Browser) -u | wc -l

That returns all entries of your sorted bam file that overlap with exon of mRNAs. You need the -u option, because some entries may overlap with more then one exon and you don't want to count them twice. With wc -l you count the lines of the result from bedtools intersect.

The new bedtools version also supports BED12 format using the -split option, but I never tried it. If it works, you could also input a BED12 file of complete genes, instead of exons. But in your case, it doesn't matter, since you don't wanna quantify genes.

ADD REPLY • link 10.1 years ago by David Langenberger 11k

0

Entering edit mode

Thanks for your answers, I just found that in the latest version of bedtools. "-bed" option needs to be specified since the default output is binary format.

ADD REPLY • link 5.3 years ago by chengrui7 • 0

0

Entering edit mode

Hi I am just curious about filtering spike-in before maping to bowtie2. I also have miRNA fastqc data to analyze and want to know how to filter spike-ins. I created a separate *.fa file listing all the spike-ins used. How should I proceed with Bowtie2 ? e.g. spike-ins

UniSP100
TCCCAAATGTAGACAAAGCA
UniSP101
TGAAGCTGCCAGCATGATCTA
UniSP102
CAGCCAAGGATGACTTGCCGG

ADD REPLY • link 6.8 years ago by Björn ▴ 110

score 1 · Answer 2 · 2014-03-05

1

Entering edit mode

10.1 years ago

Charles Warden 8.2k

If you are aligning to a genomic reference, then I don't think you need to filter the results per se so much as you need to annotate them appropriately. In fact, a QC plot showing the distribution of reads by gene type is probably useful.

If your institution has a license, I think you can do all of these steps in Partek.

I haven't personally tried them out, but I think these are open-source alternatives:

A: microRNA reads alignment

ADD COMMENT • link 10.1 years ago by Charles Warden 8.2k

0

Entering edit mode

It also looks like eRNA is another (new) option (which is compared to miRDeep2 and miRspring in the paper): http://www.biomedcentral.com/1471-2164/15/176/abstract

ADD REPLY • link 10.1 years ago by Charles Warden 8.2k

score 0 · Answer 3 · 2014-03-05

0

Entering edit mode

10.1 years ago

Ashutosh Pandey 12k

You can use RFAM (ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT) database that contains sequences for non-coding RNAs. It contains sequences for rRNA, snRNA, snoRNA, tRNA etc. It also has a few sequences for miRNA that you would like to remove before your analysis. The Rfam.fasta.gz file contains sequences for all the organisms, so you may have to grep sequences specific to your species.

ADD COMMENT • link 10.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I use miRExpress (http://mirexpress.mbc.nctu.edu.tw/) or perform the alignment using SHRiMP2 "mirna" mode against the miRBase.

ADD REPLY • link 10.1 years ago by Ashutosh Pandey 12k

score 0 · Answer 4 · 2014-03-05

0

Entering edit mode

10.1 years ago

David Langenberger 11k

You can also try miRanalyzer. It maps the reads against Rfam and other ncRNA databases, quantifies known microRNAs and predict new microRNA candidates.

There is an online version: http://bioinfo5.ugr.es/miRanalyzer/miRanalyzer.php

and a stand-alone version: http://bioinfo5.ugr.es/miRanalyzer/standalone.html

ADD COMMENT • link 10.1 years ago by David Langenberger 11k

0

Entering edit mode

Sorry David in case I there is no miRBase microRNAs for Aspergillus fumigatus, can I use another gff3 contains cDNA, proreins and ncRNA or what should I do??

thank you

ADD REPLY • link 7.6 years ago by zizigolu ★ 4.3k

score 0 · Answer 5 · 2017-06-08

0

Entering edit mode

6.9 years ago

k.kathirvel93 ▴ 300

Hi everyOne, This conversation is more helpful for me, coz the same question what i have, is discussed here. I have an another doubt, can anyone help me? How can i exclude microRNA sequences which are below 18 and above 30 N. Is there any Tools ?

ADD COMMENT • link 6.9 years ago by k.kathirvel93 ▴ 300

0

Entering edit mode

Can you post this as another question? Also what you need is a length filter which has nothing to do with miRNA, and has already been covered by several posts on biostar.

ADD REPLY • link 6.9 years ago by Rohit ★ 1.5k