miRna-Seq, featureCounts problem
2
2
Entering edit mode
4.9 years ago
AngAdd ▴ 20

Hi everyone!

I'm new to Bioinformatics. I need to write a pipeline for miRNA-Seq. The situation is the following: I have three normal samples and three cancer samples. I downloaded the stem-loop and the mature human sequences from miRBase, then I performed the mapping using Bowtie. Now I need to obtain the read counts for each sample; I'm trying to use featureCounts with mirBase hsa.gff3 file and my bam file but featureCounts doesn't work! It returns this:

Assigned    0 

Unassigned_Unmapped 0

Unassigned_MappingQuality   0

Unassigned_Chimera  0

Unassigned_FragmentLength   0

Unassigned_Duplicate    0

Unassigned_MultiMapping 0

Unassigned_Secondary    0

Unassigned_Nonjunction  0

Unassigned_NoFeatures   1831739

Unassigned_Overlapping_Length   0

Unassigned_Ambiguity    0

This is my command:

> featureCounts -t miRNA -g 'Name' -a my_path/hsa.gff3 -o my_path/my_bam_file

I have also tried with -g ID but the result is the same.

I have done a lot of researches but I don't understand where is the problem.

Thank you.

miRNA featureCounts miRBase • 5.0k views
ADD COMMENT
2
Entering edit mode
4.9 years ago

featureCounts allows you to count the number of reads mapped to a whole genome sequence using a gff file that contains the co-ordinates of your features (in this case miRNAs).

However, you have not mapped to the whole genome, you have mapped only to a set of sequences derived from miRNAs themselves. As such your bam file will say things like "Read 1 is mapped to base 1 of miR-29a", where as featureCounts is looking for something like "Read1 is mapped to base 34,234,546 of chromosome 5, and miR-29a covers bases 34,234,546- 34,234,566".

In order to get the numbers of reads aligned to each miR after having mapped to a miR sequence collection, simply use samtools idxstats like so:

samtools idxstats my_path/hsa.gff3 -o mirna/cancer/featureCountsmy_path/my_bam_file
ADD COMMENT
0
Entering edit mode

Hi, thank you for your answer!

I tried to run your command line but I had this error message:

samtools idxstats: fail to load index for "my_path/hsa.gff3"

Then I tried to run this:

samtools index 4T_mapped.sorted.bam

samtools idxstats my_path/4T_mapped.sorted.bam

and I have obtained a result like this:

hsa-miR-576-3p 22 5 0

hsa-miR-140-5p 22 15 0

hsa-miR-522-5p 22 1 0

hsa-miR-1298-5p 22 0 0

hsa-miR-133a-3p 22 1 0

hsa-miR-4743-3p 21 0 0

hsa-miR-557 23 0 0

hsa-miR-548ao-3p 22 0 0

hsa-miR-5088-5p 24 1 0

hsa-miR-4649-5p 24 0 0

hsa-miR-665 20 0 0

.....

Is it what I'm searching for?

ADD REPLY
1
Entering edit mode

Yes. The first column is the length. The second column is the number of reads mapped to that miRNA, the thrid column is the number of unmapped reads on that miRNA (this is a pretty meaningless number, and will normally be 0).

ADD REPLY
0
Entering edit mode

Hi i.sudbery,

I trust you are well during these difficult times. I was wondering if you could advise on the idxstats tool you mentioned in a previous post. I aligned my small-RNA seq to miRBase mature miRNA assembly, and naturally I now want to do further analysis. This requires read counts, but as you pointed out previously, when not aligning to the genome conventional count tools that rely on annotations struggle a bit. My question, if I have made sure my aligned BAM file only contains the reads I want (in my case unique and 1 instance of each multi-mapper), can I simply sort the BAM and use idxstats tool and use the 3rd column as the read counts for each miRNA?

Thank you for you help. Kind regards

ADD REPLY
1
Entering edit mode

Yes. If the BAM file is correctly filtered, then you should just be able to use the 3rd column from the output of samtools idxstats mybam.bam

ADD REPLY
0
Entering edit mode

Brilliant. Thank you kindly!

ADD REPLY
0
Entering edit mode
4.9 years ago
Buffo ★ 2.4k

In addition to i.sudbery answer, I recommend you to map the reads against the genome, not against the mature/hairpin sequences. This second pipeline cause biased analysis since some of the actually known miRNAs are redundant in sequence (especially in mouse).

ADD COMMENT
0
Entering edit mode

The fact that the sequences are redundant is exactly why we should map to the hairpin sequences, rather than the genome. If miR-X is encoded in two places in the genome, then a read from miR-X will always be multi-mapped. Pipelines will then either ignore this read, or count it twice (once for miR-Xa and miR-Xb). If you map to a non-redundant set of mature sequences, then you will just get one, single mapping, hit against miR-X.

ADD REPLY
0
Entering edit mode
 If miR-X is encoded in two places in the genome, then a read from miR-X will always be multi-mapped.

That is exactly why you should use bowtie instead of bowtie2 and sequences of high quality.

Pipelines will then either ignore this read, or count it twice (once for miR-Xa and miR-Xb)

I prefer to ignore redundant sequences (if they are) in a genome than count them erroneously because they are in a set of "described" sequences.

If you map to a non-redundant set of mature sequences, then you will just get one, single mapping, hit against miR-X.

mirbase does not contain a set of non-redundant sequences, fasta files contains all hairpin and mature mirnas.

My point is, I consider correct to map the reads against the entire genome because smallRNAseq libraries are not specific to miRNAs, they sequence everything in your sample that is small in length. If you map your reads against to small set of sequences (such as mirbase files), yes, your mapping summary will be higher because you will count all redundant reads or even low-quality reads will increase the mapping. Even creating the nr-file you are deviating your analysis supposing that there are not new potential miRNAs.

ADD REPLY
0
Entering edit mode

So given a miRNA with two identical copies in the genome, you rather say the cell contains none of that miRNA, rather than say it contains some (and a no less accurate an estimate how much compare to a non-duplicated miRNA), but we have no way of knowing which copy it came from?

My point is, I consider correct to map the reads against the entire genome because smallRNAseq libraries are not specific to miRNAs, they sequence everything in your sample that is small in length.

Fair enough, we have been mapping to Rfam rather than mirbase. But even so, if you care about miRNAs, do you really care if there are some non-miRNA reads in your dataset that fail to map? Even if you map to the genome, you are unlikely to quantify those reads that don't map to annotated miRNAs unless you are specifically looking for them.

ADD REPLY
0
Entering edit mode
So given a miRNA with two identical copies in the genome

No sense! it does not exist. You don't understand what redundant sequence is.

but we have no way of knowing which copy it came from?

Yes, we can, mapping to the genome.

Fair enough, we have been mapping to Rfam rather than mirbase. But even so, if you care about miRNAs, do you really care if there are some non-miRNA reads in your dataset that fail to map?

Again, no sense!. After trimming smallRNAseq reads by quality we obtain more than 95% of mapped reads.

Even if you map to the genome, you are unlikely to quantify those reads that don't map to annotated miRNAs unless you are specifically looking for them.

That's my point, I do both things. Quantify known miRNAs and look for new ones, something like mirdeep2 proposes.

Anyways AngAdd, please tell us about your results of rt-qpcr validation after quantifying against mirbase files, these results will be very interesting.

ADD REPLY
1
Entering edit mode

No sense! it does not exist.

I'm afraid that's not true. For example, the mature sequence miR-29b-3p is expressed as identical sequences from two different locations, miR-29b-1 (chr7:130877459-130877539:(-)) and miR-29b-2 (chr1: 207802443-207802523:(-)). The mature product from these two hairpins is identical (UAGCACCAUUUGAAAUCAGUGUU) and thus indistinguishable from sequence. This is why the miBase fasta only contains one record for miR-29b-3p despite their being miR-29b-3p-1 and miR-29b-3p-2 in the genome.

That's my point, I do both things.

Good for you, but I'd propose that most people studying Humans or well annotated model organisms are unlikely to care about loci that are unannotated at this point, unless small RNA biology is their particular area of study.

ADD REPLY
0
Entering edit mode
The mature product from these two hairpins is identical (UAGCACCAUUUGAAAUCAGUGUU) and thus indistinguishable from sequence

That is not a new debate, it is an old question about isomirs, and again, if you map your reads against mirbase files you will count multi-hit reads as single. If you consider it correct ¯(°_o)/¯.

I'd propose that most people studying Humans or well annotated model organisms are unlikely to care about loci that are unannotated at this point, unless small RNA biology is their particular area of study.

Well annotated genomes do not mean they are completely annotated, and especially about new RNAs such as mirnas. Probably these unknown miRNAs are the answer to your biologic question, Don't you think?. But do not worry that is what researchers like to do, they will do for you.

Thanks for this nice discussion.

ADD REPLY
1
Entering edit mode

if you map your reads against mirbase files you will count multi-hit reads as single. If you consider it correct ¯(°_o)/¯.

Yes, because I care about how many molecules with the sequence UAGCACCAUUUGAAAUCAGUGUU are floating around in the cytoplasm. I couldn't care less if they came from a single genomic locus, 2 loci or 200 loci. A genome mapping pipeline that discards multi-mappers would say there was no miR-29b-3p in the cell, even if there was loads.

ADD REPLY

Login before adding your answer.

Traffic: 1942 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6