Peak Detection In Intergenic Regions In Rna-Seq Data
4
7
Entering edit mode
13.9 years ago
Michael 54k

I am currently working on RNA-seq data. One of the aims is to scan intergenic regions for transcription of e.g. small RNAs, missing transcripts. I have aligned to the reference genome and filtered the reads (bwa + blat) and can compute overlaps and coverage in R/Bioconductor and IRanges, that works. The problem is methodical: how to best detect a peak in the sequencing coverage in an intergenic region. My procedure now is a rather naive:

  • Find regions of coverage > arbitrary threshold (say 30), with width > min.width (e.g. 30 bases)
  • Remove all regions overlapping with genes/coding sequences(maybe + some 10 bp for promoters/UTR)

    I am doing this using views(), coverage(), and findOverlaps() from the IRanges package in R.

That way I can find something, but the point I do not like is the arbitrary cutoff, because of this I am also missing a lot of candidate peaks or if I set the threshold too low I will get lots of false positive. The data is not strand specific, and I have different samples but no biological replicates.

Do you have a better suggestion for selecting a possibly dynamic cutoff or for finding peaks in sequencing data, doesn't have to be in R, just a hint to an algortihm will do.

rna peak-calling bioconductor coverage • 9.8k views
ADD COMMENT
5
Entering edit mode
13.9 years ago

A good place to get started would be adapting some of the statistical frameworks using in ChIP-seq analysis. Bioconductor has some higher level tools which will help with the normalization and peak finding. The slides and code from the Seattle workshop late last year. UC Riverside's awesome documentation should be helpful. You'll want to avoid the +/- strand requirements since your underlying data is different, but the general windowing approaches sound similar to what you've attempted.

As far as setting cutoffs, there is always a trade off between false positives and negatives. One approach, given that you are tackling a custom problem, is to categorize a set of regions by hand as yes/no answers according to some by eye criteria. Assuming this test set is relatively representative, you can then use it to measure your sensitivity and specificity at different cutoffs. Biconductor's ROC package might be useful here.

The alternative is to look at the distribution of scores and select a cutoff based on what percentage of the regions you expect to capture based on your biological understanding.

ADD COMMENT
3
Entering edit mode
13.0 years ago

You might have a look at using Scripture.

ADD COMMENT
0
Entering edit mode

Thanks Sean, I will definitely look into this tool, too, though the 'documentation' is very rudimentary, but that might improve. After taking a quick glance, it seems to me that that Scripture contains nothing related my problem I couldn't do easier with R/Bioconductor, but maybe you can point this out?

ADD REPLY
2
Entering edit mode
7.2 years ago

If you are trying to find deferentially expressed regions that don't necessarily match annotated regions, I would recommend derfinder. It's pretty quick (at least compared to trying to perform a novel assembly with cufflinks), although I typically would at least start by focusing on annotation-based quantification strategies.

However, it won't probably won't work as well with smaller regions (but, unless you are specifically using a small RNA-Seq protocol, your library size is probably larger than 150 bp and isn't really suitable for quantifying smaller RNA sequences).

I also haven't tested derfinder without replicates, and I would generally recommend the use of biological replicates for better quality results.

ADD COMMENT
1
Entering edit mode
13.0 years ago
Alper Yilmaz ▴ 100

The question is already answered but still I wanted to offer an alternative for whomever reached this page for his/her needs.

The USeq software, which was designed for ChIP-Seq initially, can be used for determining regions that are enriched in RNA-Seq as well. This might eliminate the arbitrary cut-off problem. Then, you can easily find regions that are not intergenic by using bedtools.

I hope this helps.

ADD COMMENT
0
Entering edit mode

Will this software work for Illumina data?

ADD REPLY

Login before adding your answer.

Traffic: 2302 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6