How to plot density between 2 bed files?
2
2
Entering edit mode
7.1 years ago
shahryary ▴ 50

I have 2 bed files (Stemloops chrY.bed and heart_fetal chrY.bed ) just simply in 3 column (chrname, start,end), I want to plot density in R to compare them. Also I want to set windows size like per 1mb, 10mb, ...

I started with "GenomeGraphs" from bioconductor tools but couldn't get any beautiful graph from there!

In other words I'm planning to plot a graph like picture above:

ploting

Also, I read some materials here that I included in list above, so I tried to follow some instruction but couldn't plot image which I supposed to have it.

Plotting Read Densities And Gene Tracks In R

Drawing Chromosome Ideograms With Data

So, How I can plot density in R?

R SNP • 5.8k views
ADD COMMENT
1
Entering edit mode

Where is your signal column?

ADD REPLY
0
Entering edit mode

In other words, if you want to plot signal, you need a signal to plot.

ADD REPLY
1
Entering edit mode

I don't have signal column, basically as I explained I want to plot density! I have some results like this: http://prntscr.com/ehfj2z but my question was how I can compare density of two files and create plot with more details.

ADD REPLY
5
Entering edit mode
7.1 years ago

Perhaps you want to count the number of elements over windows. This would be a different measurement from the plot you include in your question, which measures DNaseI accessibility, for which you will need a source of DNaseI signal (like reads from a DNaseI-seq assay, etc.).

Nonetheless, if you just want to count generic elements over genomic intervals, here is a way to do this efficiently.

Measuring signal

I will assume you have UCSC Kent Tools and BEDOPS toolkits installed, which you can get via Homebrew (brew install kent-tools and brew install bedops) or similar means, depending on your platform.

  1. Get the chromosome bounds for your reference genome of interest and parse them into BED. For example, here is how to get a BED file of chromosome extents for hg38:

    $ fetchChromSizes hg38 | awk 'BEGIN{OFS="\t"}{print $1,"0",$2}' | sort-bed - > hg38.bed
    

    Replace hg38 with hg19 or your genome build of choice, if desired.

  2. Use bedops to chop these bounds into, say, 100k-sized windows, sliding across the genome every 10k:

    $ bedops --chop 100000 --stagger 10000 hg38.bed > hg38.chop100k.stagger10k.bed
    

    You would adjust window and sliding parameters, depending on how fine or coarse you want to measure counts (signal) and how much you want them smoothed. Smaller window and sliding values will give you finer and smoother signal, while larger values will give coarser and rougher signal.

    Note: Since you are only interested in chrY (given the description of your inputs) you can focus bedops on chrY directly to save a considerable amount of time, by adding --chrom chrY:

    $ bedops --chrom chrY --chop 100000 --stagger 10000 hg38.bed > hg38.chrY.chop100k.stagger10k.bed
    
  3. Sort your BED files, if needed, e.g.:

    $ sort-bed stemloops.unsorted.chrY.bed > stemloops.chrY.bed
    $ sort-bed heart_fetal.unsorted.chrY.bed > heart_fetal.chrY.bed
    

    Make sure to sort your inputs, if you do not know their sort order.

  4. Map your files of interest to this windowed chromosome, using bedmap. For example, use the --count operator to count the number of elements from stemloops.chrY.bed that fall over the sliding windows. As with bedops, to save time, you can add --chrom chrY to bedmap to focus work on chrY:

    $ bedmap --chrom chrY --echo --count --delim '\t' hg38.chop100k.stagger10k.bed stemloops.chrY.bed > hg38.chop100k.stagger10k.stemloopsCount.bed
    

    Now you have a four-column BED file, where you have intervals of sliding windows and the number of elements from the mapped file that fall over those windows. This is also called a BedGraph file format.

    You would repeat this for your fetal heart input:

    $ bedmap --chrom chrY --echo --count --delim '\t' hg38.chop100k.stagger10k.bed heart_fetal.chrY.bed > hg38.chop100k.stagger10k.fetalHeartCount.bed
    

Visualizing signal

You can import this and other similarly-scored files into UCSC Genome Graphs to plot their signals over chrY (or other chromosomes). You can layer multiple signals and see the chromosome at a glance, and then zoom in a region of interest within the UCSC Genome Browser. You can export a PDF snapshot from both the Genome Graphs and Genome Browser tools.

If you want to use R, you could use read.table() to bring in the data and ggplot2 to plot that data, e.g.:

> stemloops <- read.table("hg38.chop100k.stagger10k.stemloopsCount.bed", header=F, sep="\t", col.names=c("chr", "start", "stop", "count"))
> fetal_heart <- read.table("hg38.chop100k.stagger10k.fetalHeartCount.bed", header=F, sep="\t", col.names=c("chr", "start", "stop", "count"))
> df <- data.frame(position=stemloops$start, stemloops=stemloops$count, fetal_heart=fetal_heart$count)
> library(ggplot2)
> ggplot(df, aes(x=position, y=signal, color=variable)) + geom_line(aes(y=stemloops, col="stemloops")) + geom_line(aes(y=fetal_heart, col="fetal_heart"))

Or you could use any number of other visualization libraries in R, like gviz.

ADD COMMENT
0
Entering edit mode

Do you mind adding output of ggplot?

ADD REPLY
1
Entering edit mode

Can you post your files somewhere? I could generate some random data but it won't look as nice (or relevant/representative).

ADD REPLY
1
Entering edit mode
ADD REPLY
1
Entering edit mode

I had to fix Stem_chrY.bed before mapping:

$ awk '{if ($2==$3) $3++; print $0}' Stem_chrY.bed | sort-bed - > Stem_chrY.sorted.bed

Otherwise, I seem to get a similar result:

plot

ADD REPLY
0
Entering edit mode

Here's an example of graphing bedmap output on UCSC Genome Graphs, using the hg38 reference genome:

enter image description here

GSM signal is in blue, Stem signal is in red.

(Are your datasets from hg38?)

ADD REPLY
0
Entering edit mode

No, I'm using hg19. I changed all your variables from hg38 to hg19 and got that results.

ADD REPLY
0
Entering edit mode

Oops, from the filenames I assumed you had the actual data, just wanted to see the ggplot output, how close is to the OP's expected output.

ADD REPLY
5
Entering edit mode
7.1 years ago
bernatgel ★ 3.4k

If you prefer a pure R solution, you can user karyoploteR. The package is still in Bioconductor devel, so you will need R-devel to use it until mid April.

You can compute the density and create the plot in a few lines:

library(karyoploteR)

window.size <- 1e5

#read the data with regioneR's toGRanges
gsm <- toGRanges("./GSM_chrY.bed")
stem <- toGRanges("./Stem_chrY.bed")

#create the plot
kp <- plotKaryotype(chromosomes="chrY")

#create the bins and compute the density
windows <- unlist(tileGenome(setNames(kp$chromosome.lengths, "chrY"), tilewidth = window.size))
stem.dens <- sapply(windows, function(w) {return(numOverlaps(w,stem))})
gsm.dens <- sapply(windows, function(w) {return(numOverlaps(w,gsm))})
max.dens <- max(max(stem.dens), max(gsm.dens))

#And plot it
kpLines(kp, data=windows, y=stem.dens, col="blue", ymax = max.dens)
kpLines(kp, data=windows, y=gsm.dens, col="red", ymax = max.dens)
kpAxis(kp, ymin=0, ymax=max.dens)

enter image description here

It can also create an image with the full ideogram (remove the "chromosomes" parameter from the plotKaryotype call), if you want, and you could customize the plot in many ways (including adding a second axis or plotting below the ideogram as in your example).

enter image description here

ADD COMMENT
0
Entering edit mode

Is there a reason for the spike in signal at the tail end of the chromosome?

ADD REPLY
0
Entering edit mode

As far as I can see in GSM there are 4117 regions between positions 58819693 and 59363443 and in STEM there are 6958 regions between 58819362 and 59363565, so these peaks are real.

I think the difference with your plots is due to the differences between GRCh37 and GRCh38. ChrY is shorter in GRCh38 (finishes at 57227415 instead of 59,373,566) and so these regions are out of your plot.

ADD REPLY
0
Entering edit mode

Agreed, I wasn't sure of the target genome build and just guessed (wrongly, it seems). Interesting tool. Are there options to pick a different genome?

ADD REPLY
0
Entering edit mode

Sure, you can specify a different genome when creating the karyoplot with plotKaryotype(genome="hg38") or plotKaryotype(genome="mm10"), for example. It has a small cache with a few frequently used organisms ans it the cache misses, it will use the available BSGenomes installed and download the cytoband data from UCSC. And if you need an organism that's not on UCSC or Bioconductor, you can manually specify a custom genome. It's quite flexible in this regard.

ADD REPLY
0
Entering edit mode

Why the number of overlaps is not same in two methods?

I can't install R-devel in server, so I decided calculate overlaps then pass the values to "karyoploteR" for plotting.

I calculated number of overlaps with Alex method (100k for chop & stagger) and then with "karyoploteR" library.

The number of elements are same - 594 elements.

but in some values they don't have same values!

for example: Number of overlaps for Alex method (first 23 record):
[1] 166 697 1330 755 618 588 440 571 652 1694 50 380 651 455 1701 1516 876 503 506 642 522 1399 852

form "karyoploteR" method (gsm.dens):

[1] 166 697 1318 766 614 591 438 572 651 1696 50 378 652 451 1701 1521 873 506 505 638 526 1397 849

Note: I used just "GSM_chrY.bed" to compare results.

Why the results is not same?

ADD REPLY
2
Entering edit mode

I don't know how karyoploteR works, but BEDOPS bedmap counts any GSM (or Stem, etc.) elements that overlap the windows if there is at least one base of overlap.

This may be too relaxed. It is possible that an element could overlap two or more windows if it is large enough and placed in the "right" spot — you could get double counts. Here is a cartoon to illustrate the double-count problem:

enter image description here

You can modify the bedmap statement to do counting only when an GSM (or Stem, etc.) element falls 51% or more over a reference window element:

$ bedmap --chrom chrY --fraction-map 0.51 --echo --count --delim '\t' windows.bed GSM.bed > counts.GSM.bed

If the elements being mapped are guaranteed to be more than twice the size of the overlap between windows (here, >20kb -- twice the 10kb step from one window to the next), using a --fraction-map value of 51% will force a mapped element to overlap only one or the other window, and only get counted once.

I'm not sure how other libraries deal with this. Keep in mind that having overlapping windows is intended to smooth signal as a result of these multi-count situations, so if you're trying to avoid double or triple etc. counts and get an exact count, you may as well do away with sliding (overlapping) windows altogether, and just have disjoint windows:

$ bedops --chop 100000 genome.bed > genome.disjoint100kWindows.bed

Again, you would need to use the 51% threshold as described above, so that a mapped element that straddles two windows only gets assigned to one or the other.

ADD REPLY
1
Entering edit mode

I agree with Alex Reynolds that the small differences might come from reads counted multiple times. In the example code I've provided the windows are not overlapping but there's no check preventing a read to be counted twice in a similar way to bedops. In any case, if you need to ensure that every read will be counted just once you could tweak the call to numOverlaps adding the parameter min.pctB to specify the minimum overlap fraction of the reads in a similar way to the fraction-map parameter in bedops. In general, you can use most of the parameters of regioneR::overlapRegions to fine tune the counting.

ADD REPLY

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6