Finding and plotting gaps in a de novo assembly in R
13 months ago
alslonik • 90
Israel

Hi,

I am working on a genome assembly and am trying to annotate and visualize the regions of consecutive Ns, i.e. the gaps (NNNN...) in my newly assembled genome.

I tried to do this with the letterFrequencyInSlidingView function from Biostrings, but it takes forever (more than half an hour) for a single scaffold, and I have many of them. Afterwards I build a data frame from the output matrix and plot it to see where the Ns are consecutive. The plotting is just as slow.

The command I use is:

Freq_N_758 <- sapply(chromium.assembly["758"], letterFrequencyInSlidingView, 1, "N")

I am sure I am missing something important here and doing this the wrong way. What is the correct way to do it in R?

Many thanks, Alex

14 months ago
National Institutes of Health, Bethesda…

Try something like this using the Biostrings Bioconductor package.

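library(Biostrings)
x <- DNAString("ACTGNNTTGGNNNNAACTGC")
# Mask every run of N in the sequence
y <- maskMotif(x, "N")
# gaps() inverts the mask, so the resulting views are the N runs themselves
z <- as(gaps(y), "Views")
ranges(z)
# Start, end and width of each N run as a data frame
as.data.frame(ranges(z))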

The final output from above will be:

  start end width
1     5   6     2
2    11  14     4

This should run nearly instantaneously, even for very long sequences.
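
Since the question mentions many scaffolds and a plot, here is a minimal sketch of how the same idea might be applied to every sequence in a DNAStringSet and then drawn with base graphics. The object and function names (assembly, find_gaps, gap_table) and the toy scaffolds are made up for illustration; a real assembly would presumably be loaded with readDNAStringSet() instead.

```r
library(Biostrings)

# Toy stand-in for an assembly (hypothetical names and sequences).
assembly <- DNAStringSet(c(scaf1 = "ACTGNNTTGGNNNNAACTGC",
                           scaf2 = "NNNACGTACGT"))

# Hypothetical helper: N runs of one scaffold as a data frame,
# using the maskMotif()/gaps() approach from the answer above.
find_gaps <- function(seq_name) {
  masked <- maskMotif(assembly[[seq_name]], "N")
  gap_ranges <- ranges(as(gaps(masked), "Views"))
  data.frame(scaffold = rep(seq_name, length(gap_ranges)),
             start = start(gap_ranges),
             end = end(gap_ranges),
             width = width(gap_ranges))
}

# One row per gap, across all scaffolds.
gap_table <- do.call(rbind, lapply(names(assembly), find_gaps))
gap_table

# Draw each scaffold as a thin grey line and each gap as a thick red segment.
y_pos <- seq_along(assembly)
plot(NULL, xlim = c(1, max(width(assembly))),
     ylim = c(0.5, length(assembly) + 0.5),
     xlab = "Position (bp)", ylab = "", yaxt = "n")
axis(2, at = y_pos, labels = names(assembly), las = 1)
segments(1, y_pos, width(assembly), y_pos, col = "grey70")
gap_y <- match(gap_table$scaffold, names(assembly))
segments(gap_table$start, gap_y, gap_table$end, gap_y, col = "red", lwd = 6)
```

Working with one row per gap rather than one value per base keeps both the table and the plot small, which is probably why this is so much faster than the sliding-window approach. For real scaffolds you would likely want to plot only the largest ones, or filter gap_table by width first.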

That's exactly what I needed. Thanks!

Feel free to "accept" the answer so that others know that the question is answered. : )