Question

Random shuffling of features leaving gene models intact

2

Entering edit mode

9.9 years ago

Michael 54k

I am looking for a tool that can randomly shuffle gff features into intergenic regions, but leaving the gene-models 'intact', so that at least all features of a gene (transcript, exon, CDS, etc.) are placed on the same contig and related features are placed fully inside the interval of their parent region. Bedtools shuffle doesn't seem to do that, I am trying:

shuffleBed -i genes.gff3 -excl genes.gff3 -g chromsizes.txt -f 0

This command distributes sub-features to different contigs and leads to invalid gene-models, if I add -chrom, features are placed on the same contig, but not all features can be placed at all and the resulting gene-models are still not valid. Does anyone maybe have some R-code for this use-case?

gff genome shuffle bedtools • 3.8k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by Michael 54k

0

Entering edit mode

Thanks for pointing this out - it is a feature that we should certainly provide.

ADD REPLY • link 9.9 years ago by Aaronquinlan 12k

1

Entering edit mode

9.9 years ago

Michael 54k

Here, is my first naive attempt in R using a loop over a GRanges object. It works but is painfully slow as expected. Takes about 110 seconds for 1000 features on my computer. Afaik, one cannot use apply with GRanges. The way it is implemented now works almost without re-drawing, except for too small contigs. It might be faster to draw and assign everything in one go though using apply and then just re-draw features overlapping exclude. Any comment is welcome.

require(GenomicRanges)

shuffle <- function(genome, excludes=NULL, landmark.prob=TRUE,
                   keep.gene.models=TRUE, max.retries=10000) {
### make a lookup vector from the Parent character list
  if (! is.null(genome$Parent))
    lookup.parent <- unlist(sapply(genome$Parent, function(x)if (length(x) == 0) 'NA' else x ))
  rdata <- as(excludes, "RangedData")
### Foreach feature that doesn't have a parent
### calculate gaps for placing objects:
  gaplist <- gaps(ranges(rdata), start=1, end = seqlengths(genome))

### END preapations
### iterate only over the top-level
  myset <- if ( keep.gene.models && ! is.null(genome$Parent))
    which(lookup.parent == "NA")
  else
    1:length(genome)
  for (i in myset) { 
    f <- genome[i]
### select a contig or chromosome:
    landmark <- NULL
    for (c in 1:max.retries) {
      prob <- if (landmark.prob)
        (seqlengths(genome))/max((seqlengths(genome)))
      else NULL
      landmark <- sample(seqlevels(genome), 1,  prob=prob )
### redraw if landmark doesn't have enough free space (gene length > max gap length)
      if (max(width(gaplist[[landmark]])) >= width(f)) break;
    }
    if (is.null(landmark)) {
      warning ("couldn't place feature");
      next;
    }
### end choose chromosome       
### choose a random start position from landmark (at least gene-length away from any exclude region)
### flank position intervals on start position with - feature width will ensure that.
### we have already chosen landmark such that there is enough space.
    new.start <- as.numeric(sample(as.integer(flank(gaplist[[landmark]], -1*width(f))), 1))

  #new.start <- (sample(as.integer(gaplist[[landmark]]), 1))

    offset <- new.start-start(f)
### change feature coordinates
    seqnames(genome[i]) <- factor(landmark, levels=seqlevels(f)) 
    genome[i] <- shift(genome[i], offset)
### collect all dependent features and adjust their location to the new origin of the parent
    if (keep.gene.models & ! is.null(genome$Parent)) {
     children <- get.children.i(i, genome, lookup.parent)   
      if (!is.null(children)) {
        seqnames(genome[children]) <- factor(landmark, levels=seqlevels(genome))

        genome[children] <- shift(genome[children], offset)
      }
    ## NEXT feature
    }
   }
  return (genome)
}

## look for all children recursively
## this is slightly faster than using the objects
get.children.i <- function(i, allfeatures, lookup.parent) {
  m <- match(allfeatures[i]$ID, lookup.parent)
  if is.na(m))
    return(c())
  else {
    ch <- get.children.i(m, allfeatures, lookup.parent)
    return (c(m, ch))
  }
}

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Michael 54k

0

Entering edit mode

If you are able to represent the gene models in BED12 format, a relatively simple change could be made to the bedtools `shuffle` command. Let me know if that is an option.

ADD REPLY • link 9.9 years ago by Aaronquinlan 12k

0

Entering edit mode

Hi Aaron, I think it is possible to convert from GFF3 to BED12. It might be nice to have this option (we will be 2 here in Bergen who would use it), but please only if it is not too much effort.

ADD REPLY • link 9.9 years ago by Michael 54k

0

Entering edit mode

It is 90% of the way there already, so let me try to have a look in the next day or two. Will get back to you.

ADD REPLY • link 9.9 years ago by Aaronquinlan 12k

0

Entering edit mode

Hi Aaron, How could I use the bedtools shuffle to solve this problem?

ADD REPLY • link 8.8 years ago by Richard Yang • 0

0

Entering edit mode

Hi, OP here, I haven't checked if it is working now in bedtools, but in the meantime, you can use my perl script from the other answer.

ADD REPLY • link 8.8 years ago by Michael 54k

0

Entering edit mode

Regarding running apply on GRanges objects, you just have to split() them first and then run lapply on the resulting list. Yes, this is kind of annoying and I'm pretty sure I'd seen multiple requests for direct support of GRanges objects by mapply or apply.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

score 2 · Accepted Answer · 2014-06-03

Here's my second attempt, this time in Perl. It does more or less what I want with good speed, maybe I will add a few more options and documentation in the future. This solution is really fast enough, I tried to shuffle 100,000 features thereof 30,000 top-level genes over 30,000 contigs in a few seconds, more or less the time it takes to do the parsing and IO. Feel free to comment on possible improvements bugs, horrific code...

http://mdondrup.github.io/shuffleGff/

Hint, to place genes only in intergenic regions use the same gff file as exclude and input.