Question

Creating bins based on scaffold size file

0

Entering edit mode

5.6 years ago

ggman ▴ 80

Greetings all,

I am attempting to create bins/windows based on a genome file which contains data like the one below. I was wondering if anyone has encountered a program that can create windows for you based on intervals such as 100k, 10k, 50k.. etc. While I can simply manipulate VCFtools PI output to create the desired template, I would be interested in knowing if there is a package or program out there that can do this with data similar to the one below. I am not familiar with any tools that can do this but it would helpful to be pointed in the right direction.

Sample Data

Chromo    Size
Chrm1     500000
Chrm2     400000

Desired outcome

Chromo    Start     End
Chrm1     1            100000
Chrm1     100001            200000
Chrm1     200001            300000
Chrm1     300001            400000
Chrm1     400001            500000
Chrm2     1            100000
Chrm2     100001            200000
Chrm2     200001            300000
Chrm2     300001            400000

Best,

GG

bin Assembly sequence alignment snp • 1.8k views

ADD COMMENT • link updated 5.6 years ago by Pierre Lindenbaum 161k • written 5.6 years ago by ggman ▴ 80

3

Entering edit mode

5.6 years ago

benformatics 3.9k

R version:

I'm not sure what type of file you are starting with but here is an example with a human genome .vcf file.

library(GenomicAlignments)
## the next library is only needed to import vcfs
library(VariantAnnotation)

## import your vcf files with the chromosome information
vcf <- readVcf('your_file.vcf')

## pull out the genome information (e.g. chr names and lengths) for your vcf file
genome.info <- seqinfo(rowRanges(vcf))

## tile that genome into bins of 10kb
tile.size <- 10000
tiled.genome <- tileGenome( genome.info, tilewidth=tile.size)

## export these ranges as a bedfile
library(rtracklayer)
export.bed(unlist(tiled.genome),'tiled_genome.bed')

##  you can also look at the contents of the object
> tiled.genome
GRangesList object of length 309984:
[[1]]
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   1-10000      *

[[2]]
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
  [1]     chr1 10001-20000      *

[[3]]
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
  [1]     chr1 20001-30000      *

...
<309981 more elements>
-------
seqinfo: 286 sequences from an unspecified genome

ADD COMMENT • link 5.6 years ago by benformatics 3.9k

2

Entering edit mode

5.6 years ago

Pierre Lindenbaum 161k

$ bedtools makewindows

ADD COMMENT • link 5.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

5.6 years ago

h.mon 35k

I completely misread your question, go for the other (correct) answers.

~~faSplit from the UCSC command-line utilities~~

faSplit how input.fa count outRoot where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'. Files split by sequence will be broken at the nearest fa record boundary. Files split by base will be broken at any base. Files broken by size will be broken every count bases.

ADD COMMENT • link 5.6 years ago by h.mon 35k

score 5 · Accepted Answer · 2018-08-31

With BEDOPS bedops --chop:

$ export WINDOWSIZE=100000
$ tail -n+2 in.txt | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop ${WINDOWSIZE} - > out.0index.bed

Adjust WINDOWSIZE to the desired bin size. The file out.0index.bed is zero-indexed. This format is easier to use for set and map operations with bedops and bedmap etc.

But if you want a one-indexed text file with a header:

$ export WINDOWSIZE=100000
$ tail -n+2 in.txt | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop ${WINDOWSIZE} - | awk -vOFS="\t" '{ $2 += 1; print $0; }' | cat <(head -1 in.txt) - > out.1index.txt