Creating bins based on scaffold size file
4
0
Entering edit mode
5.6 years ago
ggman ▴ 80

Greetings all,

I am attempting to create bins/windows based on a genome file which contains data like the one below. I was wondering if anyone has encountered a program that can create windows for you based on intervals such as 100k, 10k, 50k.. etc. While I can simply manipulate VCFtools PI output to create the desired template, I would be interested in knowing if there is a package or program out there that can do this with data similar to the one below. I am not familiar with any tools that can do this but it would helpful to be pointed in the right direction.

Sample Data

Chromo    Size
Chrm1     500000
Chrm2     400000

Desired outcome

Chromo    Start     End
Chrm1     1            100000
Chrm1     100001            200000
Chrm1     200001            300000
Chrm1     300001            400000
Chrm1     400001            500000
Chrm2     1            100000
Chrm2     100001            200000
Chrm2     200001            300000
Chrm2     300001            400000

Best,

GG

bin Assembly sequence alignment snp • 1.8k views
ADD COMMENT
5
Entering edit mode
5.6 years ago

With BEDOPS bedops --chop:

$ export WINDOWSIZE=100000
$ tail -n+2 in.txt | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop ${WINDOWSIZE} - > out.0index.bed

Adjust WINDOWSIZE to the desired bin size. The file out.0index.bed is zero-indexed. This format is easier to use for set and map operations with bedops and bedmap etc.

But if you want a one-indexed text file with a header:

$ export WINDOWSIZE=100000
$ tail -n+2 in.txt | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop ${WINDOWSIZE} - | awk -vOFS="\t" '{ $2 += 1; print $0; }' | cat <(head -1 in.txt) - > out.1index.txt
ADD COMMENT
3
Entering edit mode
5.6 years ago

R version:

I'm not sure what type of file you are starting with but here is an example with a human genome .vcf file.

library(GenomicAlignments)
## the next library is only needed to import vcfs
library(VariantAnnotation)

## import your vcf files with the chromosome information
vcf <- readVcf('your_file.vcf')

## pull out the genome information (e.g. chr names and lengths) for your vcf file
genome.info <- seqinfo(rowRanges(vcf))

## tile that genome into bins of 10kb
tile.size <- 10000
tiled.genome <- tileGenome( genome.info, tilewidth=tile.size)

## export these ranges as a bedfile
library(rtracklayer)
export.bed(unlist(tiled.genome),'tiled_genome.bed')

##  you can also look at the contents of the object
> tiled.genome
GRangesList object of length 309984:
[[1]]
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   1-10000      *

[[2]]
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
  [1]     chr1 10001-20000      *

[[3]]
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
  [1]     chr1 20001-30000      *

...
<309981 more elements>
-------
seqinfo: 286 sequences from an unspecified genome
ADD COMMENT
2
Entering edit mode
5.6 years ago
$ bedtools makewindows
ADD COMMENT
0
Entering edit mode
5.6 years ago
h.mon 35k

I completely misread your question, go for the other (correct) answers.

faSplit from the UCSC command-line utilities

 faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'. 
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

ADD COMMENT

Login before adding your answer.

Traffic: 2049 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6