I have just downloaded the following GC content file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/hg19.gc5Base.txt.gz
The file contains data for all chromosomes in hg19. It is not in BED format, which makes it difficult to split into per-chromosome files.
>head hg19.gc5Base.txt
variableStep chrom=chr1 span=5
10001 40
10006 40
10011 40
10016 60
10021 60
10026 60
10031 40
10036 40
10041 40
Has anyone come across a way to split this file into per-chromosome files at a lower resolution? It is currently at 5 bp resolution, but I want 500000 bp resolution.
My current (extremely inefficient) approach, done in R, is:
data=read.table('hg19.gc5Base.txt', sep='\t', header = F, fill=T)
head(data)
V1 V2
1 variableStep chrom=chr1 span=5 NA
2 10001 40
3 10006 40
4 10011 40
5 10016 60
6 10021 60
7 10026 60
8 10031 40
9 10036 40
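library(zoo) ## rollapply() for windowed means
idx=grep('variableStep', data[,1]) ### rows where a new chromosome block starts
ends=c(idx[-1]-1, nrow(data)) ### last data row of each block
for(i in seq_along(idx)){
chrom=sub('.*chrom=([^ ]+).*', '\\1', data[idx[i],1]) ### name from the header line, e.g. chr1
vals=data[(idx[i]+1):ends[i],2] ### all values between this header and the next
### 100000 rows x 5 bp = 500000 bp; by=width gives non-overlapping windows
### (approximate: assumes consecutive rows are 5 bp apart, i.e. no gaps)
smoothed=rollapply(vals, width=100000, by=100000, FUN=function(x) mean(x, na.rm=T))
write.table(smoothed,chrom, sep='\t', row.names = F, col.names = F, quote=F)
}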
This is taking an extremely long time. Does anyone know of a more efficient way of doing this?
It's a wig file (variableStep format). See http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/wig2bed.html for converting it to BED.
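If you would rather stay in R, here is a minimal sketch of the binning step that reads the wig file in chunks instead of loading it all with read.table. The function name bin_wig and its arguments bin_size and chunk_lines are made up for this example. It assumes every non-header line is "position value", assigns each record to a bin by its start position, and weights every record equally, so treat it as a starting point rather than a tested pipeline.

```r
# Sketch: average a variableStep wig into fixed-size bins, chunk by chunk.
# bin_wig(), bin_size and chunk_lines are illustrative names, not an existing API.
bin_wig <- function(con, bin_size = 500000, chunk_lines = 1e6) {
  # readLines() only continues where it left off on an open connection.
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  totals <- NULL          # running sum and count per chrom/bin
  chrom <- NA_character_  # chromosome named by the most recent header
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break
    is_hdr <- grepl("^variableStep", lines)
    # Carry the current chromosome across chunk boundaries.
    seen <- c(chrom, sub(".*chrom=([^[:space:]]+).*", "\\1", lines[is_hdr]))
    line_chrom <- seen[cumsum(is_hdr) + 1]
    chrom <- seen[length(seen)]
    if (all(is_hdr)) next
    d <- scan(text = lines[!is_hdr], what = list(pos = 0, val = 0), quiet = TRUE)
    # Positions are 1-based, so position p falls in bin (p - 1) %/% bin_size.
    key <- paste(line_chrom[!is_hdr], (d$pos - 1) %/% bin_size, sep = "\t")
    part <- rowsum(cbind(sum = d$val, n = 1), key)
    # Merge this chunk's sums and counts into the running totals.
    totals <- rowsum(rbind(totals, part), c(rownames(totals), rownames(part)))
  }
  if (is.null(totals)) return(NULL)
  key <- do.call(rbind, strsplit(rownames(totals), "\t", fixed = TRUE))
  bin <- as.numeric(key[, 2])
  out <- data.frame(chrom = key[, 1],
                    start = bin * bin_size,       # 0-based, BED-style start
                    end = (bin + 1) * bin_size,   # exclusive end
                    gc = unname(totals[, "sum"] / totals[, "n"]),
                    stringsAsFactors = FALSE)
  out[order(out$chrom, out$start), ]
}

# Small demo on lines shaped like the head of the file (tiny bins and chunks).
demo <- c("variableStep chrom=chr1 span=5",
          "10001 40", "10006 40", "10011 40", "10016 60", "10021 60", "10026 60",
          "variableStep chrom=chr2 span=5",
          "1 20", "6 40")
con <- textConnection(demo)
res <- bin_wig(con, bin_size = 20, chunk_lines = 3)
close(con)
print(res)
```

For the real file you could pass gzfile('hg19.gc5Base.txt.gz') as con and keep the default bin_size of 500000, then use split(res, res$chrom) to get one data frame per chromosome for write.table. Only the per-bin sums and counts are kept between chunks, so memory use depends on chunk_lines rather than on the size of the file.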