Question

Extract CG Dinucleotide Coordinates From (Human) Genome

2

11.0 years ago

Philippe ★ 1.9k

Hi everybody,

to perform some analyses I need to extract the coordinates of all CG dinucleotides from the human genome (hg19). It would not be extremely difficult to write such a script but, for several reasons, I would prefer to rely on already existing tools or processed datasets.

However, I was surprised to find nothing on Google or in data repositories such as UCSC.

To your knowledge, is there such a tool or a downloadable dataset?

Thanks in advance for your answers.

genome • 5.9k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 11.0 years ago by Philippe ★ 1.9k

Ram · Answer 1 · 2013-04-05

3

11.0 years ago

Pierre Lindenbaum 161k

Something like this?

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define SEQNAME_MAX BUFSIZ

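/* Read FASTA on stdin and print "name<TAB>pos" for every CG dinucleotide.
   pos is the 1-based position of the C (i.e. the 0-based position of the G). */
int main(int argc,char** argv)
    {
    int c;
    int pos=0;          /* number of bases read so far in the current sequence */
    int prev_c=-1;      /* previous base, with 'c' normalized to 'C' */
    char name[SEQNAME_MAX];
    name[0]=0;

    for(;;)
        {
        switch((c=fgetc(stdin)))
            {
            case EOF: return EXIT_SUCCESS;
            case '>':
                {
                /* header line: keep the first word as the sequence name */
                int space=0;
                int name_length=0;
                pos=0;
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    if(space) continue;
                    if(isspace(c)) { space=1; continue;}
                    /* truncate over-long names instead of overflowing the buffer */
                    if(name_length+1<SEQNAME_MAX) name[name_length++]=c;
                    }
                name[name_length]=0;
                prev_c=-1;
                break;
                }
            case '\n':case '\r':case ' ':break;
            case 'g': case 'G':
                {
                if(prev_c=='C')
                    {
                    printf("%s\t%d\n",name,pos);
                    }
                prev_c=c;
                ++pos;
                break;
                }
            case 'c': c='C';/* fall through */
            default:prev_c=c; ++pos;break;
            }
        }
    return EXIT_SUCCESS;
    }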

$ gcc -O3 -Wall input.c
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz" |  gunzip -c | ./a.out  | head
chrM    33
chrM    61
chrM    78
chrM    80
chrM    91
chrM    96
chrM    105
chrM    120
chrM    162
chrM    170
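
For anyone who wants to cross-check the C program above, here is a minimal Python sketch of the same scan. It is not taken from any published tool, and the function name `cg_positions` is made up for illustration. One assumption to be aware of: it reports the 0-based coordinate of the C, whereas the C program, as far as I can tell from the code, prints the 1-based coordinate of the C, so the Python value plus one should match.

```python
def cg_positions(fasta_lines):
    """Yield (name, pos) for every CG dinucleotide in FASTA-formatted lines.

    pos is the 0-based coordinate of the C within its record.
    cg_positions is a hypothetical helper, not part of any library.
    """
    name = None
    offset = 0   # number of bases already consumed in the current record
    prev = ""    # last base of the previous line, so a CG split across lines is found
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: keep only the first word of the header as its name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            offset = 0
            prev = ""
            continue
        if not line:
            continue
        # Prepend the carried-over base and ignore case (soft-masked sequence).
        chunk = (prev + line).upper()
        start = chunk.find("CG")
        while start != -1:
            yield name, offset - len(prev) + start
            start = chunk.find("CG", start + 1)
        offset += len(line)
        prev = line[-1]
```

To run it on a FASTA stream, something like `for name, pos in cg_positions(sys.stdin): print(name, pos, sep="\t")` should work after `import sys`. Carrying the last base of each line over to the next is what lets it catch a C and G that sit on either side of a line break.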

ADD COMMENT • link 11.0 years ago by Pierre Lindenbaum 161k

0

Thanks for the code, Pierre. I coded something similar (in Python), but I was more interested in a published solution for comparison.

ADD REPLY • link 11.0 years ago by Philippe ★ 1.9k

0

I can upload the code on figshare if needed :-P

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

0

Hi Philippe and Pierre,

Did you ever find a solution to this? I am also very interested in finding all the CpG sites in the human genome.

-Diviya

ADD REPLY • link 10.5 years ago by diviya.smith ▴ 60

0

Great script, Pierre.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by PoGibas 5.1k

score 2 · Answer 2 · 2017-02-01

Would anyone like to validate my Perl script? Save it as cgpositionFinder.pl:

#!/usr/bin/perl
# Find the positions of all CpGs in the human genome.
# Input: chr1.fa, chr2.fa and so on.
# Output: chr1.CpG.positions.txt and so on, one "chr<TAB>pos" line per CpG,
# where pos is the 0-based coordinate of the C.
# Finally, merge all the CpG.positions.txt files and sort them with bedtools.

use strict;
use warnings;

my $fa = shift @ARGV or die "usage: $0 chrN.fa\n";
my ($chr) = split /\.fa/, $fa;
my $out = "$chr.CpG.positions.txt";

# Read the whole sequence into memory, skipping the FASTA header line.
my $seq = '';
open my $in, '<', $fa or die "cannot open $fa: $!";
while (<$in>) {
    next if /^>/;
    chomp;
    $seq .= $_;
}
close $in;

# Return [start, end] (0-based, inclusive) for every match of $regex in $string.
sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/ig) {
        push @ret, [pos($string) - length($1), pos($string) - 1];
    }
    return @ret;
}

open my $fh, '>', $out or die "cannot write $out: $!";
foreach my $match (all_match_positions('CG', $seq)) {
    my $pos = $match->[0];  # 0-based coordinate of the C
    print $fh "$chr\t$pos\n";
}
close $fh;

usage:

perl cgpositionFinder.pl chr1.fa
perl cgpositionFinder.pl chr2.fa
perl cgpositionFinder.pl chr3.fa
........

output

chr1.CpG.positions.txt
chr2.CpG.positions.txt
chr3.CpG.positions.txt
........

example:

wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
tar xzvf chromFa.tar.gz
for i in *.fa
do
perl ~/bin/cgpositionFinder.pl "$i"
done
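
The header comment of the script mentions merging the per-chromosome files and sorting them. A rough Python sketch of that step is below, under a few assumptions: the input files hold `chrom<TAB>pos` lines with the 0-based position of the C, as the script writes them; the output is BED3 with a 2 bp interval per CpG (my choice, since BED requires both a start and an end column); and everything fits in memory, which may not hold for a whole genome. The names `merge_to_bed` and `chrom_key` are hypothetical, and the numeric chromosome ordering is a choice that differs from a plain lexicographic sort.

```python
import re


def chrom_key(chrom):
    """Sort chr1, chr2, ... numerically, then all other names alphabetically."""
    m = re.fullmatch(r"chr(\d+)", chrom)
    return (0, int(m.group(1)), "") if m else (1, 0, chrom)


def merge_to_bed(paths, out_path):
    """Merge 'chrom<TAB>pos' files into one sorted BED3 file; return the record count."""
    records = []
    for path in paths:
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue
                chrom, pos = line.split()
                start = int(pos)  # assumed: 0-based position of the C
                # Half-open interval [start, start + 2) covers the C and the G.
                records.append((chrom, start, start + 2))
    records.sort(key=lambda r: (chrom_key(r[0]), r[1]))
    with open(out_path, "w") as out:
        for chrom, start, end in records:
            out.write(f"{chrom}\t{start}\t{end}\n")
    return len(records)
```

A call might look like `merge_to_bed(glob.glob("*.CpG.positions.txt"), "all.CpG.bed")` after `import glob`; the output file name is only an example. For a full genome, streaming the files through an external sort is probably the safer route.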