Question

How to find out total # of CpGs sites from a fasta file?

0

Entering edit mode

7.4 years ago

orlando.wong ▴ 60

Hello biostars,

I want to find all the # of CpG sites in Chr3.fa from UCSC.

Opening the chr3.fa in microsoft Word and using the CTRL + F, Find and Replace crashed the program. I tried to make a C++ script to count the CGs in a test file.

#include <istream>
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;

int main()
{

        //Open the file:
        ifstream inputfile;
        inputfile.open("/mnt/i/home/Desktop/test.txt");


        //Get length of the file:
        inputfile.seekg (0, inputfile.end);
        int length = inputfile.tellg();
        inputfile.seekg (0, inputfile.beg);

        // read each base in the file for C
        int i;
        int cgcounter;
        char firstbuffer[1]; // Check if that is C
        char secondbuffer[1]; // Check if this is G
        for (i=0; i<= length; i++)
                {

                        inputfile.seekg(i, ios::beg);
                        inputfile.read(firstbuffer,1);
                        if (firstbuffer[1] == 'C' || firstbuffer[1] == 'c') 
                                {

                                     inputfile.seekg(i+1, ios::beg);
                                     inputfile.read(secondbuffer,1); 
                                                if (secondbuffer[1] == 'G' || secondbuffer[1] == 'g') 
                                                        {
                                                                cgcounter++;
                                                                i++;
                                                        }


                                }
                }




        cout << "The number of CGs in your file is " << cgcounter << "\n" ;
        return 0;
}

The test.txt is a file with just AAAAAAAAAAACGAAAAAAAAAAAACGAAAAAAAAAAACG (three CGs), but when I run the script, I get different outputs each time:

 labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ g++ cgcounter.cpp
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 635585824
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 711083296

The code compiles fine, but I am not sure where these random numbers are coming from. I am very new to C++ and C. Alternatively, is their an easier way to find the # of CGs in a fasta/txt file?

Thanks!

sequence Cpp fasta CpG • 3.9k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 7.4 years ago by orlando.wong ▴ 60

3

Entering edit mode

7.4 years ago

shenwei356 8.4k

Using locate (usage) command of SeqKit, which provides executable binary files for Windows/Linux... Just download, decompress and run.

Test data in FASTA format:

$ cat test.txt 
>test
AAAAAAAAAAAC
GAAAAAAAAAAA
ACGAAAAAAAAA
AACGcgc

Locating the CGs:

$ ./seqkit locate --ignore-case --only-positive-strand --pattern "(CG)+" test.txt 
seqID   patternName     pattern strand  start   end     matched
test    (CG)+   (CG)+   +       12      13      CG
test    (CG)+   (CG)+   +       26      27      CG
test    (CG)+   (CG)+   +       39      42      CGcg

Aligned result:

$ ./seqkit locate --ignore-case --only-positive-strand --pattern "(CG)+" test.txt | column -t 
seqID  patternName  pattern  strand  start  end  matched
test   (CG)+        (CG)+    +       12     13   CG
test   (CG)+        (CG)+    +       26     27   CG
test   (CG)+        (CG)+    +       39     42   CGcg

But, what's this for?

ADD COMMENT • link 7.4 years ago by shenwei356 8.4k

0

Entering edit mode

Thanks shen wei! I am trying to compute coverage of CpG sites from some datasets from the Infinium Methylation 450k. It has about 450,000 CpG sites and I wanted to know how much of chr 3's CpG sites were covered.

ADD REPLY • link 7.4 years ago by orlando.wong ▴ 60

3

Entering edit mode

7.4 years ago

dariober 14k

I think this could be done with Linux utilities (assuming you are on a Unix system)

grep -v '>' chroms.fa | grep -o -i 'CG' | wc -l

Edit: Of course this answers only the question How many CpG in total, it doesn't tell you where they are.

It's curious that you are switching from a very basic solution (MS Word) to an overly engineered one (C++)!

For the future, maybe worth looking into Python for this sort of text processing.

ADD COMMENT • link 7.4 years ago by dariober 14k

2

Entering edit mode

this will miss patterns like:

nnnnnnG
Cnnnnnn

ADD REPLY • link 7.4 years ago by shenwei356 8.4k

0

Entering edit mode

Indeed! That's quite an embarrassing oversight!

ADD REPLY • link 7.4 years ago by dariober 14k

2

Entering edit mode

or you can transfrom all the non GC to '\n'

grep -v "^>" chroms.fa  | tr -d "\n" | tr -c "[GCgc]" "\n" | grep -v '^$' | grep -o -i 'CG' | wc -l

ADD REPLY • link 7.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks so much! That command worked on a huge file so quickly. I will take your advice and learn some Linux utilities and Python for simplicity.

Also, will that count the lower case 'cg' or 'cG' or 'Gc'?

ADD REPLY • link 7.4 years ago by orlando.wong ▴ 60

1

Entering edit mode

Take care that my command is buggy (see shenwei356's comment). Use Pierre's version if anything. But yes, grep -i is case insensitive and will match cg or CG. And yes, grep is quite fast.

ADD REPLY • link 7.4 years ago by dariober 14k

score 4 · Accepted Answer · 2016-12-16

4

Entering edit mode

7.4 years ago

Pierre Lindenbaum 161k

A simple C program:

#include <stdio.h>


int main(int argc,char** argv)
    {
    long N=0L;
    int prev=-1;
    for(;;)
        {
        int c=fgetc(stdin);
        switch(c)
            {
            case EOF: case '>': {
                if(N>0) {
                    printf("\t%ld\n",N);
                    }
                if(c==EOF) return 0;
                fputc('>',stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n') {
                    fputc(c,stdout);
                    }
                N=0L;
                prev=-1;
                break;
                }
            case '\n':break;
            case 'g':case 'G': {
                if(prev=='C') N++;
                prev='G';
                break;
                }
            case 'c': case 'C': {
                if(prev=='G') N++;
                prev='C';
                break;
                }
            default: {
                prev=c;
                break;
                }
            }
        }
    return 0  ;
    }

e.g:

 wget -q -O - "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz" | gunzip -c | ./a.out 
>chr3   9469074

ADD COMMENT • link 7.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks Pierre!

I am trying to learn C as well. This is easy to understand. Could you explain a few parts of the program? How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier? Is there any significance to initializing it as -1.

ADD REPLY • link 7.4 years ago by orlando.wong ▴ 60

1

Entering edit mode

How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier?

the size of an integer is greater than the size of a char sizeof(char)<sizeof(int), so you can always store a char into a 'int'. Furthermore, fgetc returns a 'int', not a 'char' http://www.cplusplus.com/reference/cstdio/fgetc/

Is there any significance to initializing it as -1.

-1 is arbitrarty but I'm sure it is initialized as a non interesting value like 'A', 'T', '>', '\n' ...

ADD REPLY • link 7.4 years ago by Pierre Lindenbaum 161k