Question

faidx implementation with C++

0

Entering edit mode

6.7 years ago

bdelolmo ▴ 10

Hello,

I am learning C++, and after learning STL I would like to do my own implementation of faidx to perform sequence extraction from a genome. I know there are nice APIs out there (seqAns, etc) but I want to understand the principles behind it. I have tried to understand the faidx implementation on samtools C API, but it' too much for learning the basics so I am a bit lost at the begining. Any suggestion?

FASTA Cpp indexing • 1.9k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 6.7 years ago by bdelolmo ▴ 10

score 2 · Answer 1 · 2017-08-18

Read http://www.htslib.org/doc/faidx.html, which discusses the format of _.fai_ files and how they are used to index well-formatted FASTA files. Working from this will be clearer than working from the implementation in HTSlib, and when you are more familiar with it the HTSlib/Samtools implementation may make a bit more sense.

score 1 · Answer 2 · 2017-08-19

with the following fasta file :

>chr1
ACGTATATATATGTAT
ACGTATATATATATAT
>chr2
ACGTATATATTATATT
ACTATATATATTATAT
ACTATATATATATATT
GTATATATAAATTATA

the faidx file contains :

chr1    32      6       16      17
chr2    64      46      16      17

One line per fasta header.

column 1 : header name
column 2 : sequence size ( how many nucleotid)
column 3 : file offset where the sequence start. ( How many caracter from the top of the file )
column 4 : line size ( how many nucleotid per line without cariage or line return )
column 5 : full line size ( how many caracter with line return )

Now, if you would like to get a sub sequence efficiently, read the index and create the following function:

string sequence(string chr, int start, int end)
{
           Index index = allIndex[chr];
           // where file reading start 
           int offset = index.offset + start + start / index.chompLineSize;
           // move into file 
           file.seek(offset);
           // int length = end - start
           return file.read(length + (length)/index.chompLineSize);

}

Here is my own implementation using Qt/C++ https://github.com/labsquare/CuteFasta/blob/master/src/FastaFile.cpp#L10

score 1 · Answer 3 · 2017-08-19

You already have some pretty good answers, but I wanted to point out that I've done a Python implementation here: https://github.com/mdshw5/pyfaidx/blob/master/pyfaidx/__init__.py#L537

There's a lot of object-oriented stuff to get around, but the docstring explains the logic behind fetching a subsequence:

    Fetch the sequence ``[start:end]`` from ``rname`` using 1-based coordinates
    1. Count newlines before start
    2. Count newlines to end
    3. Difference of 1 and 2 is number of newlines in [start:end]
    4. Seek to start position, taking newlines into account
    5. Read to end position, return sequence