Question: Why do we need a .fai file and a .dict file of the reference during alignment and variant calling using GATK?

I'm trying to learn the theory behind various steps in variant calling using GATK. Before alignment using BWA-MEM we first index the reference genome and this generates a set of files with the extensions

chr13and17.fa.amb chr13and17.fa.ann chr13and17.fa.bwt chr13and17.fa.pac chr13and17.fa.sa

where chr13and17.fa is the FASTA file containing the reference genome.

The next step in the pipeline is generating a .fai using samtools with the command:

samtools faidx chr13and17.fa

Followed by generating a .dict file using Picard:

java -jar picard.jar CreateSequenceDictionary R=chr13and17.fa O=chr13and17.dict

I want to know WHY we generate a .fai file and a .dict file despite also indexing the genome. In the samtools manual, the reason for creating a .fai file is specified as:

Using an fai index file in conjunction with a FASTA/FASTQ file containing reference sequences enables efficient access to arbitrary regions within those reference sequences.

Isn't 'efficient access to arbitrary regions of the genome' also the aim of indexing? I understand the files themselves store different information in different, well, formats. But why all the different files though?

Posted 9 months ago by Cookie-san; updated 9 months ago by Pierre Lindenbaum

AFAIK only GATK and the Picard tools need the .dict file. You're right that .fai and .dict are both index files in a manner of speaking, but they serve different functions: .fai is the more widely used of the two, while GATK and Picard are built to work with a .dict file. Check out this thread on a similar topic: .dict file created by picard and by samtools

About all the different files that you encounter, here's my take: Welcome to Bioinformatics :-)

Posted 9 months ago

Index for BWA-MEM: a Burrows-Wheeler transform index used to map the reads.

.fai index: used by tools to list the chromosomes and to quickly fetch a subsequence from the FASTA file.

.dict: lists the chromosomes but also provides information such as the MD5 sum of each FASTA sequence (to be sure that you're using the same REF), the name of the organism(s), alias names, the URL where the sequences can be retrieved, etc. This dict file will be inserted into, or compared with, the BAM and VCF headers.
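
To make the difference between the two files concrete, here is a rough Python sketch of what each one stores. It is not the samtools or Picard implementation, just an illustration under some assumptions: the FASTA is well formed with a fixed line width per sequence, the function names (build_fai, fetch, sq_line) and the toy sequences are made up for this example, and a real .dict also carries an @HD line and usually a UR tag that are left out here.

```python
import hashlib
import tempfile


def build_fai(fasta_path):
    """Return .fai-style records: (name, length, offset, linebases, linewidth)."""
    records = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    records.append((name, length, offset, linebases, linewidth))
                name = line[1:].split()[0].decode()
                length = linebases = linewidth = 0
                offset = pos + len(line)  # first byte of the sequence itself
            else:
                bases = len(line.rstrip(b"\r\n"))
                if linebases == 0:
                    # the first sequence line defines the layout
                    linebases, linewidth = bases, len(line)
                length += bases
            pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


def fetch(fasta_path, record, start, end):
    """Return bases [start, end) (0-based) with one seek and one read."""
    _, length, offset, linebases, linewidth = record
    end = min(end, length)
    # translate base coordinates into byte coordinates, skipping newlines
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


def sq_line(fasta_path, record):
    """Build a .dict-style @SQ line: name, length and MD5 of the sequence."""
    name, length = record[0], record[1]
    seq = fetch(fasta_path, record, 0, length).upper()
    md5 = hashlib.md5(seq.encode()).hexdigest()
    return f"@SQ\tSN:{name}\tLN:{length}\tM5:{md5}"


# Toy reference (made-up sequences, not real chr13/chr17 data).
toy = b">chr13 toy\nACGTACGTAC\nGGTTAACCGG\nTT\n>chr17\nTTTTGGGGCC\nAA\n"
with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
    tmp.write(toy)
    fasta_path = tmp.name

records = build_fai(fasta_path)
for rec in records:
    print("\t".join(str(field) for field in rec))  # one .fai-style row
    print(sq_line(fasta_path, rec))                # one .dict-style row
print(fetch(fasta_path, records[0], 8, 13))        # random access via offsets
```

The .fai rows hold only byte offsets and line layout, which is enough to jump straight to any region without reading the whole file. The @SQ rows hold no offsets at all: they describe each sequence (name, length, checksum) so a tool can check that a BAM or VCF header was produced against the same reference. Neither helps with finding where a read maps, which is what the BWA index is for.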

Posted 9 months ago by Pierre Lindenbaum
