Question

BAI index format

0

Entering edit mode

3.0 years ago

maxrwjones ▴ 60

Hi all,

I work with wheat which has annoyingly large chromosomes (most >600 Mbp), and I've stumbled across this limitation of the BAI indexing format (which indexes .bam files) in the samtools index manual:

"The BAI index format can handle individual chromosomes up to 512 Mbp (2^29 bases) in length. If your input file might contain reads mapped to positions greater than that, you will need to use a CSI index."

I don't suppose anyone knows if this limitation has been quietly fixed over the years? If not I'm a bit stuck as it seems that the .bam/.bai combination of files is required for some key pieces of software I'd like to use, e.g. IGV and ATACseqQC (Bioconductor). Unless the CSI index format is functionally interchangeable with BAI for such software?

Cheers!

BAI index BAM CSI • 2.7k views

ADD COMMENT • link 2.9 years ago by maxrwjones ▴ 60

1

Entering edit mode

As of v.2.4.x IGV supports .csi format index files.

ADD REPLY • link 3.0 years ago by GenoMax 141k

1

Entering edit mode

but as it it based on htsjdk, IGV doesn't support the long chromosomes (length> INT32_MAX)

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, thanks for your response. Could you elaborate on what (length> INT32_MAX) means? Does it translate to 2^32 bases?

ADD REPLY • link 3.0 years ago by maxrwjones ▴ 60

1

Entering edit mode

yes , that is what I meant. Htsjdk use int32 for the number of bases in the reference

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, that's very helpful! And is that per reference sequence? e.g. if I had, say, 20 chromosomes in my reference genome each <512 Mb (2^29), then the total comes out to more than 2^32. Would this still work?

ADD REPLY • link 3.0 years ago by maxrwjones ▴ 60

1

Entering edit mode

no, the 2^32 is the max for ONE chromosome, not for the size of the genome (2^64)

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks for confirming that, cheers!

ADD REPLY • link 3.0 years ago by maxrwjones ▴ 60

0

Entering edit mode

Thanks for your response! Good to know that .csi works in IGV. Unfortunately I'll need to use other downstream software too. I'm guessing from the fact that it was only since v.2.4.x that IGV supported it that it is by no means 'default' for a programme utilising .bai to also be able to use .csi?

ADD REPLY • link 3.0 years ago by maxrwjones ▴ 60

1

Entering edit mode

not a lot of tools even use the BAM indexes, only tools that need to sub-select a BAM file for a specific coordinate range for example, so it tends to be a genome browser type use case. you could also consider converting to CRAM and if the tool supports CRAM and not CSI, CRAM index should work with it (CRAM uses a CRAI index, not CSI and allows large genomes). You also mention FAI, FAI is a text based format, not a binary one, so wouldn't generally have these limitations on maximum int size, and also FAI is for FASTA indexing (e.g. FAI lets you get a slice of the FASTA at some given coordinates without reading whole file) not BAM (BAI/CSI allow getting a slice of BAM without reading the whole file)

ADD REPLY • link 3.0 years ago by cmdcolin ★ 3.8k

1

Entering edit mode

Hi cmdcolin, thanks for your response!

RE FAI, that was typo on my part, apologies, I did mean BAI. I have edited to reflect this.

Interesting to hear that the BAM indexes aren't that widely utilised... that's good news for sure! However, even if there isn't a requirement for an index, might it be the case that a piece of software can't handle reference sequences >2^29 bases?

e.g. I have since tried to use an R/Bioconductor package called ATACseqQC and it threw up an error relating to reference genome size:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'X' in selecting a method for function 'sapply': 'end' must be <= 536870912 file: A_50K_CER_sorted_unimappers.bam index: A_50K_CER_sorted_unimappers.bam

The number quoted in the error (536870912) is equal to 2^29.

ADD REPLY • link 3.0 years ago by maxrwjones ▴ 60

0

Entering edit mode

You can break your largest chromosomes into two (or more pieces, if needed) pieces so the individual pieces stay under the limit. Then you should be able to use all tools (in theory) with this reference. You will need to figure out how to manage/keep track of the overlaps (if you choose to break chromosomes that way).

ADD REPLY • link 3.0 years ago by GenoMax 141k

0

Entering edit mode

Indeed, that's the conclusion I've come to as well. Bit of a pain, as will need to re-coordinate annotation files for the split chromosomes too. Thanks for your insight!

ADD REPLY • link 2.9 years ago by maxrwjones ▴ 60