Question

Any way to differentiate from a BAM which minor build human reference genome was used for alignment?

0

Entering edit mode

8.1 years ago

Irene@Sequencing.com ▴ 270

If not stated in the BAM header/metadata, is there a way to ascertain from a BAM file which minor build of the human reference genome was used to generate that BAM file?

I've encountered a BAM file that was aligned using a minor build of GRCh38 but I'm unable to determine which specific minor build (such as whether it was GRCh38.p2, GRCh38.p3, GRCh38.p4, GRCh38.p5 or GRCh38.p6).

reference genome BAM • 2.6k views

ADD COMMENT • link 8.1 years ago by Irene@Sequencing.com ▴ 270

0

Entering edit mode

Attempted to implement Istvan's suggestion above but hit a wall and was not able to get it to work. Any other suggestions for a way to ascertain from a BAM file which minor build of the human reference genome was used to generate that BAM file?

ADD REPLY • link 8.1 years ago by Irene@Sequencing.com ▴ 270

0

Entering edit mode

What data do you have in the head of your BAMs? Looking at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/ i see that the total lengths, etc change between minor builds (panel on the right). Maybe there's a way to use that?

If we could see the samtools view -H of your BAM file that might give us more hints. For example, the BAMs I produce via Picard have the MD5 for each contig there, and a whole bunch of obscure scaffold names.

But really, the way Istvan described is really the best way, since often the sequence is simply updated between patches, not the total chromosome size, etc.

ADD REPLY • link 8.1 years ago by John 13k

0

Entering edit mode

Thanks for the info. The problem is that this isn't just one BAM file as I'm working on a universal solution regardless of what tool was used for alignment.

While the contig MD5 approach may work for Picard I'm not sure if it is applicable to other common aligners.

ADD REPLY • link 8.1 years ago by Irene@Sequencing.com ▴ 270

0

Entering edit mode

Ah, well, Picard doesn't do the aligning, it just added the MD5 information afterwards (I think during the MergeBam step). But of course you also have to provide it the reference sequence. I map with 3 different mappers, and they all have this M5 field per contig in the header.

Anywhoo - not relevant if it's not already there. The only thing to do in this case is would be to download every single minor build, align them all to one another as best you can. Find "informative sites" where minor builds differ. Compare to the BAM. It's certainly not an easy problem to solve.. perhaps a graph genome will help you here - where deviations from the latest reference genome is tagged with the minor version.

ADD REPLY • link 8.1 years ago by John 13k

score 3 · Answer 1 · 2016-03-02

3

Entering edit mode

8.1 years ago

Istvan Albert 100k

I have not done this myself so there may be a simpler way to do it but it got me thinking. Here is one possible workaround

Find a FIX patch (sequence update) that corresponds to a minor build.

Create a consensus with samtools mpileup from your bam file for that region and align it to back to the patch. If it matches the path you found the minor build.

FAQ in patches:

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/patches.shtml

List of patches:

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/?filters=type:novel,fix#current-regions

ADD COMMENT • link 8.1 years ago by Istvan Albert 100k

0

Entering edit mode

Brilliant! Adding this to my bag of tricks.

ADD REPLY • link 8.1 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

Thank you, Istvan. I've been working to implement your suggested approach but haven't yet been able to get it to work.

(1) I aligned a BAM file to GRCh38.p4.

(2) Tried to use this patch to determine the minor build version of this BAM file: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/?id=HG-2213

I used samtools mpileup command to generate VCF:

/home/samtools/bin/samtools mpileup -uf /mnt/data/refData/1051/1051.fa -r 18:48,514,982-48,684,706 -v sorted.bam > 1.vcf

1051.fa is reference file acquired through assembling separate FASTA files at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.19_GRCh38.p4/GCA_000001405.19_GRCh38.p4_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/

(3) I specified range for HG-2213 patch.

(4) I then tried to generate a consensus sequence:

First I've tried to use bcftools /home/bcftools/bin/vcfutils.pl vcf2fq 1.p.vcf > 1.fq and tried to use final file to align through bowtie2 by using reference file with HG-2213 contig only (I've extracted it from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.19_GRCh38.p4/GCA_000001405.19_GRCh38.p4_genomic.fna.gz file), but bowtie2 complained about incorrect source file:

Error: Read 18 has more read characters than quality values. terminate called after throwing an instance of 'int' Aborted (core dumped)
As an alternative, I tried using GATK to produce alternate reference. I'm not sure this is the appropriate to do and also unsure what to do with the alternate reference that was produced (which took about 20 minutes to generate).

So I've hit a wall while trying to implement your suggested approach above.

Theoretically, it is clear that this patch (HG-2213) must be reflected in the BAM file but practically I am having trouble figuring out how to detect it.

Please let me know if you have any further suggestions/advice. Thank you!

ADD REPLY • link 8.1 years ago by Irene@Sequencing.com ▴ 270

2

Entering edit mode

actually now that I did some research myself there is a chicken and egg problem here. In order to generate the consensus with the tools you have you will need the original reference, but in fact that is what you want to determine. While there are probably tools to generate a new consensus reference these rather fall into the class of de-novo assembly.

Here is another idea, try something simple.

Find a location that has sequence changes
Compare the way your reads in the BAM file look against the reference genome in these same locations

What IGV does when the sequence changes is that it highlights all mismatches as if were SNPs. You should be able to see the sequence changes (if there are any) very clearly.

ADD REPLY • link 8.1 years ago by Istvan Albert 100k

0

Entering edit mode

Thank you for the additional suggestion.

I attempted to use a BAM file to compare against references but it looks rather tricky and appears to fall into task of reconstructing the reference from the BAM file, which I don't think can be possible in diverse range of cases (since it appears to greatly depend on the types of the alignments).

Is there a tool that could be used to reconstruct reference FASTA (or just specific regions of it) for a given BAM so that I can automate the process of minor build detection?
Is there a tool that that will determine the sequence in the patch regions directly from the BAM file (without requiring ref genome used for alignment to be known)?

ADD REPLY • link 8.1 years ago by Irene@Sequencing.com ▴ 270

0

Entering edit mode

I wish I understood better exactly just how many of what kind of changes go into a minor build. Perhaps this would be a good time to do that :-)

How about a brute force approach:

Extract the reads from the BAM file
Realign these reads against each minor build (or a chromosome that has been affected by the changes)
The realignment that is most similar to the original BAM file might indicate the right minor build. Of course here is important to use the exact same parameters.

One might not need even need to realign all data just say some percent of it.

ADD REPLY • link 8.1 years ago by Istvan Albert 100k