Create a reference genome from aligned bam file
3
1
Entering edit mode
8 months ago
Vincent ▴ 20

Hi,

I would like to create a fasta file from consensus of a bam file previously aligned on a reference genome.

Here's an example:

Fasta from Reference genome : CHR_1 ATGATACGATGCTACGTAGCTACGTACGATCGTACGATCGATCGATCGA

Fasta from Bam file consensus : CHR_1 ATGATACGATGCTACNNNNNNNNNNNNNNNNNNNNNNNCGATCGATCGA

I tried with samtools Fasta function but the fasta file obtained doesn't keep the chromosome nor the position of the read. So I end up with a fasta file containing thousand of "chromosome" named something like AMY1X:05260:04932

If i'm not clear enough don't hesitate to ask more question and thank for your time!

bam bam2fasta fasta • 1.5k views
ADD COMMENT
0
Entering edit mode

You can use samtools and create FASTA reference that has a 1:1 coordinate correspondence with the original reference you used in alignment step. This will not include insertions.

samtools consensus -a --show-ins no in.bam -o ref.fa

If you want to include insertions, you can do sth like this.

samtools consensus -f fastq in.bam -o cons.fq
ADD REPLY
0
Entering edit mode

I wonder why you would want to do this. If you want to create a new genome, you should look into genome assembly and not first align to a reference genome.

ADD REPLY
0
Entering edit mode

Because we tried GBS on a population from Self-pollination and we want to see if we'll get a better alignement and snpcalling if we remove region that has not been sequenced on the parent that has been self pollinated.

ADD REPLY
3
Entering edit mode
8 months ago
Michael 54k

Samtools consensus should be the right tool for what you are trying to do. I would use the default Gap5-like algorithm.

ADD COMMENT
1
Entering edit mode
8 months ago
Vincent ▴ 20

Thanks to Michael , here was the solution:

samtools consensus -a -o sample.fa -d 5 sample.sort.bam

where :

  • -a make the chromosome as long as the reference genome even if you don't have reads that cover that area
  • -o sample.fa is the output name
  • -d 5 is the minimum depth
  • sample.sort.bam is the input sorted bam file that i wanted to convert as a fasta file
ADD COMMENT
0
Entering edit mode
8 months ago
iaagosch • 0

So, the bam file you have is an alignment from fastq files to a reference genome. And I'm also assuming you did it yourself and have access to the files, so you can test out other alignment softwares.

A note is that these commands are after the alignment using bowtie2, I don't know if they work with alignments in other softwares, such as BWA.

These are the options I used with bowtie2 for my alignment: --no-unal --no-hd --no-sq --no-mixed --no-discordant

After that disclosure, you can try these steps:

Sort and index the bam file with:

samtools sort example.bam -o example.sorted.bam
samtools index example.sorted.bam

Generate a vcf file and with it, a consensus between your aligned reads and the reference genome:

bcftools mpileup -Ou -f reference.fasta example.sorted.bam | bcftools call -mv -Oz --ploidy 1 -o calls.vcf.gz
tabix calls.vcf.gz
cat example.fasta | bcftools consensus calls.vcf.gz > consensus.fasta

You can remove the vcf file after, if you want and it should keep the reference headers.

This is a quick script that I wrote for one of my classes back in 2018, so some commands might (or not) have changed.

ADD COMMENT
1
Entering edit mode

While this should work to generate a fasta file via a variant calling step, it should be noted that the name of the bcftools consensus function is somewhat misleading (see the documentation). In fact, bcftools consensus should better be named "alternate reference maker" as in GATK. By default, it will replace every reference position with its ALT allele if present irrespective of coverage or frequency. This is very different from what many people would expect under the term consensus.

So, this pipeline would give you a genome with all positions replaced with their alternate allele even if a large majority of reads supports the reference allele. Lacking a variant filtration step, this would also add a lot of noise to your genome.

ADD REPLY
0
Entering edit mode

I didn't know about that, thank you for the information! (:

ADD REPLY
0
Entering edit mode

Thanks it's a start! I now achieve to get a fasta looking just like my reference genome, except that i only have N and SNP.

What I have now : NNNNNNNNNNANNNNNNNNNNNNNNNNNNTNNNNNNNNN

What I would like: NNNNNNTGCCATGCATCGATCGTACGTGGCNNNNNNNNNN

So is there a way to get the whole sequence of the read (20-350bp) instead of the snp?

Thanks

ADD REPLY
0
Entering edit mode

For reasons I have explained above, this approach should not be used. Please try samtools consensus.

Also, there is another mistake in the pipeline, given as it is, that has likely caused all the genomic positions to become 'N'.

ADD REPLY

Login before adding your answer.

Traffic: 1667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6