BAM format file to FASTA alignment file
1
0
Entering edit mode
9.2 years ago
ben.ward ▴ 40

Hi,

I've been given a reference genome and BAM file(s) which are reads aligned to that reference. This will be the first time I've done anything with BAM/SAM formats. I'd like to know how I would go about generating for each BAM file, the full sequence, using the reads in the BAM file and the reference they are aligned to, essentially resulting in a FASTA format alignment of the reference, and then each sequence represented by each BAM file. What are the considerations? For example at the top of my head, whether reads are paired end and their strandedness matters.

I'd like to know if this can be done in R / Bioconductor as I have some familiarity with it - mostly using Biostrings.

Thanks,

Ben.

BAM SAM R FASTA alignment • 8.2k views
ADD COMMENT
0
Entering edit mode

I re-read your question and I think you actually want a fasta format alignment. You actually probably don't want that. You are new to BAM/SAM and so you might not realize yet that it would be pretty awful having it in a different alignment format. Take a look at the actual file or the first several lines and just see what the file means before moving onto something more familiar like fasta.

e.g., samtools view file.bam | head -n 999 | column -t | less -S

ADD REPLY
3
Entering edit mode
9.2 years ago

The simplest method for generating the consensus sequence is to simply use bcftools consensus. An example is given in the documentation. Obviously, you'll need to have called the variants. GATK has a similar tool.

ADD COMMENT
0
Entering edit mode

In samtools v1.1, you can use samtools bam2fq.

Usage:   samtools bam2fq [-nO] [-s <outSE.fq>] <in.bam>

Options: -n        don't append /1 and /2 to the read name
         -O        output quality in the OQ tag if present
         -s FILE   write singleton reads to FILE [assume single-end]
ADD REPLY
0
Entering edit mode

I've interpreted OP as requesting the consensus sequence rather than reproducing fastq files.

ADD REPLY
0
Entering edit mode

Bcftools consensus command does:

Create consensus sequence by applying VCF variants to a reference fasta file

so I think that it is far from being straightforward to use this tool for obtaining the consensus of mapped reads which does not contain bases from the reference sequence - which I believe is what OP was looking for. I'm now solving the same task (to reconstruct consensus of mapped reads and not reference sequence modified by VCF variants). I'm using samtools mpileup + bcftools:

samtools mpileup -ABuf reference.nt mapped.bam | bcftools call -cOz --pval-threshold 0.99 > mapped.vcf.gz
tabix mapped.vcf.gz 
cat reference.nt | bcftools consensus mapped.vcf.gz > mapped.fastq

... and the only consensus I can get still contains nucleotides from reference at some positions. In case you could suggest a specific commands or parameters how to use bcftools consensus to generate read consensus that would be really helpful.

ADD REPLY

Login before adding your answer.

Traffic: 1929 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6