Extract consensus dna sequence of multiple genes from multiple bam files in fasta format
1
1
Entering edit mode
6.7 years ago
bioinfo8 ▴ 230

Hi,

I have multiple bam files (aligned with reference) and bed file of multiple genes (chr, start, end). I tried as follows for one bam file which gave me only reads information in fasta file :

samtools view -b -L genes.bed 1sorted.bam > 1sorted_genes.bam

samtools fasta 1sorted_genes.bam > 1sorted_genes.bam.fasta

I would like to extract the DNA sequence of whole genes (and not only reads) from all my bam files, including the gene name. I know I can add gene name in bed file, but how gene name can be integrated into fasta file. I want to translate DNA sequences to proteins for further downstream work.

Any guidance would be appreciated.

Thanks.

EDIT1: I have paired-end bam files.

samtools fasta genes bam consensus • 3.4k views
ADD COMMENT
0
Entering edit mode

samtools fasta will not generate a reference fasta. It turns each read from a one line sam format into two line fasta format. samtools.pl should have pileup2fq, that is closer to what you want.

ADD REPLY
0
Entering edit mode

Thanks, but I don't need reference fasta but consensus fasta sequence of all reads belonging to a gene which I can translate to protein.

ADD REPLY
1
Entering edit mode
6.7 years ago
Sej Modha 5.3k

Please have a look a the answers from this post: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing

ADD COMMENT
0
Entering edit mode

Thanks, but I have already gone through that, its not helpful, hence posted.

ADD REPLY
0
Entering edit mode

I have tried all the following ways :

1) samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq 

2) samtools bam2fq 1sorted_genes.bam > genes_sorted.fq

3) seqtk seq -a  genes_sorted.fq >  genes_sorted.fa

My question is similar to Reference Mapping To Consensus Fasta but answers are not helpful.

Kindly guide.

Thanks.

ADD REPLY
0
Entering edit mode

I have tried all the following ways

Just saying that without providing any information on what happened (did you get an error, did not work as expected or some thing else) you have little chance of getting useful help.

ADD REPLY
0
Entering edit mode

@genomax Ok. Actually, I wrote in the starting of my post that "gave me only reads information in fasta file" and not consensus of all reads belong to a sequence. "I would like to extract the DNA sequence of whole genes (and not only reads) from all my bam files, including the gene name. I know I can add gene name in bed file, but how gene name can be integrated into fasta file. I want to translate DNA sequences to proteins for further downstream work."

I have used Bam2Consensus tool from BamBam : the only tool I find to get direct consensus for a gene, but for most of my genes it shows NNNNN mostly with few nucleotides:

>gene
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGCTGGTCTTCATGTCCNNNNNNNNNNN

and for few genes proper nucleotide appears (without N).

Hence, I would like to try other option but did not find any other tools so far.

ADD REPLY
0
Entering edit mode

but for most of my genes it shows NNNNN mostly with few nucleotides

I am assuming you have checked the alignments to make sure you have read coverage across the entire gene/area of interst and you are using appropriate options for bam2consensus (I have not used this tool and it may need a certain number of reads covering a base in order for it to call a consensus). If there are no reads present then consensus tool is likely going to put N's in the area as padding.

ADD REPLY
0
Entering edit mode

That is what I also thought when I read samtools tview description in many places where N indicates no alignment with reference. But, I was thinking if I can confirm it with some other tools too.

Also, for those genes (although very few) where there are no or very few N's, I can easily translate to protein and do further analysis. But, those genes with so many N's, how should I translate them or what approach I should follow.

ADD REPLY
0
Entering edit mode

But, I was thinking if I can confirm it with some other tools too.

Have you inspected the alignments with a genome viewer like IGV? Are the alignments uniform/sufficiently deep/without too many sequencing errors, to be able to reliably call a consensus. There is no point in trying to call a consensus if you have sparse alignments/no coverage across the entire gene of interest.

If you have N's (and no real reads covering that region) there is not much you can do but to get additional sequencing done to try to get coverage in that area.

ADD REPLY
0
Entering edit mode

Yes I visualized for few genes and not aligned throughout but I can see variants for few regions inside those genes (I already did variant calling). I also want to mention here that these are exomes (paired-end bam files).

As you suggested, sequencing from more samples would be helpful.

ADD REPLY

Login before adding your answer.

Traffic: 2773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6