Question

Strategy for generating a consensus sequence for 100 complete bacterial genomes?

5.8 years ago

Alec Watanabe ▴ 60

Greetings,

I am working with 100 complete M. tuberculosis genomes in FASTA format. I want to align all the sequences to find the genomic regions shared by all the strains. MAUVE was the only program I found that could handle a data set this large. Any ideas on how to generate a consensus sequence from the common regions that MAUVE reports? Is there another program that can handle this much data and produce a consensus sequence? I tried PhyDE, but MUSCLE could only align a tiny initial portion of the genome. PhyDE would have been ideal since it can both align and build a consensus sequence, but it does not work even with two whole genomes.

Appreciate the attention.

consensus seq mauve • 2.5k views

ADD COMMENT • link 5.8 years ago by Alec Watanabe ▴ 60

Are you looking to create a pan-genome sequence? (A consensus sequence may not make sense, since the strains could differ significantly.) If so, here is a list of software from OmicTools.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Hi,

At first I thought the sequences were considerably different, but after running Gegenees, the heatmap showed a similarity of 99%+ between all strains. Constructing a pan-genome sequence is not a bad idea, but I am worried that doing so could exclude intergenic portions that might be interesting. I forgot to mention that the main objective is primer design.

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

So would it make sense to make more focused regional comparisons (rather than trying to create a general consensus) to assist with primer design?

ADD REPLY • link 5.8 years ago by GenoMax 141k

Yes, a more focused comparison would be ideal in terms of both time and computational power. The problem is that I currently do not know which genomic regions to consider, since I can't see what is common and what is not. Gegenees only generates the heatmap; it does not tell me what specifically is shared or different. Maybe there is another way to do it, but I am not seeing it. The only solution I've come up with is a whole-genome alignment to see what is common. After the alignment, I would build a consensus sequence and run the primer design. MAUVE does align blocks, but the aligned blocks still differ over short stretches, so I may need to check and select regions manually.
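
For illustration, the "align, then build a consensus, then pick regions" idea can be sketched in a few lines of Python. This is only a minimal sketch, assuming the aligned block has already been exported as equal-length, gap-padded strings (for example from one MAUVE block); the function names `consensus` and `conserved_windows` and the `min_len` parameter are hypothetical and not part of any tool mentioned here.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus over equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        # Most frequent symbol in this alignment column (gaps count too).
        base, _ = Counter(column).most_common(1)[0]
        out.append(base)
    return "".join(out)


def conserved_windows(aligned, min_len=18, gap="-"):
    """Half-open (start, end) column ranges where all sequences agree.

    A column counts as conserved only if every sequence carries the same
    non-gap symbol. min_len is an assumed minimum primer length.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    windows, start = [], None
    length = len(aligned[0])
    for i, column in enumerate(zip(*aligned)):
        identical = len(set(column)) == 1 and column[0] != gap
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                windows.append((start, i))
            start = None
    # Close a window that runs to the end of the alignment.
    if start is not None and length - start >= min_len:
        windows.append((start, length))
    return windows


block = ["ACGTAC", "ACGTTC", "ACG-AC"]
print(consensus(block))                      # ACGTAC
print(conserved_windows(block, min_len=3))   # [(0, 3)]
```

For primer design, the fully conserved windows are probably more useful than the consensus itself, because a majority-rule consensus hides the strains that disagree with it.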

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

That is why I was suggesting pan-genome tools, for example Panseq:

Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions.

You don't need the other features, but if it extracts regions unique to a genome or group of genomes and identifies SNPs within shared core genomic regions, that should get you started.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Nice. When you said pan-genomic analysis, I thought it would only consider genes and exclude intergenic regions, but the description of Panseq says "core and accessory regions". Sorry about that. I will read the Panseq documentation to see whether it suits the purpose, and test it if so. I'll let you know whether the software works out. Many thanks!

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

This was just one example. There are other similar programs, so take a wider look. Good luck.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Just one question: what does the Panseq output file look like?

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

Look for ##Description of output files on the page linked above.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Hi,

I've been testing Panseq since yesterday and got some results today. I noticed that Panseq creates some PHYLIP files. I have never used PHYLIP before; which PHYLIP program should I use to open the "binary.phylip" and "snp.phylip" files?

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

If you are not interested in phylogenetic relationships, you can safely ignore the PHYLIP files. PHYLIP is not the easiest program to use, but you can find a guide here.

ADD REPLY • link 5.8 years ago by GenoMax 141k

I see. Sorry for asking so many questions; I am new to Panseq, as you know. If I interpreted it correctly, the "binary_table.txt" file shows the pan-genome and indicates in which strains each genomic fragment is present or absent, right? So if I select the fragments for which every strain has a "1", they are theoretically present in all strains and therefore part of the "core genome", right? As for the "coreGenomeFragments.fasta" file, it lists the fragments that belong to the "core genome". I manually checked some of them, and apparently some are not present in all strains, even though the program says they are. Is that normal?
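
The "all strains have a 1" selection described above could be automated along these lines. This is a rough sketch only: it assumes "binary_table.txt" is tab-delimited with a header row, the fragment name in the first column, and one "1"/"0" column per strain. That layout is an assumption and should be checked against the actual file; the function name `split_core` and the demo data are made up for illustration.

```python
import csv
import io


def split_core(handle):
    """Split fragments into core (present everywhere) and non-core.

    Assumed layout (not verified against Panseq): tab-delimited, header
    row of strain names, first column = fragment name, cells "1" or "0".
    Returns (core_names, missing) where missing maps each non-core
    fragment to the list of strains that lack it.
    """
    reader = csv.reader(handle, delimiter="\t")
    strains = next(reader)[1:]
    core, missing = [], {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, calls = row[0], row[1:]
        if len(calls) != len(strains):
            raise ValueError(f"row {name!r} does not match the header")
        absent = [s for s, c in zip(strains, calls) if c != "1"]
        if absent:
            missing[name] = absent
        else:
            core.append(name)
    return core, missing


# Made-up demo table; replace with open("binary_table.txt") for real data.
demo = (
    "fragment\tstrainA\tstrainB\tstrainC\n"
    "frag1\t1\t1\t1\n"
    "frag2\t1\t0\t1\n"
    "frag3\t1\t1\t1\n"
)
core, missing = split_core(io.StringIO(demo))
print(core)     # ['frag1', 'frag3']
print(missing)  # {'frag2': ['strainB']}
```

Comparing the names in `core` against the headers in "coreGenomeFragments.fasta" would be one way to see whether the two files agree before checking individual fragments by hand.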

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

@Alec: I am sorry, but I can't help you with this. I have not used Panseq myself. My suggestion was based on your requirements.

Perhaps someone else will be along who can. You could also create a new post with this question.

ADD REPLY • link 5.8 years ago by GenoMax 141k

Many thanks @genomax ! You've helped me a lot just by suggesting the program. I will post another question about this.

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60

I forgot to ask: is this difference related to the "percentIdentityCutoff" value that we configure in the settings.txt file? For example, if I choose a value of 100, will it report only sequences that match exactly across all strains?
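
To make the question concrete, here is a generic sketch of how an identity threshold can mark a fragment as "present" even when no exact copy exists in a strain. It is not Panseq's implementation and makes no claim about how Panseq applies "percentIdentityCutoff"; it assumes a simple ungapped sliding-window comparison, and the names `percent_identity`, `best_identity` and `is_present` are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length strings."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_identity(fragment, genome):
    """Best ungapped identity of fragment against any window of genome."""
    n = len(fragment)
    if n == 0 or len(genome) < n:
        return 0.0
    return max(
        percent_identity(fragment, genome[i:i + n])
        for i in range(len(genome) - n + 1)
    )


def is_present(fragment, genome, cutoff):
    """Call the fragment present if its best identity reaches the cutoff."""
    return best_identity(fragment, genome) >= cutoff


fragment = "ACGTACGTAC"
genome = "TTACGTACCTACGG"   # contains the fragment with one mismatch

print(best_identity(fragment, genome))    # 90.0
print(is_present(fragment, genome, 85))   # True
print(is_present(fragment, genome, 100))  # False
```

Under a scheme like this, a fragment called present at a cutoff below 100 would not necessarily turn up in an exact text search of every strain, which could explain a discrepancy of the kind described; whether that is what happens in Panseq needs to be confirmed in its documentation.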

ADD REPLY • link 5.8 years ago by Alec Watanabe ▴ 60