Concatenate SNPs for phylogenetic analysis
6.0 years ago
e.martinez • 10
Australia

Dear All,

I have lists of SNPs from different samples; relative to a common ancestor, some SNPs are shared between samples and some are not. I'm interested in creating concatenated SNP files for phylogenetic studies. The problem is that some samples don't have a SNP at a particular position, and I need to fill that gap with the reference nucleotide so that I get complete concatenated files. Otherwise, phylogenetic programs ignore that position completely. I have done it with lookups in Excel, but with more than 30,000 of them this is quite time-consuming. Does anyone know a better way to do it?

SNP alignment • 3.0k views
Can you show us what your SNP list looks like? It sounds like you need to write a simple script, but users will need details to provide a helpful answer.

When I extract my SNPs from the VCF files into Excel, they look like this

1849 SNV C A
2532 SNV C T
9143 SNV C T
11820 SNV C G

for one isolate, and for the following isolates the positions of the SNPs can differ:

2532 SNV C T
2586 SNV G T
9143 SNV C T
11370 SNV C T

If I do a multiple alignment in MEGA it looks like the example below: where there is no SNP there is a gap. This reduces my analysis, as MEGA ignores columns with gaps. I need a better way to fill the gaps with the reference nucleotide than lookups in Excel. Any ideas?


ACACAGGGGCCCGCGAACCCAGCGCGGCCAA
ACACAGGG-CTCGCGAAGCCAGCGCTGCCAA
ACAGAGGGTCCCGCGAAGCCAGCGCTGCCAA
ACACAGGGTCTCGCGAACCCAG-GCTGCCAA
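
One way to do the gap filling without Excel is a short script. The following is only a minimal sketch using the two example isolates shown above. It assumes the four-column layout from the post (position, type, reference base, alternate base), single-nucleotide variants only, and that a position missing from a sample's list really matches the reference (it does not check coverage). The function names `parse_snp_table` and `concatenate_snps` are made up for illustration.

```python
def parse_snp_table(text):
    """Parse lines of 'position type ref alt' into {position: (ref, alt)}."""
    snps = {}
    for line in text.strip().splitlines():
        pos, _kind, ref, alt = line.split()
        snps[int(pos)] = (ref, alt)
    return snps


def concatenate_snps(samples):
    """Build one sequence per sample over the union of all SNP positions.

    ASSUMPTION: a sample with no SNP listed at a position carries the
    reference base there, so that base is used to fill the gap.
    """
    # Collect the reference base for every position seen in any sample.
    ref_at = {}
    for snps in samples.values():
        for pos, (ref, _alt) in snps.items():
            ref_at[pos] = ref
    positions = sorted(ref_at)

    seqs = {"reference": "".join(ref_at[p] for p in positions)}
    for name, snps in samples.items():
        seqs[name] = "".join(
            snps[p][1] if p in snps else ref_at[p] for p in positions
        )
    return positions, seqs


isolate_1 = parse_snp_table("""
1849 SNV C A
2532 SNV C T
9143 SNV C T
11820 SNV C G
""")
isolate_2 = parse_snp_table("""
2532 SNV C T
2586 SNV G T
9143 SNV C T
11370 SNV C T
""")

positions, seqs = concatenate_snps(
    {"isolate_1": isolate_1, "isolate_2": isolate_2}
)

# Print the result as FASTA, one record per sample plus the reference.
for name, seq in seqs.items():
    print(f">{name}\n{seq}")
```

The output contains only the positions that vary in at least one sample, so it stays small even for a large genome, and every sequence has the same length with no gaps.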


So, you can probably skip the Excel stage (always a good idea!) and use one of these solutions: https://www.biostars.org/p/6553/


Thanks David. The link has a variety of solutions that will surely help me.


I tried the solutions in the link, and they are good, but they end up giving me the whole genome. If I do a multiple alignment of that and run it in BEAST, it will take forever. So I was using only a list of positions from my reference and then finding them in the VCF files from my samples, which means I still don't have a better way to do it than in Excel. Any other suggestions? Thanks.


You have a few options that I can see:

Take your massive alignment and drop the non-variable sites (some MSA viewers do this, and it wouldn't be hard to script if you're dealing with very large genomes)
Use a "combine variants" tool from bcftools/GATK/Picard to make a single VCF for all your variants, then extract the sites from that VCF
Use bedtools intersect to create a set of polymorphic sites, then use the resulting BED file to extract only the variable sites from the reference genome

Which is best will depend on what you've already done and which tools you are confident with, but it should be possible.
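
The first option (dropping non-variable sites) could be scripted roughly as below. This is a sketch under assumptions rather than a tested pipeline: it takes an alignment already loaded as a dict of equal-length strings, treats `-`, `N` and `?` as missing data rather than as a state, and the function name `drop_invariant_columns` and the labels `seq_1` to `seq_4` are made up for illustration. The sequences are the four rows posted earlier in the thread.

```python
def drop_invariant_columns(seqs, missing="-N?"):
    """Keep only the columns where at least two different bases occur.

    ASSUMPTION: characters in `missing` are ignored when deciding whether
    a column is variable, but are kept in the output for retained columns.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all sequences must have the same length")

    keep = []
    for i in range(length):
        bases = {seqs[n][i].upper() for n in names} - set(missing)
        if len(bases) > 1:
            keep.append(i)

    trimmed = {n: "".join(seqs[n][i] for i in keep) for n in names}
    return keep, trimmed


alignment = {
    "seq_1": "ACACAGGGGCCCGCGAACCCAGCGCGGCCAA",
    "seq_2": "ACACAGGG-CTCGCGAAGCCAGCGCTGCCAA",
    "seq_3": "ACAGAGGGTCCCGCGAAGCCAGCGCTGCCAA",
    "seq_4": "ACACAGGGTCTCGCGAACCCAG-GCTGCCAA",
}

kept_columns, variable_sites = drop_invariant_columns(alignment)

# Print the reduced alignment as FASTA.
for name, seq in variable_sites.items():
    print(f">{name}\n{seq}")
```

`kept_columns` holds the 0-based column indices that were retained, which is useful if you need to map the reduced alignment back to genome positions afterwards.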

2.6 years ago
Cytosine • 440
Ljubljana, Slovenia

If the samples are really closely related (e.g. strains of the same organism), why not try a multiple sequence alignment on the concatenated SNP sequences, without filling the gaps with reference nucleotides?

5.9 years ago
Boston, MA USA

I agree with the response from Cytosine. I would add that it might be useful to add a length of reference sequence to each end of the concatenated SNPs in order to guide the alignment and have no overhangs or unaligned portions.
