Detecting Polymorphic Sites In Multiple Sequence Aligned Files
3
3
Entering edit mode
12.0 years ago
ngsgene ▴ 380

I have used ClustalW2 to get a multiple sequence aligned file (15 sequences for protein coding region). I can see the alleles that are different at specific positions for example my clustalw2 alignment is such.

seq2            AAAAAT 6
seq3            AAAAAT 6
seq1            AAAAAA 6
seq4            TTAAAA 6
                ::***:

How can I make a table of the different alleles that exist at one position. For example at position 3,4,5 I have all A's but at position 6 I have two A's and two T's

Is there a tool to make a table like this:

pos  1 6 
seq2 A T
seq3 A T
seq1 A A
seq4 T A

Thanks!

Edited: I was looking for a tool to make the job easier (and be assured of results for a large dataset) guess there is nothing out there except writing your own script.

multiple clustalw • 6.0k views
ADD COMMENT
11
Entering edit mode
12.0 years ago
raunakms ★ 1.1k

Here is a short script which uses the Bioperl module Bio::AlignIO. This script will read your alignment file (alignment_file.aln) and parse each column of the alignment at a time and then extracts the aligned nucleotide sequence from every alignment creating a tab separated table. The table is printed in an output file (table_ouput.txt).

The table will look like this (note: the number in the table indicates the column number):

1    A    A    A    T
2    A    A    A    T
3    A    A    A    A
4    A    A    A    A
5    A    A    A    A
6    T    T    A    A

Then you can write another script to parse the table and do any type of calculation you want to do !!!

Using Sequence Logo can also give you a rough idea of your over all sequence. Here is the sequence logo for your alignment above:

alt text

Here is the script to parse the alignment:

#!usr/bin/perl/ -w
use strict;
use warnings;

use Bio::AlignIO;
use Bio::LocatableSeq;

my $align_file = 'alignment_file.aln';
my $out_file = 'table_ouput.txt';

my $str = Bio::AlignIO->new('-file' => $align_file);
my $aln = $str->next_aln();

my $seq1 = $aln->get_seq_by_pos(1);
my $seq2 = $aln->get_seq_by_pos(2);
my $seq3 = $aln->get_seq_by_pos(3);
my $seq4 = $aln->get_seq_by_pos(4);

open(OUTPUT, ">$out_file") or die "error";

for (my $col = 1; $col<=$aln->length; $col++) 
{
    my $char_seq1 = $seq1->subseq($col, $col);
    my $char_seq2 = $seq2->subseq($col, $col);
    my $char_seq3 = $seq3->subseq($col, $col);
    my $char_seq4 = $seq4->subseq($col, $col);

    print OUTPUT $column, "\t";
    print OUTPUT $char_seq1, "\t";
    print OUTPUT $char_seq2, "\t";
    print OUTPUT $char_seq3, "\t";
    print OUTPUT $char_seq4, "\n";

}

close OUTPUT;
exit;
ADD COMMENT
0
Entering edit mode

Thanks Raunak, I can always transpose the output file to get the results as intended (rows to columns) using Unix. I looked into sequence logo as Niek de Klein mentioned, and I liked the visualization.

ADD REPLY
3
Entering edit mode
12.0 years ago
Niek De Klein ★ 2.6k

I don't know if you want to use the table for further processing, or you only want to use it for visualisation. In the latter case you'll probably want to make a sequence logo. This webserver makes a logo out of a multiple sequence alignment (see the examples if you don't know what a sequence logo is). It takes fasta, clustalw and flat format (see this for explanation of input).

ADD COMMENT
0
Entering edit mode

From this table I have to detect which site was derived (from the ancestor) and plot them to see freqeuency.. though its interesting to see what a sequence logo is. Thanks for your response.

ADD REPLY
0
Entering edit mode
12.0 years ago
User 2005 ▴ 70

I would suggest MySQL if your table is less than 100MB. Contact in PM if you need more info :)

ADD COMMENT
0
Entering edit mode

I fail to see why I would need a table in MySQL, I am referring to multiple sequence alignments to get polymorphic sites, a table for reference not an RDBMS table

ADD REPLY

Login before adding your answer.

Traffic: 2522 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6