How To Calculate The Identify Of Each Site For A Sequence Alignment?
3
0
Entering edit mode
10.4 years ago
biolab ★ 1.4k

Dear all, I have an alignment of several sequences. I need ot make statistics ot the identify of each nt site. Would you please give a hand? Thank you very much.

alignment • 3.3k views
ADD COMMENT
0
Entering edit mode

Can you make clear what '..calculate the identify ..' means ?

ADD REPLY
0
Entering edit mode

Sorry for unclear descirption. Actually my analysis is to calculate the frequency of each site for an alignment. For example, 3 sequences have been aligned as shown below. I want to know the frequency of these four sites (No.1: 100% A, NO.2: 100% T, NO3: 100%G, NO. 4: 33.3%C 66.6%G). THANKS!

>seq1
ATGC
>seq2
ATGG
>seq3
ATGG
ADD REPLY
0
Entering edit mode

i further make some comments: my sequences are the same in length. So I just need to get the nt composition information for each position. THANKS.

ADD REPLY
0
Entering edit mode

do you allow gaps in your alignment ?

ADD REPLY
0
Entering edit mode

no gaps in the alignment. Actually my sequences are all 21 nt miRNAs.

ADD REPLY
1
Entering edit mode

Then it would be easy. Put all sequences in array, make loop with increment counter by array indices, then calculate similarity percentage in each loop count

ADD REPLY
0
Entering edit mode

Hi World, I have encounted problem with perl. I am a true beginner of perl. I tried my best to write a script, but not working. I show it below. This is an example: I have one alignment file that contains 4 sequences, each of which is 21 nt long. Actually I also need to run more alignment files in batch, but do not know how to use @ARGV. Could you please take a look at the following script? THANKS!

#!/usr/bin/perl
use strict;
use warnings;

open SEQ, "seqaln.txt"; #one file only, need to batch input many files
my @seq=<SEQ>;

    my $j = 0; 
    while ($j < 21){
    my $o = substr($seq[2],$j,1);
    my $p = substr($seq[4],$j,1);
    my $q = substr($seq[6],$j,1);
    my $r = substr($seq[8],$j,1);
    $j++;
print "$o\t$p\t$q\t$r";
}

close SEQ;
ADD REPLY
0
Entering edit mode

Note I don't know how to calculate similarity percentage, so just print the four nts. Please modify my script. THANKS

ADD REPLY
0
Entering edit mode

I finally worked out a script for displaying the nt at each position for 4 sequences. for example

>s1
ATGCACTGACTGTATGACTG
>S2
ATGGTGACTGTGACTGACTG
>S3
ATGGACCATGACTGCATGTG
>S4
ATCCACTGTGACGTGCAACA

My script is as follows:

#!/usr/bin/perl
use strict;
use warnings;

open SEQ, "seqaln.txt"; 
my @seq=<SEQ>;

    my $j = 0; 
    while ($j < length($seq[1])){
    my $o = substr($seq[1],$j,1);
    my $p = substr($seq[3],$j,1);
    my $q = substr($seq[5],$j,1);
    my $r = substr($seq[7],$j,1);
    $j ++;
print "$j\t$o\t$p\t$q\t$r\n";
}

close SEQ;

my output result is:

1       A       A       A       A
2       T       T       T       T
3       G       G       G       C
4       C       G       G       C
5       A       T       A       A
6       C       G       C       C
7       T       A       C       T
8       G       C       A       G
9       A       T       T       T
10      C       G       G       G
11      T       T       A       A
12      G       G       C       C
13      T       A       T       G
14      A       C       G       T
15      T       T       C       G
16      G       G       A       C
17      A       A       T       A
18      C       C       G       A
19      T       T       T       C
20      G       G       G       A
21
22

As a beginner i tried best to produce this output. However, I need to calculate in each position (here from position 1 to 20) the frequency of ATGC. How to achiev that? Could anyone give a hand? THANKS!

ADD REPLY
1
Entering edit mode
10.4 years ago
samsara ▴ 630

Here is python script. I think this is what you want.

from numpy import matrix
from collections import defaultdict

Seq1="ATCCTTAGGT"
Seq2="ACCCTACAAT"
Seq3="TTCTGGCAGC"

SequenceLength=len(Seq1)

A = matrix( [list(Seq1),list(Seq2),list(Seq3)])   

for i in range(0,SequenceLength):
    dic = defaultdict(int)
    for j in range(0,len(A)):
        dic[A[j,i]] += 1
    Str="Position:"+str(i+1)
    if dic['A'] !=0:
        Str = Str + " A: " + str(float(dic['A'])/len(A) *100)

    if dic['T'] !=0:
        Str = Str + " T: " + str(float(dic['T'])/len(A) *100 )

    if dic['G'] !=0:
        Str = Str + " G: " + str(float(dic['G'])/len(A) *100)

    if dic['C'] !=0:
        Str = Str + " C: "+ str( float(dic['C'])/len(A) *100)

    print Str

Output:

Position:1 A: 66.6666666667 T: 33.3333333333
Position:2 T: 66.6666666667 C: 33.3333333333
Position:3 C: 100.0
Position:4 T: 33.3333333333 C: 66.6666666667
Position:5 T: 66.6666666667 G: 33.3333333333
Position:6 A: 33.3333333333 T: 33.3333333333 G: 33.3333333333
Position:7 A: 33.3333333333 C: 66.6666666667
Position:8 A: 66.6666666667 G: 33.3333333333
Position:9 A: 33.3333333333 G: 66.6666666667
Position:10 T: 66.6666666667 C: 33.3333333333
ADD COMMENT
0
Entering edit mode

THANKS a lot for the python script!

ADD REPLY
0
Entering edit mode

At least upvote if you like the answer

ADD REPLY
0
Entering edit mode

I have upvoted these answers, which are very helpful! (Sorry, i am new in Biostar and have not noticed upvote). THANKS again for the python and perl scripts!

ADD REPLY
1
Entering edit mode
10.4 years ago
PoGibas 5.1k

biolab, know what you're looking for...
This "statistics ot the identify of each nt site" is called Position weight matrix (or PSWM, PSSM).
I strongly recommend going trough this tutorial (you can analyze your sequences their).
Also, similar question were asked on biostars before: Parsing a Fasta file by columns.

ADD COMMENT
1
Entering edit mode
10.4 years ago

Your script is close. What you want is something that keeps count of how often every letter comes up in each position. I'd use a hash of hashes. It's sort of like a three dimensional hash, where instead of one key having one value, each key is its own hash, with more keys and values. So

$hash{14}{'A'}++

would increment up the count of A's at position 14 by 1. $hash{14} is actually a reference to another hash that has the counts for all 4 letters. So:

$hash{$j}{$o}++

and so on for each sequence.

Then you use two nested foreach loops to get the data out

foreach my $position (keys %hash) {
     foreach my $letter ('A', 'C', 'G', 'T') {
          print "$position: $letter: ", $hash{$position}{$letter} / 4, "\n";
     }
}

Once you get your script working, you next job is to generalize it, so it works on more than 4 sequences.

ADD COMMENT
0
Entering edit mode

THANKS for your suggestions, really helpful!

ADD REPLY
0
Entering edit mode

Dear swbarnes2,

I tried my best to write a code, as shown below. However, I still get problem. could you please check it and indicate the erros for me? THANK YOU VERY MUCH.

#!/usr/bin/perl
use strict;
use warnings;

open SEQ, $ARGV[0];
my %hash = 0;
while (<SEQ>){
chomp;
my @seq = split (/ /, $_);
    for (my $i=0; $i<=$#seq; $i++){
    $hash {$i}{$seq[$i]} +=1;
    }
    }
close SEQ;
    foreach my $posi (keys %hash){
        foreach my $nt ('A', 'T', 'G', 'C'){
    print "$posi", "$nt", %hash{$posi}{$nt}/4, "\n";
        }
    }

Error message

syntax error at aln comp_2.pl line 18, near "%hash{"
Global symbol "$nt" requires explicit package name at aln comp_2.pl line 18.
syntax error at aln comp_2.pl line 18, near "},"
syntax error at aln comp_2.pl line 20, near "}"
Execution of aln comp_2.pl aborted due to compilation errors.
ADD REPLY

Login before adding your answer.

Traffic: 1980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6