Multi-Sequence Alignment For Many Groups Of Genes
10.3 years ago
biolab ★ 1.4k

Dear all, how can I make multiple sequence alignments for many groups of genes? Here is an example.

>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene3_human
TGATACGATGCTGACTGACTGAC
......

>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
>gene3_fly
TGATACGATGCTGACTGTGAC
......

>gene1_worm
ATTTGCGTGACTCTGaC
>Gene2_worm
GCGCATGATCCGATGccACTG
>gene3_worm
TgtgGATACGATGCTGACTGTGAC
...

>gene1_mouse
ATTTGCGTGACTCTGaC
>Gene2_mouse
GCGCATGATCCGATGccACTG
>gene3_mouse
TgtgGATACGATGCTGACTGTGAC
......

I need to compare each gene separately across these species. The output should look like the example below: all sequences have the same length, with gaps marked as -. Does anyone know how to perform this analysis? Please give me some suggestions. Thank you very much!

Human   ATTTGCGTGACTGACTG-C
Mouse   ATTTGCGTGACT--CTGAC
Worm    ATTTGCGTGACT--CTGAC
Fly     ATTTGCGTGACT--CTG-C
alignment • 2.8k views
10.3 years ago
Pavel Senin ★ 1.9k

Will clustalw work for you?

CLUSTAL 2.1 multiple sequence alignment


gene1_worm       ----ATTTGCGTGACT--CTGAC----
gene1_mouse      ----ATTTGCGTGACT--CTGAC----
gene1_fly        ----ATTTGCGTGACT--CTGC-----
gene1_human      ----ATTTGCGTGACTGACTGC-----
gene3_fly        ----TGATACGATGCTGACTG--TGAC
gene3_worm       -TGTGGATACGATGCTGACTG--TGAC
gene3_mouse      -TGTGGATACGATGCTGACTG--TGAC
gene3_human      ----TGATACGATGCTGACTGACTGAC
Gene2_human      GCGCGCATGATCCGATGACTG------
Gene2_fly        --GCGCATGATCCGATGACTG------
Gene2_worm       --GCGCATGATCCGATGCCACTG----
Gene2_mouse      --GCGCATGATCCGATGCCACTG----
                       :*..   ..*  *:

edit: yes, MUSCLE is another option, especially if the sequences vary in length

Gene2_human      ---GCGCGCA----TGATCCGATG--ACTG
Gene2_fly        -----GCGCA----TGATCCGATG--ACTG
Gene2_worm       -----GCGCA----TGATCCGATGCCACTG
Gene2_mouse      -----GCGCA----TGATCCGATGCCACTG
gene3_human      ---TGATACGATGCTGACTGACTG--AC--
gene3_worm       TGTGGATACGATGCTGACTG--TG--AC--
gene3_mouse      TGTGGATACGATGCTGACTG--TG--AC--
gene3_fly        ---TGATACGATGCTGACTG--TG--AC--
gene1_human      ---ATTTGCG----TGACTGACTG---C--
gene1_fly        ---ATTTGCG----TGACTC--TG---C--
gene1_worm       ---ATTTGCG----TGACTC--TG--AC--
gene1_mouse      ---ATTTGCG----TGACTC--TG--AC--
                         *     ***     **   *
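
Both alignments above mix all genes in one block, while the question asks for one block per gene with a species label on each row. As a rough sketch of that last formatting step, assuming the aligner was asked for FASTA output and the headers follow the `<gene>_<species>` pattern from the question, something like the following could work. The function names `parse_fasta` and `species_rows` are made up for this illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_rows(aligned_fasta):
    """Format aligned <gene>_<species> records as 'Species  SEQUENCE' rows."""
    records = parse_fasta(aligned_fasta)
    # In a real alignment every row has the same length (gaps included).
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("sequences differ in length, so this is not an alignment")
    rows = []
    for header, seq in records:
        # Assumption: the species is whatever follows the last underscore.
        species = header.rsplit("_", 1)[-1].capitalize()
        rows.append(f"{species:<8}{seq}")
    return rows
```

Calling `print("\n".join(species_rows(text)))` on the aligned FASTA for a single gene gives rows in the layout shown in the question.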

Thanks! I am just worried that, because I have many genes, some of them may not be well aligned when all sequences are put together. For example, the last C in fly and human gene1 is not aligned very well. Ideally I would run a multiple sequence alignment for each gene in batch. What are your ideas? I don't know the CLUSTAL algorithm.
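
For the per-gene batch idea, one possible first step is to split the combined FASTA into one file per gene, so that each gene can be aligned on its own. This is only a sketch under an assumption taken from the example data: the gene name is the part of the header before the first underscore, compared case-insensitively (so `Gene2_fly` and `gene2_fly` end up together). The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_gene(fasta_text):
    """Map gene name -> list of (header, sequence) for <gene>_<species> headers."""
    groups = defaultdict(list)
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            # Lower-case the key so 'Gene2_fly' and 'gene2_fly' share a group.
            gene = header.split("_", 1)[0].lower()
            groups[gene].append([header, ""])
        elif line and header is not None:
            # Sequence line: append to the most recent record of this gene.
            groups[gene][-1][1] += line
    return {gene: [tuple(rec) for rec in recs] for gene, recs in groups.items()}


def format_fasta(records):
    """Render (header, sequence) pairs as FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


def write_gene_files(groups, out_dir):
    """Write one <gene>.fasta file per group into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for gene, records in sorted(groups.items()):
        path = out_dir / f"{gene}.fasta"
        path.write_text(format_fasta(records))
        paths.append(path)
    return paths
```

For example, `write_gene_files(group_by_gene(open("all_genes.fasta").read()), "genes")` would leave `genes/gene1.fasta`, `genes/gene2.fasta` and so on, where `all_genes.fasta` is a placeholder file name.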

Clustering, as a process, is somewhat different from multiple alignment. You can pre-process your large dataset (how many sequences?) with cd-hit, which will cluster, i.e. partition, it (it can handle large data). Then you can apply clustalw to the clusters to obtain multiple alignments.

Hi Pavel, I have ~3000 sequences. Which tool provides the cd-hit command? Could you please say a little bit more about the partitioning, or about which tools can be used to pre-process so many sequences? A further question: can I run CLUSTALX(W) from the command line? That way I could run it 3000 times. Thank you very much!

You could try to run clustalw on your set of 3K sequences as it is, I guess. CD-HIT is software used to partition (cluster) a large dataset into groups of similar sequences. By setting a similarity threshold and other parameters, you can adjust the way it makes those groups. Assuming that the sequences within a group (a cluster) are similar, clustalw will perform multiple sequence alignment on them significantly faster. Yes, you can install clustalw on your computer, and you don't need to run it 3000 times, just once.
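
To make the CD-HIT step a bit more concrete, here is a hedged sketch. It assumes the nucleotide variant `cd-hit-est` is installed, and the threshold (0.9) and word size (8) are illustrative values, not recommendations for this dataset. The helper names are made up; `parse_clstr` reads the `.clstr` file that CD-HIT writes next to its output, so that each cluster can then be aligned separately.

```python
import re


def cd_hit_est_command(fasta_in, out_prefix, identity=0.9, word_size=8):
    """Build an argument list for cd-hit-est (nucleotide clustering)."""
    return [
        "cd-hit-est",
        "-i", str(fasta_in),     # input FASTA
        "-o", str(out_prefix),   # representatives; clusters go to <out_prefix>.clstr
        "-c", str(identity),     # sequence identity threshold
        "-n", str(word_size),    # word size; has to suit the chosen threshold
    ]


def parse_clstr(text):
    """Return a list of clusters, each a list of sequence names, from .clstr text."""
    clusters = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            clusters.append([])
        elif clusters:
            # Member lines look like: 0<TAB>23nt, >gene3_human... *
            match = re.search(r">(.+?)\.\.\.", line)
            if match:
                clusters[-1].append(match.group(1))
    return clusters
```

The command list can be passed to `subprocess.run(..., check=True)`. Note that clusters are formed by sequence similarity, so they need not coincide with the gene names in the headers.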

Hi Pavel, I have one more question. I have just installed ClustalW. However, after typing clustalw, a file input window popped up, then another window. It is a step-by-step mode. How can I run it in one go from the command line? I can prepare 3000 gene sequence files, but I don't know how to run clustalw in batch. When you have free time, could you please write me a command for running clustalw in batch (default parameters are OK)? Thank you in advance!!
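
ClustalW skips its interactive menu when it is given options on the command line, so a batch run can be scripted. The following is a sketch rather than a tested recipe: it assumes the executable is called `clustalw2` (on some installs it is `clustalw`), that each gene sits in its own `.fasta` file in one directory, and that default alignment parameters are acceptable. `clustalw_command` and `align_all` are names invented for this example.

```python
import subprocess
from pathlib import Path


def clustalw_command(fasta_path, exe="clustalw2"):
    """Build a non-interactive ClustalW call that aligns one FASTA file."""
    fasta_path = Path(fasta_path)
    out_path = fasta_path.with_suffix(".aln")
    return [
        exe,
        f"-INFILE={fasta_path}",
        "-ALIGN",                # do a full multiple alignment
        f"-OUTFILE={out_path}",  # alignment is written next to the input
    ]


def align_all(directory, pattern="*.fasta", exe="clustalw2"):
    """Run ClustalW once per matching file and return the commands that ran."""
    commands = []
    for fasta_path in sorted(Path(directory).glob(pattern)):
        command = clustalw_command(fasta_path, exe)
        # check=True raises CalledProcessError if ClustalW exits with an error.
        subprocess.run(command, check=True)
        commands.append(command)
    return commands
```

With the files in a directory named `genes` (a placeholder), `align_all("genes")` would produce `gene1.aln`, `gene2.aln` and so on. Adding `"-OUTPUT=FASTA"` to the list should give FASTA-formatted alignments instead of the default CLUSTAL format.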

IMO muscle is way better than clustal..

sure! let's put that example too.

Of course, these algorithms are designed for homologs. Somehow I'm getting the idea that you're trying to align unrelated sequences, which wouldn't make sense in almost any context. More than that, if these are protein-coding genes, you should be aligning amino acids instead of nucleotides.

Hi 5heikki, I need to align nucleotide sequences rather than protein sequences. I am using TargetScan to find miRNA targets in various species. The input file should be a multi-sequence alignment. Thanks!
