Question

How to construct multiple sequence alignment from group of pairwise alignments

0

Entering edit mode

7.3 years ago

kanshenglong • 0

I have a bunch of 10 or more very closely related DNA sequences (from different strains) aligned to a reference chromosome. How can I generate multiple sequence alignment along with "reference" of all these sequences without affecting individual alignments to the reference. Don't cut the blank sequence.

for example:

Input:

Original reference:
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC

Pairwise alignments:

Ref1: CGACAAT--GCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAGCAGAACAGATA-----ATTGCCTCTCATTTTC-CTCCC

Ref1: CGACAATGCACGACAGAGGAAGC--AGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
Seq2: CGACAAT-CACGACAGAGGAAGCTTAGAACAGATATTTAG---GCCTCTCATTTTCTCTCCC

Ref1: CGACAATGCACGACAGAGGAAG----CAGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq3: CGACAATGCACGACAGAGGAAGTTTTCAGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

Output: Final Multiple sequence alignment:

Ref1: CGACAAT--GCACGACAGAGGAAG----C--AGAACAGATATTTAGATTGCCTCTCA----TTTTCTCTCCC
Seq1: CGACAATAAGCACGACAGAGGAAG----C--AGAACAGATA-----ATTGCCTCTCA----TTTTC-CTCCC
Seq2: CGACAAT---CACGACAGAGGAAG----CTTAGAACAGATATTTAG---GCCTCTCA----TTTTCTCTCCC
Seq3: CGACAAT--GCACGACAGAGGAAGTTTTC--AGAACAGATATTTAGATTGCCTCTCAAAAATTTTCTCTCCC

alignment • 2.3k views

ADD COMMENT • link updated 7.3 years ago by GenoMax 141k • written 7.3 years ago by kanshenglong • 0

0

Entering edit mode

Someone has asked this question, But I can't run this script. http://www.perlmonks.org/bare/?node_id=866127

Thank you !!

ADD REPLY • link 7.3 years ago by kanshenglong • 0

0

Entering edit mode

I'm confused. you want to take the aligned query sequence from several pairwise alignments, and compare them all against one reference?

ADD REPLY • link 7.3 years ago by Joe 21k

0

Entering edit mode

jrj.healey:

 Thanks for your reply!
 I'm sorry I did not make it clear. As my sequences are much longer than the reference sequence, I can't align these sequences by Multiple Sequence Alignment. I want this result.

enter image description here result

ADD REPLY • link 7.3 years ago by kanshenglong • 0

0

Entering edit mode

You may not get an optimal answer by default run of an MSA program in this type of a case. You may have to do some manual editing of the alignments to achieve the exact result you want (editing within reason).

ADD REPLY • link 7.3 years ago by GenoMax 141k

0

Entering edit mode

HAving to do some editing (within reason) to a multiple sequence alignment is pretty normal. I wouldn't put that in the bucket of MSA not giving you optimal results. I spent my entire PhD doing highly divergent MSA and phylogenetic analysis, this is definitely a case that MSA is entirely appropriate for.

ADD REPLY • link 7.3 years ago by DG 7.3k

0

Entering edit mode

I meant to say that the answer from a default run of a MSA program may not be usable as is. I have amended the post above.

ADD REPLY • link 7.3 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-01-19

I'm not sure if this answers your question, I'm still not totally clear on the objective but here goes.

Given a reference sequence ("REF") and 2 dummy subject sequences (SEQ1 & SEQ2), perform pairwise alignments like you have, and get the output in fasta format, thus:

>REF
ATATACAGATATCC
>SEQ1
ATAT-CTGATA-CC

and

>REF
ATATACAGATATCC
>SEQ2
--ATACAG-TATCC

Then extract all the aligned fasta sequences exluding the reference:

$ python ignore1st.py alignment.fasta > alignment1.fasta #repeat for however many alignments you have, giving each output filename a different name/number.

# ignore1st.py

from Bio import SeqIO
import sys

handle = list(SeqIO.parse(sys.argv[1], "fasta"))


for each in handle[1:]:
    print(each.format("fasta"))

Yeilds:

$ python ignore1stfasta.py aln1.fasta
>SEQ1
ATAT-CTGATA-CC

$ python ignore1stfasta.py aln2.fasta
>SEQ2
--ATACAG-TATCC

Then I would simply concatenate all your aligned fasta sequences after the reference (I'm going to assume for now you either have your reference sequence as a separate file already, or are capable of extracting it from the alignment files):

$ cat aligned_reference.fasta alignment*.fasta > resultfile.fasta

>REF
ATATACAGATATCC

>SEQ1
ATAT-CTGATA-CC

>SEQ2
--ATACAG-TATCC

This doesn't account in anyway for the quality of your PSA though.

score 0 · Answer 2 · 2017-01-19

0

Entering edit mode

7.3 years ago

DG 7.3k

I'm not sure why you think you can't do a standard multiple sequence alignment just because your sequences of interest are longer than the reference? Since they are closely related and longer, you gain information by doing a standard multiple sequence alignment, as the "new" sequences will also be aligned against each other. Unless these sequences are absurdly long I don't see why it isn't amenable to a standard analysis.

ADD COMMENT • link 7.3 years ago by DG 7.3k

0

Entering edit mode

This was my thought too hence my confusion. Global alignment should handle it I would think?

ADD REPLY • link 7.3 years ago by Joe 21k

0

Entering edit mode

If the posted example in OP is all there is then yes. But since actual sequences (being compared to the reference) are longer (as depicted in the post above) then global alignments may not be appropriate.

ADD REPLY • link 7.3 years ago by GenoMax 141k

0

Entering edit mode

MSA != global alignment, although as I posted above to your other comment, MSA, including doing global alignments, is entirely appropriate. What is a "reference" sequence that you are comparing to is actually largely irrelevant in this case for the purposes of constructing a MSA. In fact you actually gain information from using your other sequences in the alignment process.

ADD REPLY • link 7.3 years ago by DG 7.3k