Question

Command For Merging/Joining

0

Entering edit mode

10.1 years ago

akhalili26 • 0

Hi, My protein of interest is a multiple domain protein, and I would like to prepare a "structure based multiple sequence alignment" of it (around 1000 sequences).

Since the available structures of my protein are in different conformational states, I had to make multiple sequence alignments for EACH DOMAIN SEPARATELY. Now I have 6 high quality alignments, corresponding to the six domains of my protein.

The question is that: How can I merge/attach/join these files together? Notes:

I am new to programming!
All files are in Fasta format (also i have them in Clustal format)
The length of ALL sequences in each file is the same (but the length of sequences in different files are different. For example in File 1, the length of all sequences is 120, and in file 2 all of them are 78)
Sequences are arranged base on the "ID" in these files (i.e. the sequence IDs in the files are the same & are in the same order)
I've used cat command, but it just added sequences at the end of file!
Example files are presented below (I have shown only the first 4 sequences. there are around 1000 sequences). they are not real sequences.

I appreciate your help regarding to this matter, and please let me know if you need any more information.

Cheers, Ali

#

File 1:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII

File 2:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GWIAKKI-AAAA

File 3:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GGYYYKLIAKKI-AAAA

File 4:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--------HHWSGYYYKLIAKKI-AAAA--D

File 5:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
AEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
EEE

File 6:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
A

So the final file should look like this: Result file:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DAEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA

multiple-alignment • 7.5k views

ADD COMMENT • link updated 10.1 years ago by User000 ▴ 690 • written 10.1 years ago by akhalili26 • 0

1

Entering edit mode

show us a few lines of your fasta files please...

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Updated my question. please let me know if you need any more information.

ADD REPLY • link 10.1 years ago by akhalili26 • 0

1

Entering edit mode

That looks like CLUSTAL format, or something, but is not at all fasta.

ADD REPLY • link 10.1 years ago by pld 5.1k

2

Entering edit mode

It's called FASTA multiple alignment format; similar to FASTA but gap characters ("-") are allowed - http://www.bioperl.org/wiki/FASTA_multiple_alignment_format.

ADD REPLY • link 10.1 years ago by Neilfws 49k

0

Entering edit mode

do you want to merge the sequences together or files? basically, it is the same ID?

ADD REPLY • link 10.1 years ago by User000 ▴ 690

0

Entering edit mode

I didn't get your question but if you are trying to join files side by side instead of end to end, you can try "join" in UNIX. Check here: http://www.albany.edu/~ig4895/join.htm

ADD REPLY • link 10.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Does all of your 1000 sequences have their name ending with "SecA"?

ADD REPLY • link 10.1 years ago by Ashutosh Pandey 12k

score 4 · Answer 1 · 2014-03-14

Here is a python script that will concat your alignments, assuming each alignment has the same id in the fasta header

import sys
import re
def __main__():
    infile  = sys.argv[1]
    outfile = sys.argv[2]

    with open(infile, 'rb') as fi:
        seqs = fi.read().split('>')[1:]
        seqs = [x.split('\n')[:2] for x in seqs]

    merge = dict()

    #index 0 is header
    #index 1 is seq
    for x in seqs:
        key = re.search('(?<=gi\|)([0-9]*)(?=\|)', x[0]).group(0)
        try:
            merge[key] = merge[key] + x[1]

        except KeyError:
            merge[key] = x[1]


    #write results to file
    with open(outfile, 'w') as fi:
       for x in merge.keys():
           fi.write('>' + x + '\n')
           fi.write(merge[x] + '\n')

__main__()

To put your individual files into a single file:

cat f1.fasta, f2.fasta, ... > bigfile.fasta

You would then run the above script with:

python myscript.py bigfile.fasta outputfile.fasta

Just remember to adjust the file names here to whatever you're working with.

score 1 · Answer 2 · 2014-03-14

1

Entering edit mode

10.1 years ago

Ashutosh Pandey 12k

If I understand correctly, you have six files (one for each domain) and each file having 1000 sequences.

1) Excluding the clustalw file for the first domain, run the following command for all the other five files. Make sure you have backed up your files so if something goes wrong, your original files remain unchanged.

sed -i ';s/.* //' File_name_2

sed -i ';s/.* //' File_name_3

........

sed -i ';s/.* //' File_name_6

2) Then use join command in unix and concatenate the first file with the second , third .. sixth file.

join File_name_1 File_name_2 ... File_name_6 > Mega_file

Hope it works.

ADD COMMENT • link 10.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I'm new to scripting! I got this when running sed -i ';s/.* //' domain2_cut_clustal

sed: 1: "domain2_cut_clustal": extra characters at the end of d command

ADD REPLY • link 10.1 years ago by akhalili26 • 0

0

Entering edit mode

There should be a gap between "'" and domain_cut_clustal

ADD REPLY • link 10.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks for your help. It does the job of combining the files. BUT, When I tried to use the final "Mega_file", i realized that it is not a Clustal file! since in the middle of the file some of the lines don't meet the Clustal's standard. (i. e. some of the lines have less than 60 sequence symbols. This is because, for example, the last lines of file 2, which are in the middle of the "Mega_file", have 24 sequence symbols (which i believe will be interpreted as the end of file!))

any suggestion for turning this "Mega file" into a Fasta file or Clustal file?

ADD REPLY • link 10.1 years ago by akhalili26 • 0

score 1 · Answer 3 · 2014-03-17

Essentially, what you want to do regardless of programming language is:

Parse a set of "fasta-like" files (they are FASTA multiple alignment format which for all intents and purposes is fasta)
Extract the ID, description and sequence
For each unique ID + description, concatenate the associated sequences
Print results

So here is a quick, dirty and ugly solution which employs BioPerl SeqIO. I'm assuming a Linux-like system with a command line. BioPerl installation on Ubuntu-based Linux is as simple as sudo apt-get install bioperl.

First, assuming that your sequence files are the only files in their directory and they end with the suffix ".fa", concatenate them into one file:

cat *.fa > allseqs.fa

Now the BioPerl; save this as merge_seqs.pl:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my %seqs; # a hash to hold the results
my $file = shift or die("Usage = merge_seqs.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");

while(my $seq = $seqio->next_seq) { 
  $seqs{$seq->display_id." ".$seq->description} .= $seq->seq;
}
# now print
foreach my $key (keys %seqs) { 
  print ">$key\n$seqs{$key}\n";
}

and run like so (output to the file merged.fa):

perl merge_seqs.pl allseqs.fa > merged.fa

score 0 · Answer 4 · 2014-03-16

0

Entering edit mode

10.1 years ago

ole.tange ★ 4.4k

This ought to do it:

unix$ cat file* | perl -pe 's/\r\n?/\n/' | perl -ne 'chomp;if(/^>/) { $id = $_ } else { $content{$id}.=$_} END{print map { "$id\n$content{$id}\n" } keys %content}'  > file.all

It will work even if the order of the sequences inside the files are flipped around or if the lengths are different inside the files. It requires all files to be small enough to fit in memory.

Edit:

Modified to accept Apple \r and Microsoft \r\n line separation, too.

ADD COMMENT • link 10.1 years ago by ole.tange ★ 4.4k

0

Entering edit mode

I ran your script. It took the last sequence of the last file and repeated it over 3000 times in the "file.all" file!

ADD REPLY • link 10.1 years ago by akhalili26 • 0

0

Entering edit mode

I have re-tested it on the 6 files you give, and I still get the correct result. Is there something very different hiding in your other files? Can you reproduce my results by running it on your 6 example files?

ADD REPLY • link 10.1 years ago by ole.tange ★ 4.4k

1

Entering edit mode

Were these files ever saved on Windows? That might be the issue.

ADD REPLY • link 10.1 years ago by pld 5.1k

0

Entering edit mode

That might just explain the results. Edited answer to support both Windows and Mac. Does it work now?

ADD REPLY • link 10.1 years ago by ole.tange ★ 4.4k