Biostar Beta. Not for public use.
Question: Command For Merging/Joining
0
Entering edit mode

Hi, My protein of interest is a multiple domain protein, and I would like to prepare a "structure based multiple sequence alignment" of it (around 1000 sequences).

Since the available structures of my protein are in different conformational states, I had to make multiple sequence alignments for EACH DOMAIN SEPARATELY. Now I have 6 high quality alignments, corresponding to the six domains of my protein.

The question is that: How can I merge/attach/join these files together? Notes:

  • I am new to programming!
  • All files are in Fasta format (also i have them in Clustal format)
  • The length of ALL sequences in each file is the same (but the length of sequences in different files are different. For example in File 1, the length of all sequences is 120, and in file 2 all of them are 78)
  • Sequences are arranged base on the "ID" in these files (i.e. the sequence IDs in the files are the same & are in the same order)
  • I've used cat command, but it just added sequences at the end of file!
  • Example files are presented below (I have shown only the first 4 sequences. there are around 1000 sequences). they are not real sequences.

I appreciate your help regarding to this matter, and please let me know if you need any more information.

Cheers, Ali

#

#

File 1:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII

File 2:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GWIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GWIAKKI-AAAA

File 3:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--GGYYYKLIAKKI-AAAA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--GGYYYKLIAKKI-AAAA

File 4:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
--------HHWSGYYYKLIAKKI-AAAA--D
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
--------HHWSGYYYKLIAKKI-AAAA--D

File 5:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
AEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
EEE
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
EEE

File 6:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
A
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
A

So the final file should look like this: Result file:

>gi|6226123|sp|O67718.1|SECA_AQUAE/1-88 RecName: Full=Protein translocase subunit SecA
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DAEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA1
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA3
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
>gi|81837980|sp|Q821L5.1|SECA_CHLCV/1-78 RecName: Full=Protein translocase subunit SecA4
MLGWIAKKII--GWIAKKI-AAAA--GGYYYKLIAKKI-AAAA--------HHWSGYYYKLIAKKI-AAAA--DEEEA
ADD COMMENTlink 6.0 years ago akhalili26 • 0 • updated 6.0 years ago User000 • 270
Entering edit mode
1

show us a few lines of your fasta files please...

ADD REPLYlink 6.0 years ago
Pierre Lindenbaum
120k
Entering edit mode
0

Updated my question. please let me know if you need any more information.

ADD REPLYlink 6.0 years ago
akhalili26
• 0
Entering edit mode
1

That looks like CLUSTAL format, or something, but is not at all fasta.

ADD REPLYlink 6.0 years ago
pld
4.8k
Entering edit mode
2

It's called FASTA multiple alignment format; similar to FASTA but gap characters ("-") are allowed - http://www.bioperl.org/wiki/FASTA_multiple_alignment_format.

ADD REPLYlink 5.9 years ago
Neilfws
48k
Entering edit mode
0

do you want to merge the sequences together or files? basically, it is the same ID?

ADD REPLYlink 6.0 years ago
User000
• 270
Entering edit mode
0

I didn't get your question but if you are trying to join files side by side instead of end to end, you can try "join" in UNIX. Check here: http://www.albany.edu/~ig4895/join.htm

ADD REPLYlink 6.0 years ago
Ashutosh Pandey
11k
Entering edit mode
0

Does all of your 1000 sequences have their name ending with "SecA"?

ADD REPLYlink 6.0 years ago
Ashutosh Pandey
11k
4
Entering edit mode

Here is a python script that will concat your alignments, assuming each alignment has the same id in the fasta header

import sys
import re
def __main__():
    infile  = sys.argv[1]
    outfile = sys.argv[2]

    with open(infile, 'rb') as fi:
        seqs = fi.read().split('>')[1:]
        seqs = [x.split('\n')[:2] for x in seqs]

    merge = dict()

    #index 0 is header
    #index 1 is seq
    for x in seqs:
        key = re.search('(?<=gi\|)([0-9]*)(?=\|)', x[0]).group(0)
        try:
            merge[key] = merge[key] + x[1]

        except KeyError:
            merge[key] = x[1]


    #write results to file
    with open(outfile, 'w') as fi:
       for x in merge.keys():
           fi.write('>' + x + '\n')
           fi.write(merge[x] + '\n')

__main__()

To put your individual files into a single file:

cat f1.fasta, f2.fasta, ... > bigfile.fasta

You would then run the above script with:

python myscript.py bigfile.fasta outputfile.fasta

Just remember to adjust the file names here to whatever you're working with.

ADD COMMENTlink 5.9 years ago pld 4.8k
Entering edit mode
0

Sorry for the errors in the code, I verified the regex but typed it into the post, hence the errors. Everything works now.

ADD REPLYlink 5.9 years ago
pld
4.8k
1
Entering edit mode

If I understand correctly, you have six files (one for each domain) and each file having 1000 sequences.

1) Excluding the clustalw file for the first domain, run the following command for all the other five files. Make sure you have backed up your files so if something goes wrong, your original files remain unchanged.

sed -i ';s/.* //' File_name_2

sed -i ';s/.* //' File_name_3

........

sed -i ';s/.* //' File_name_6

2) Then use join command in unix and concatenate the first file with the second , third .. sixth file.

join File_name_1 File_name_2 ... File_name_6 > Mega_file

Hope it works.

ADD COMMENTlink 6.0 years ago Ashutosh Pandey 11k
Entering edit mode
0

I'm new to scripting! I got this when running sed -i ';s/.* //' domain2_cut_clustal

sed: 1: "domain2_cut_clustal": extra characters at the end of d command

ADD REPLYlink 6.0 years ago
akhalili26
• 0
Entering edit mode
0

There should be a gap between "'" and domain_cut_clustal

ADD REPLYlink 6.0 years ago
Ashutosh Pandey
11k
Entering edit mode
0

Thanks for your help. It does the job of combining the files. BUT, When I tried to use the final "Mega_file", i realized that it is not a Clustal file! since in the middle of the file some of the lines don't meet the Clustal's standard. (i. e. some of the lines have less than 60 sequence symbols. This is because, for example, the last lines of file 2, which are in the middle of the "Mega_file", have 24 sequence symbols (which i believe will be interpreted as the end of file!))

any suggestion for turning this "Mega file" into a Fasta file or Clustal file?

ADD REPLYlink 5.9 years ago
akhalili26
• 0
1
Entering edit mode

Essentially, what you want to do regardless of programming language is:

  1. Parse a set of "fasta-like" files (they are FASTA multiple alignment format which for all intents and purposes is fasta)
  2. Extract the ID, description and sequence
  3. For each unique ID + description, concatenate the associated sequences
  4. Print results

So here is a quick, dirty and ugly solution which employs BioPerl SeqIO. I'm assuming a Linux-like system with a command line. BioPerl installation on Ubuntu-based Linux is as simple as _sudo apt-get install bioperl_.

First, assuming that your sequence files are the only files in their directory and they end with the suffix ".fa", concatenate them into one file:

cat *.fa > allseqs.fa

Now the BioPerl; save this as _merge_seqs.pl_ :

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my %seqs; # a hash to hold the results
my $file = shift or die("Usage = merge_seqs.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");

while(my $seq = $seqio->next_seq) { 
  $seqs{$seq->display_id." ".$seq->description} .= $seq->seq;
}
# now print
foreach my $key (keys %seqs) { 
  print ">$key\n$seqs{$key}\n";
}

and run like so (output to the file _merged.fa_ ):

perl merge_seqs.pl allseqs.fa > merged.fa
ADD COMMENTlink 5.9 years ago Neilfws 48k
0
Entering edit mode

This ought to do it:

unix$ cat file* | perl -pe 's/\r\n?/\n/' | perl -ne 'chomp;if(/^>/) { $id = $_ } else { $content{$id}.=$_} END{print map { "$id\n$content{$id}\n" } keys %content}'  > file.all

It will work even if the order of the sequences inside the files are flipped around or if the lengths are different inside the files. It requires all files to be small enough to fit in memory.

Edit:

Modified to accept Apple \r and Microsoft \r\n line separation, too.

ADD COMMENTlink 5.9 years ago ole.tange ♦ 3.4k
Entering edit mode
0

I ran your script. It took the last sequence of the last file and repeated it over 3000 times in the "file.all" file!

ADD REPLYlink 5.9 years ago
akhalili26
• 0
Entering edit mode
0

I have re-tested it on the 6 files you give, and I still get the correct result. Is there something very different hiding in your other files? Can you reproduce my results by running it on your 6 example files?

ADD REPLYlink 5.9 years ago
ole.tange
♦ 3.4k
Entering edit mode
1

Were these files ever saved on Windows? That might be the issue.

ADD REPLYlink 5.9 years ago
pld
4.8k
Entering edit mode
0

That might just explain the results. Edited answer to support both Windows and Mac. Does it work now?

ADD REPLYlink 5.9 years ago
ole.tange
♦ 3.4k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0