Question

Merge Files (Adding File Names To Sequence Names)

2

Entering edit mode

10.3 years ago

biolab ★ 1.4k

Dear all,

I have many sequence files. An example is shown below.

FILE1:  Homo sapiens
>Act1
tgtgcatgcacatgactgaccaca
>Tub1
gttgcatgcatgcatgacgtacactg
...............


FILE2: C elegans
>Act1
tgcatgtgactgactgac
>Tub1
catgcatgactgcatgactgactg
.................


FILE3: A thaliana
>Act1
actgactgtacgtgactgactactgac
>Tub1
tgtctgatgcatgcatgactgactgtgacactg
.................

I need to merge all files and add species name to sequence name, the output is like below.

>Act1-Homo sapiens
tgtgcatgcacatgactgaccaca
>Tub1-Homo sapiens
gttgcatgcatgcatgacgtacactg
>Act1- C elegans
tgcatgtgactgactgac
>Tub1- C elegans
catgcatgactgcatgactgactg
>Act1-A thaliana
actgactgtacgtgactgactactgac
>Tub1-A thaliana
tgtctgatgcatgcatgactgactgtgacactg

Would anyone please inform me a perl, awk or shell code to finish this? Thank you very much!!

perl awk • 8.5k views

ADD COMMENT • link updated 8.5 years ago by Biostar 20 • written 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

Be aware that if you keep the space in the species name, it will generate a break in the FASTA header. So for example in the first line, Act1-Homo will be interpreted as the sequence ID and sapiens as a sequence description.

ADD REPLY • link 10.3 years ago by Neilfws 49k

0

Entering edit mode

Yes, it should be considered.

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

score 4 · Answer 1 · 2013-12-16

4

Entering edit mode

10.3 years ago

Pierre Lindenbaum 161k

awk '/^>/ {printf("%s-%s\n",$0,FILENAME);next;} {print;}' "Homo sapiens" "C elegans" "A thaliana"

ADD COMMENT • link 10.3 years ago by Pierre Lindenbaum 161k

4

Entering edit mode

Hi Pierre, here is a shorter version (I like to code golf): awk '{print /^>/ ? $0"-"FILENAME : $0}' *. If the line starts with >, print the line $0 + the FILENAME, else print the line. I replaced the list of files by a star, to be sure to adapt to the situation where many files have to be treated.

ADD REPLY • link 10.3 years ago by Frédéric Mahé ★ 3.2k

1

Entering edit mode

Thank you Frederic :-)

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you both. The awk commands are really cool.

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

Hi Pierre, Thank you greatly for your answer. However, the above is an example only. I have many files. I wrote a perl script (I am a true beginner) to do this job.

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    $a = shift $filename;
    while (my $line = <>){
    chomp $line;
    if ($line =~/>/){
        print "$line" . '-' . "$a\n";
        } else {
        print "$line\n";}
    }
}

My command is $ perl merge.pl *.fa
The error messag is Not an ARRAY reference at merge.pl line 8. Could you please check the above script and correct it? Thank you very much!

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

1

Entering edit mode

Correcting your code:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    open FH, "$filename" or die;
    while (my $line = <FH>){
    chomp $line;
    if ($line =~/>/) {
        print "$line" . '-' . "$filename\n";
    } else {
        print "$line\n";}
    }
    close FH;
}

Check that $filename contains your file name and need to be used to read the file content as FH. After each cycle, you need to close it.

ADD REPLY • link 10.3 years ago by JC 13k

1

Entering edit mode

Well done! In case you might be interested, you could do the following:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files) {
    open my $fh, '<', $filename or die $!;
    while (<$fh>) {
        chomp;
        print /^>/ ? "$_-$filename\n" : "$_\n";
    }
}

This uses the three-argument form of open, including lexical file handles. Note that close $fh; is missing. Perl will automatically close the file when a new handle is assignned to $fh. Also, the above takes advantage of Perl's default scalar $_ when chompping and matching. Finally, the if-then-else is replaced by the ternary operator.

ADD REPLY • link 10.3 years ago by Kenosis ★ 1.3k

score 3 · Answer 2 · 2013-12-16

Here's another Perl option:

use strict;
use warnings;

for my $file ( grep $_ ne $0, <*> ) {
    open my $fh, '<', $file or die $!;

    while (<$fh>) {
        chomp;
        s/^>.+\K/-$file/;
        print "$_\n";
    }
}

Usage: perl script.pl [>outFile]

The last, optional parameter directs output to a file.

Drop the script into the directory where only your fasta files live that you want to combine. The script will read all the file names in that directory--filtering out its own name--iterate through all, and append the current file name it's reading to the end of each fasta header line of the file.

It's problematic just sending file names on the command line to such a script, since you'd need to enclose each name within quotes, as Pierre Lindenbaum did with his example, because of the spaces. This solution bypasses the need to send those names to the script.

How does this work? The script uses a glob (<*>) to get a listing of all files in the current directory, then greps each against the script's name, so it's not going to read in the script. As each file line is read, the input record separator (usually \n) is chompped off, and a substitution--which looks for a > at the start of a line--is used. The substitution matches the complete header line, but the \K says "\Keep all to the left," so a hyphen and the file name is added to that line. Finally, that line is printed. (Note there is no close $fh;. Perl will automatically close the opened file when a new file handle is assigned to $fh.)

The following Python script produces the same results:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import re

for fileName in os.listdir('.'):
    if fileName == os.path.basename(__file__) or os.path.isdir(fileName):
        break

    inFile = open(fileName, 'r')
    for line in inFile:
        print re.sub(r'^(>.+)', r'\1-%s' % fileName, line.rstrip('\n'))

Usage: python script.py [>outFile]

Hope this helps!

score 2 · Answer 3 · 2013-12-21

2

Entering edit mode

10.3 years ago

Malcolm.Cook ★ 1.5k

here's a perl one-liner to call from the command line that you can use with your multiple fasta files:

perl -lpe '$_.="-$ARGV" if m/^>/' *.fa

ADD COMMENT • link 10.3 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

Hi Malcolm, thank you very much!

ADD REPLY • link 10.3 years ago by biolab ★ 1.4k

0

Entering edit mode

Simple and very useful!! Thank you!!

ADD REPLY • link 9.4 years ago by Prakki Rama ★ 2.7k

score 1 · Answer 4 · 2013-12-16

Hello biolab, I don't know if you are a beginner in Perl or in programming. If you are starting at programming too, I would suggest you to try python, as it is closer to english. Here is a longer version that is easier to read and understand:

#!/usr/bin/python
import os

def merge(file_name):
    data = []
    with open(file_name) as fh:
        for line in fh.readlines():
            line = line.replace("\n", '')
            if line.startswith('>'):
                new_header = (line+" -"+file_name)
                data.append(new_header+"\n")
            else:
                data.append(line+"\n")

    with open('output','a') as fh:
        fh.write(''.join(data))

for file_name in os.listdir('.'):
    if file_name.endswith(".fa"):
        merge(file_name)