Interleaving Records From More Than One Fasta File
11.1 years ago
aindap ▴ 120

I am trying to run a model in PAML that requires me to combine records from multiple fasta files. I have 6 FASTA files, each with the same number of records. I want to interleave the records into a single file so that the result looks like this:

>record1_fileOne
AAA
>record1_fileTwo
AAA
>record1_fileThree
AAA
>record1_fileFour
AAA
>record1_fileFive
AAA
>record1_fileSix
AAA
>record2_fileOne
GGG

I wrote the code below, but it just concatenates the fasta records one file after another instead of interleaving them. I think there is probably some itertools trick that I'm just not seeing. Can anyone point me in the right direction?

I found this script that interleaves 2 fasta files, but I need to extend it to N fasta files:

import itertools
import sys

def read_fasta(fh):
    """Generator yielding (name, sequence) records from a fasta file handle.
    Adapted from http://stackoverflow.com/a/7655072/1735942"""
    name, seq = None, []
    for line in fh:
        line = line.rstrip()
        if line.startswith(">"):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))




fastafiles = sys.argv[1:]  # input fasta files from the command line (the original used an undefined `args` list)
filehandles = list(itertools.imap(open, fastafiles))  # one file handle per fasta file

# this walks each file in turn, so records come out concatenated rather than interleaved
for fh in filehandles:
    for id, seq in read_fasta(fh):
        print id
        print seq
11.1 years ago

Linearize each fasta file (one header<TAB>sequence line per record) and feed the results to paste. An example with 3 files:

paste \
  <(awk '/^>/ { printf("\n%s_file1\t",$0); next; } { printf("%s",$0); } END { printf("\n"); }' file1.fasta) \
  <(awk '/^>/ { printf("\n%s_file2\t",$0); next; } { printf("%s",$0); } END { printf("\n"); }' file2.fasta) \
  <(awk '/^>/ { printf("\n%s_file3\t",$0); next; } { printf("%s",$0); } END { printf("\n"); }' file3.fasta) \
  | tr "\t" "\n"
11.1 years ago
brentp 24k

This is a nice chance to use some of Python's itertools machinery. First, the code:

from itertools import groupby, izip, chain
import sys

def fasta_iter(fasta_name):
    fh = open(fasta_name)
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        header = header.next()[1:].strip()
        yield header, "".join(s.strip() for s in faiter.next())

# interleave the fastas with izip and chain the results. this is all lazy
fastas = chain.from_iterable(izip(*[fasta_iter(fa) for fa in sys.argv[1:]]))

for header, seq in fastas:
    print ">" + header
    print seq

The fasta_iter function just takes a fasta file name and creates a generator that yields (header, sequence) pairs.

Note that it uses the groupby function, splitting into new groups based on the appearance of ">" as the first character of a line.

Once it finds a header group (a line starting with ">"), it knows that everything in the next group (lines without ">") is the sequence, so it can just join the lines of that group.
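As a small illustration (with made-up line contents) of how groupby splits the lines into alternating header and sequence groups:

>>> from itertools import groupby
>>> lines = [">r1", "AAA", "CCC", ">r2", "GGG"]
>>> [(is_header, list(group)) for is_header, group in groupby(lines, lambda l: l[0] == ">")]
[(True, ['>r1']), (False, ['AAA', 'CCC']), (True, ['>r2']), (False, ['GGG'])]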

The script uses fasta_iter to create a list of lazy iterators with:

[fasta_iter(fa) for fa in sys.argv[1:]]

that only reads from each file on demand. Since you want to interleave the records, you call izip, which is the lazy version of python's zip:

>>> zip("abc", "123")
[('a', '1'), ('b', '2'), ('c', '3')]

and it can handle as many arguments as you want.

Then we use itertools.chain.from_iterable so that we can use only a single loop to iterate over the files -- otherwise, with zip, we'd need an extra loop.
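For instance (Python 2, to match izip in the code above), chaining the zipped records flattens them back into a single interleaved stream:

>>> from itertools import chain, izip
>>> list(chain.from_iterable(izip("abc", "123")))
['a', '1', 'b', '2', 'c', '3']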

There's no need to write any code to group fasta records, as that's one well-implemented function of Biopython.
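For example, a minimal sketch of what that looks like (Bio.SeqIO.parse yields one record at a time; the file name here is just a placeholder):

from Bio import SeqIO

for record in SeqIO.parse(open("file1.fasta"), "fasta"):
    print ">" + record.id
    print str(record.seq)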

well, if you need Biopython anyway then go for it, but if you just want to parse these files without introducing an additional dependency, then that's a great solution!

finally, somebody who understands Python concepts and is not just offering 'same-in-every-language' solutions. +1 on this

11.1 years ago

Here's a way to do it without using external libraries; it is written for six input files, but extends to any number by adjusting the loop bounds:

#!/usr/bin/env perl   

use strict;
use warnings;

#
# let's say your files are named file-1.fa through file-6.fa...
#

#
# open file handles for each of the six files, storing a reference to the 
# file handle in a list
#
my $fhRef;
foreach my $idx (1..6) {
  my $sourceFn = "file-$idx.fa"; 
  local *FILE;
  open (FILE, "< $sourceFn") or die "Could not open $sourceFn\n";
  $fhRef->[$idx] = *FILE;
}

#
# interleave data and print to standard output until we reach EOF on 
# the last file
#
# NOTE: this assumes sequence data are on one line, i.e., that the 
# header and sequence lines alternate on separate lines
#
my $fhsOpen = 0;
while ($fhsOpen == 0) {
  foreach my $idx (1..6) {
    my $fh = $fhRef->[$idx];
    my $header = <$fh>;
    my $sequence = <$fh>;
    print STDOUT "$header$sequence";
    if ((eof($fh)) && ($idx == 6)) {
      $fhsOpen = 1;
    }
  }
}

#
# close handles, for general file I/O hygiene
#
foreach my $idx(1..6) {
  close $fhRef->[$idx];
}

This solution is problematic for several reasons. Programmatically, these include:

  • The three-argument open should be used
  • Lexically-scoped file handles should be used

Additionally, how would your script handle the following valid fasta record?

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAA

It can't. You can easily fix the first two, and when using lexically-scoped variables for file handles, the files automatically close when those variables fall out of scope or the program ends. The third issue is solved by initializing a local copy of Perl's record separator to '>', the fasta record separator. For example, the following does this and interleaves the fasta records:

use strict;
use warnings;

local $/ = '>';
my %files = map { open my $fh, '<', $_ or die $!; $_ => $fh } @ARGV;

while (1) {
    for (@ARGV) {
        defined( my $record = readline $files{$_} ) or exit;
        next if $record eq $/;
        chomp $record;
        $record =~ s/\s+\z//;
        print $/ . "$record\n";
    }
}

Usage:

perl script.pl fastaFile [fastaFiles ... ] [>outputFile]

I applaud(!) you giving the OP a non-module option, but it's also beneficial to become familiar with and use the well-developed bio modules which exist for such tasks.

I was offering a solution that works with the input as described, though you are correct that it would not handle multiline sequence data without extra work. I like your use of map to set up the handle table, though. Much nicer. I was sloppy with the open call so I added a die message in case the file can't be opened.

11.1 years ago
Kenosis ★ 1.3k

Here's a Perl option that uses Bio::SeqIO:

use strict;
use warnings;
use Bio::SeqIO;

my @fastaFiles = qw(file1.fasta file2.fasta file3.fasta file4.fasta file5.fasta file6.fasta);
my %SeqIO_Objects;

my $outFasta = Bio::SeqIO->new( -file => '>outFile.fasta', -format => 'Fasta' );

$SeqIO_Objects{$_} = Bio::SeqIO->new( -file => $_, -format => 'Fasta' )
  for @fastaFiles;

while (1) {
    for my $fastaFile (@fastaFiles) {
        $outFasta->write_seq( $SeqIO_Objects{$fastaFile}->next_seq() or exit );
    }
}

This creates a hash of Bio::SeqIO objects, reads one fasta record from each object inside the while loop, and writes those records to outFile.fasta. It exits the while loop when a read fails (no more records).

Here's a Python option that uses Biopython:

from Bio import SeqIO

fastaFiles = [
    'file1.fasta',
    'file2.fasta',
    'file3.fasta',
    'file4.fasta',
    'file5.fasta',
    'file6.fasta',
    ]
allRecords = [list(SeqIO.parse(open(fastaFile, 'rU'), 'fasta'))
              for fastaFile in fastaFiles]

outFasta = open('outFile.fasta', 'w')

for i in range(0, len(allRecords[0])):
    for fastaRecord in allRecords:
        SeqIO.write(fastaRecord[i], outFasta, 'fasta')

outFasta.close()

This uses a list comprehension to create a list of lists, where each inner list holds the fasta records from one file. It then iterates through these lists to interleave the files' records and writes them to outFile.fasta.

Hope this helps!

This solution needlessly reads all of the fasta files into memory.

Yes, it does, as I was unable to locate Biopython's equivalent of BioPerl's next_seq() method.

I'd bet my fasta that SeqIO.parse() returns an iterable.

The documentation says yes: "The main function is Bio.SeqIO.parse() which takes a file handle and format name, and returns a SeqRecord iterator." (cf. http://biopython.org/wiki/SeqIO)
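As a quick interactive check (Python 2; the file name and record id are just placeholders), a single call to next() pulls one record without loading the rest of the file:

>>> from Bio import SeqIO
>>> records = SeqIO.parse(open("file1.fasta"), "fasta")
>>> records.next().id   # reads just one record from the file
'record1_fileOne'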

Appreciate this, and have provided an updated script in a reply.

You'd win that bet and I appreciate your comment. Have literally written only a few Python scripts, so am on a learning curve. Would now just do the following with the iterable:

import sys
from Bio import SeqIO

# one lazy record iterator per input fasta file
gens = [SeqIO.parse(open(fastaFile, 'rU'), 'fasta') for fastaFile in sys.argv[1:]]

while True:
    for gen in gens:
        try:
            inRec = gen.next()
        except StopIteration:  # one of the files has no more records
            exit()
        print '>' + inRec.id
        print str(inRec.seq)

Usage:

python script.py file1.fasta file2.fasta [...] [>fileOut.fasta]

Here, and in your comments, while (1) {} is superfluous in the Perl code. BioPerl will crash if it gets no files before you get to the while loop, and in your example in the comments to Alex's answer, open will die if it gets no files before you get to the while loop. Why do you think it's necessary? The only reason I can think of is that you're trying to check whether you got any files, but you will never get to that point in that situation.

Neither has to do with files, but rather with file lines: both Perl scripts read sets of six lines at a time from the files, and there's no way to know how many sets there are, so the while (1) loop is used to keep reading those sets and is exited when no more sets can be read.

When would a for (...) {} loop alone not do that for you? Also, I think you are reading/writing one line from one file at a time, not "six lines at a time from all the files" as you say.

What would be the iterable in that for loop such that you know the number of lines in the fasta files? Also, the script is reading sets of six lines at a time--not six lines at a time.

You are right, I stand corrected :). On a multi-fasta, while(1) is needed. Nice work.

Thank you, SES! :)
