Extract pairs of sequences from a fasta file based on a CSV list of headers
1
0
Entering edit mode
6.6 years ago
ajsakla • 0

I have a fasta file of primer sequences and I have a csv file with the sequence ids of primer sequence pairs that I need to parse out into their own fasta files. What I need to do is extract each pair of sequences from the fasta and put them into their own fasta files.

So my files look something like this:

My fasta file of primer sequences:

>01
ACGTACGT
>02
ACGTACGT
>03
ACGTACGT
>04
ACGTACGT

My CSV list of primer pairs:

column1,column2
01,03
02,04

Based on this example data set, what I need to end up with is 2 fasta files of the pairs of fasta sequences i.e., File1.fa will contain sequences 01 and 03 and File2.fa will contain sequences 02 and 04.

Thanks in advance!

fasta extract • 2.8k views
ADD COMMENT
1
Entering edit mode
6.6 years ago
st.ph.n ★ 2.7k
#!/usr/bin/env python
with open('pairs.txt', 'r') as f:
    # Get pairs, ',' separated
    pairs = []
    for line in f:
        pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

with open('file.fasta', 'r') as fa:
    #open fasta to dictionary 
    primer = {}
    for line in fa:
        if line.startswith(">"):
            primer[line.strip().split('>')[1]] = next(fa).strip()

#For each pair, open new file, and print pair in fasta format
for p in pairs:
    with open('file' + str(pairs.index(p)) + '.fasta', 'w') as out:
        out.write('>' + p[0] + '\n' + primer[p[0]])
        out.write('>' + p[1] + '\n' + primer[p[1]])
ADD COMMENT
0
Entering edit mode

Thanks. I'm having some trouble getting this to run.

  1. I've copied and pasted this script into a text file as extract.sh
  2. I've chmod u+x extract.sh
  3. I've replaced the 'pairs.txt' and 'file.fasta' with the appropriate file names
  4. I've run ./extract.sh within the folder containing the two files.

But I get this error:

Traceback (most recent call last): File "./parse.sh", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLY
0
Entering edit mode

This is python code, not bash. mv extract.sh extract.py, run as python extract.py

ADD REPLY
0
Entering edit mode

I still get the same error:

Traceback (most recent call last): File "parse.py", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLY
0
Entering edit mode

I edited the error.

ADD REPLY
0
Entering edit mode

So now I get this error:

Traceback (most recent call last): File "extract.py", line 18, in <module> with open('file' + str(pairs.index(p)) + '.fasta') as out: IOError: [Errno 2] No such file or directory: 'file0.fasta'

It looks like it's looking for the file or directory file0.fasta so when I make an empty file0.fasta I get this error:

Traceback (most recent call last): File "extract.py", line 19, in <module> out.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

ADD REPLY
0
Entering edit mode

Should be fixed now.

ADD REPLY
0
Entering edit mode

I'm getting a similar error as before. Key error is the id of the first sequence in the first pair:

Traceback (most recent call last): File "extract.py", line 18, in <module> file.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

If it helps, here's a sampling of my primer pairs list.

ccRepeat-46449,ge00012
ccRepeat-52142,ge00379
ccRepeat-48210,ge00387
cc00375,ge00431
ccRepeat-46825,ge01057
ccRepeat-49062,ge01469
cc15291,ge01941
ccRepeat-46651,ge03344
cc22391,ge03673
ccRepeat-81763,ge03838

It's just occurred to me that python may not be reading this file as csv formatted. Could that be an issue?

ADD REPLY
0
Entering edit mode

this line is separating the ids by ',' into a tuple:

pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

The error you're getting is that the keys in the dictionary (headers from your fasta), are not matching the primers. Double check that the headers match the primers, and if there is anything else in the header, this needs parsed out. The script was based on your OP example, so there may be further things to consider where the script needs to be modified. If you're unable to troubleshoot on your own, it's best to find another method, ideally by your own testing and come back with errors. Biostars is not a coding service.

ADD REPLY
0
Entering edit mode

I realize now I should provide exact examples of the data I'm working with. I was able to get your script to work with my data which saved me many hours of parsing thousands of pairs manually. Thank you for your time.

ADD REPLY

Login before adding your answer.

Traffic: 2478 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6