FASTA sequence extraction using list of sequence name that are not identical with the headers?
5
1
Entering edit mode
8.8 years ago
seta ★ 1.9k

Hi all,

I 'm trying to extract some sequences from a fasta file that their header are as following:

>c0_g1_i1 len=526 path=[27:0-525]
TCTAGCTTGAAACCTGACATTAAAATGAAAAGGAGATCACTTGTAGTCCAATAAACCAAA
CTAGTTTTTAGAGAAATTTATTACTGTTTTATCTTCCCAACACTTGCAAATTTAGTAAGG
TATACATAAAAGTATGACATAATATTTAGATAAATAGATCTCTATTTTCATATGTTCAAG
AAAAAATCATCCTCATTTCTCAAAATTCTGAATAGAAAAGCTAAAATTCATGTGCTCTGA
TCATGTTATCCCTATGTCTAGCATTATTTGGTTCAGCCTGCGGTTAGTACAACTGAAGGG
TCACTGCCACGAAACTCCAACCATCGCAAATTCAAACATTTTACAAAAGAGCATGCGCTA
AATGACAATAATAGACTCCAGGTGATGTGTTCCATTGTTTCTAACACTTTGGCTATGGCA
AATTACTTGATAGCAAAATAATACAAGATGAGAAACCACACTTGAAATTACTTAAACCAC
TCTGCTCCAATTAGTCAGTTTGAACTACTAAACTAATAATTTCCTG
>c1_g1_i1 len=472 path=[27:0-471]
GAACTGGATTTAAATCAAGGATGTGGTTATGGATAGACTGCATGCGCATCTCACACATGA
AAACTAAGATCCCCTAGCACTCCCACCGTGCATCCCGCCATTCACAACAATCGTGATGTC
GCCCGGAACGTTTTCCATATGTCCAGGAGGTATCTTTGTCGCAATAACCTGGTTCTCCCG
CACAAACGATACCCGCAAGCAGTCCTGTGCACAGTCGAATACGGTGATGGATCCTGCTTT
GAACTCGTGGAAGGCCTTCTTGTCTTTCTCAGTGAAAGTCGCCTCGGTCCCCGTCACCCA
TAGGGAGTCAAGGTCCGATTTCGTAGTCATGTTTATTACCAAGGCTTGAGCTGCTTGCGT
TTCTAACTCTGCGGAATATGCTGCTGTATCTAGTACTTTGATGCCCAAAGGCAGCGGAAC
GGCCACGTCACAAATAGTTTAAGTTGGGCAGCAAATTAGGCGTGTGTCATGC

But, I have a text file containing sequence name that doesn't exactly match with fasta sequences header, it's like

>c0_g1_i1
>c1_g1_i1

Could you please help me out to extract my sequences of interest? Thanks in advance.

Assembly blast RNA-Seq sequencing • 10k views
ADD COMMENT
1
Entering edit mode
grep -w -f file.txt file.fasta

should return the complete headers..

ADD REPLY
0
Entering edit mode

If there is one space among name in the text file, the command return us the sequence with complete header. Now, I don't know how to create space among name in the text file?. It can be best command if it takes no long time.

ADD REPLY
1
Entering edit mode
8.8 years ago
Anima Mundi ★ 2.9k

In Python (for a FASTA file named foo.fa and a header list named header_list.txt):

def search(header):
    found = False
    for line in open('foo.fa'):
        if header.replace('\n', '') in line:
            print line,
            found = True
        elif '>' in line and found:
            break
        elif '>' not in line and found:
            print line,
            

for j in open('header_list.txt'):
        if len(j) > 2:
            search(j)
ADD COMMENT
0
Entering edit mode

Many thanks for all replies. I try them

ADD REPLY
0
Entering edit mode

The command 's Anima Mundi sounds great. But, it takes long time to finish for a fasta file with about 170 MB. Is it usual?

ADD REPLY
0
Entering edit mode

The previous script was not designed to be particularly fast. How much did it take to run it on your machine? Did you manage? This one anyhow should be faster (though it could be made even faster!), hope it helps:

headers = []
for j in open('header_list.txt'):
    if len(j) > 2:
        headers.append(j.replace('\n', ''))

found = False
for line in open('foo.fa'):
    if '>' in line and not found:
        for item in headers:
            if item in line:
                print line,
                found = True
                break
    elif '>' in line and found:
        for item in headers:
            if item in line:
                print line,
                found = True
                break
            else:
                found = False
    elif '>' not in line and found:
        print line,
ADD REPLY
0
Entering edit mode

Actually, I tested it on just 4 sequences to see if works or not. it run on my original file for 2-3 hours and it has not still finished. it seems unusual. I try new script, thanks

ADD REPLY
0
Entering edit mode

Unfortunately, this script did not work at all. I executed it directly on my original file and after about 40-50 min, an empty output file was appeared, without any error!

ADD REPLY
0
Entering edit mode

It is working for me for the sequences and headers that you posted (now I added a few fake ones and it still works). What happens if you try to run it on a few sequences? What do you get with grep -c '^>' foo.fa?

ADD REPLY
0
Entering edit mode

It works fine for two example sequences that I posted, but it did not work on my original file composed of about 130000 sequences. output of the command was 0 for the original file, strange!

ADD REPLY
0
Entering edit mode

Ok, I see. But next time please provide more information at the beginning (on the input files but also about your machine), as the size of your files is a fundamental part of the problem here ;).

Maybe this solution will work for you:

a) add a > after the last line of your foo.fa

b) grep '^>' header_list.txt > header_list_clean.txt

c) launch this Python script and output on a file called out.fa:

output = {}
curr_head = ''
curr_seq = ''
for line in open('foo.fa'):
    if '>' in line:
        output[curr_head] = curr_seq
        curr_head = ''
        curr_seq = ''
        curr_head += line.replace('\n', '')
    elif '\n' != line:
        curr_seq += line.replace('\n', '')

for keys,values in output.items():
    print(keys) + '\t' + (values)

d) as 5heikki suggests, grep -w -f header_list_clean.txt out.fa > out_desired.fa

e) refine your out_desired.fa with this Python script:

for line in open ('out_desired.fa'):
    print line.replace('\t', '\n'),
ADD REPLY
1
Entering edit mode

Thanks friend. I finally clean fasta header so only first column of header reminded using awk '{print $1}' file.fa > output.fa, then extract desired sequence using the following python script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
    %program <input_file> <wanted_file> <output_file> """

import sys
import re

try:
    from Bio import SeqIO
except:
    print "This program requires the Biopython library"
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print __doc__
    sys.exit(0)

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        name = seq.id
        if name in wanted and len(seq.seq.tostring()) > 0:
            wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")

It was so easy and fast, just took a few seconds.

ADD REPLY
0
Entering edit mode

All's well that ends well, bye :)

ADD REPLY
0
Entering edit mode
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
    %program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
    from Bio import SeqIO
except:
    print "This program requires the Biopython library"
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print __doc__
    sys.exit(0)

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        name = seq.id
        if name in wanted and len(str(name)) > 0:
            wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")

I changed this line:

if name in wanted and len(seq.seq.tostring()) > 0:

to:

if name in wanted and len(str(name)) > 0:

because

/usr/local/lib/python2.7/dist-packages/Bio/Seq.py:341: BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().
ADD REPLY
1
Entering edit mode
8.8 years ago
venu 7.1k
grep '^>' original.faa > headers.txt
grep -Fwf your_headers.txt headers.txt > final_headers.txt
./faSomeRecords original.faa final_headers.txt output.faa

Download faSomeRecords here. Change mode and use it.

NOTE: Remove '>' from headers file while using faSomeRecords

ADD COMMENT
0
Entering edit mode
8.8 years ago
h.mon 35k

Several answers from this post should work with little or no modification for you.

ADD COMMENT
0
Entering edit mode
8.8 years ago
nterhoeven ▴ 120

One of my colleagues wrote this tool, it does what you need. Just run

./SeqFilter --help

to get an overview of the command line options

ADD COMMENT
0
Entering edit mode
8.8 years ago

To extract sequence form a big list of file you can use faSomeRecords. Detailed instruction here.

ADD COMMENT

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6