Removing sequences from .fasta files based on name prefix
2
1
Entering edit mode
9.7 years ago
cklinge2 ▴ 10

Hello all,

I have about 900 .fasta files, each comprised of nucleotide alignments for various genes. The files look something like the following (truncated for the sake of illustration):

>fig|395491.5.peg.776
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAACGCATCGTCGCCGGCCTCTAT
TTCGGCGGCATCATCCTCGCCGGCCTGTTCACTTTCGTGCCTGGCCGGATCATGCACGCG
GTCGTCTTCACCGGAGCGGAATGGCCGGCTGCCCTTGCTGCGGCGGTGATG---------
---------------------------------------------------
>fig|395491.28.peg.1692
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAACGCATCGTCGCCGGCCTTTAT
TTCGGCGGCATCGTCCTTGCCGGCCTGTTCACTTTTGTGCCTGGCCGGATCATGCACGCG
GTCGTCTTCACCGGAACGGAATCGCCGGCTGCCCTTGCTGTGGCGGTGATG---------
---------------------------------------------------
>fig|395491.777.peg.2918
ATCGCGGCGGCCCGCCGCGGCGATATCCGTCTGCACAAGCGCATCGTCGCCGGCCTTTAT
TTCGGCGGCATCGTCCTTGCCGGCCTGTTCACTTTCGTGCCTGGCCGGATCATGCATGCG
GTCATCTTCACCGGCGCGGAATGGCCGGCTGTCCTTGCTGCGGCGGTGATCGGCTCGATC
CTGATCGCAATCGCG---CTGCGTCGCCGACGCGGGCGCCTGATTGCCCGA

What I want to do is search through each file and remove the DNA sequence associated with genome "fig|395491.777." Each file, however, has a different ".peg." number added onto the "fig|395491.777" title.

So basically, if the sequence name begins with "fig|395491.777," I want it to be removed. I'm sure there is a relatively straightforward way to complete this task, but it's a mystery to me. I'd appreciate any assistance with this issue.

Thanks in advance!

removesequence filter fasta • 10.0k views
ADD COMMENT
2
Entering edit mode

Have you tried anything yet? People in this forum would appreciate if you tell them that you tried an awk or grep command or some perl script. Look at this and try: http://stackoverflow.com/questions/15848713/how-to-extract-fasta-sequences-in-a-file-which-header-line-matches-with-list-in

ADD REPLY
0
Entering edit mode

I am using a linux server and I am very new to utilizing command line functions and/or scripting, so I tried a few scripts that I found while browsing various pages on this forum (though it seems most of the approaches required an exact [i.e. all characters included] header name for a sequence to be removed). For example, I tried running the following python script (though it doesn't seem to work even if I give it an exact header name):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
remove_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nuc = seq.seq.tostring()
        if nuc not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")

I looked into the thread you suggested, though admittedly I'm not sure how I would adapt that to my particular problem.

ADD REPLY
1
Entering edit mode
  1. Concatenate all your fasta files in one big file (cat *.fasta > big_file.fasta)
  2. Try running: awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' big_file.fasta > big_file.tab (This will give you a tab delimited fasta file with header as the first column and sequence as the second column)
  3. Now grep your "header of interest" or "fig|395491.777." (grep "fig|395491.777." big_file.tab > small_file.tab)
  4. Convert small tab delimited file into fasta again by replacing "\t" with "\n". try "sub" or "sed" commands.

Meanwhile keep working on the script.

ADD REPLY
1
Entering edit mode

I just read your question again and I realized that you want to remove those sequences. You can use grep -v for individual files than concatenating everything into one. You will have to write a loop that goes through all the files. Can be simple done in shell.

ADD REPLY
0
Entering edit mode

You'll find python to be a pretty useful language. The line you want to change is if nuc not in remove and len(nuc) > 0:. That's looking for an exact match. You could instead look to see if nuc starts with remove with if nuc.startswith(remove,1) and len(nuc) > 0 : . This would only work with a single sequence to remove (you could just hard code that in rather than reading it from a file). If you had a list, then just use nuc.find("peg") to find the numeric index to "peg"e; and then subset nuc (then using the if... clause as is).

ADD REPLY
0
Entering edit mode

Biopython is installed on your server?

ADD REPLY
1
Entering edit mode

What have you tried? This is doable in pretty much any scripting language.

ADD REPLY
0
Entering edit mode

As shown in my response to Ashutosh, I tried running a python script but it doesn't seem to work. I also tried to utilize the Perl code found in this thread (but I don't fully understand how I might use it for my particular question).

ADD REPLY
2
Entering edit mode
9.7 years ago

In general it is always best to think in terms of what you want to keep rather than what to remove. Then a possible solution with bioawk would be:

for example

$ cat fasta.fa
>A1
ATG
>B1
CTG
>A2
AAA

ialbert@apollo ~
$ cat fasta.fa | bioawk -c fastx ' $name ~ /^A/ { print ">" $name "\n" $seq }'
>A1
ATG
>A2
AAA
ADD COMMENT
0
Entering edit mode

This seems like a great solution. How might I adapt this to keep a subset of sequences with variable names? For instance, if I wanted to keep all the "A"s, "C"s and "D"s out of the following:

>A1
ATG
>B1
CTG
>C2
AAA
>D1
ATG
ADD REPLY
1
Entering edit mode

You will need to build a regular expression that matches what you want. The | character represents the OR operator. Alternatively you can pass the file through multiple filtering steps and concatenate the output.

cat fasta.fa | bioawk -c fastx ' $name ~ /^A1|^A2/ { print ">" $name "\n" $seq }'

Awk regexp: http://www.math.utah.edu/docs/info/gawk_9.html#SEC90

ADD REPLY
0
Entering edit mode
9.7 years ago
Ram 43k

Extending Istvan's answer and assuming all 900 FASTA files are in the same directory (and they're the only FASTA files in that directory), I'd suggest this:

mkdir filtered
for fa in $(ls *.fa)
do
  cat ${fa} | bioawk -c fastx ' $name ! ~ /fig|395491\.777/ { print ">" $name "\n" $seq } >>./filtered/${fa}
done

Now, you have all the filtered files with matching names in a sub-directory called filtered.

ADD COMMENT
1
Entering edit mode

My advice on matching rather then negating is just a recommendation to reduce cognitive overhead. But if one really needs to negate the match then the operator to do so is !~ as Ram shows it above

ADD REPLY
0
Entering edit mode

I also prefer matches to negation as we're always less sure of what we have (everything, including unforeseen outliers) than what we need. However, the negation is better in cases where you have a lot fewer elements on your do-not-need bag than in your do-need bag and/or you're sure of these do-not-need elements.

ADD REPLY

Login before adding your answer.

Traffic: 2594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6