Rename FASTA files according to FASTA file header
6.2 years ago
cerulean • 0

In a typical FASTA file, how can the header be used as the filename (i.e., replace the current file name with the header ID)?

I have many such FASTA files. I have been scouring the internet for a simple script that I can run on Linux to do this, but to no avail. I am not well-versed in programming or any computational language, which is why this task is proving to be quite an obstacle for me. Please help!

sequence • 6.0k views
ADD COMMENT

What does cat *.fasta | grep -e '>' | head -n 10 give you? It would be helpful to see the structure of the sequence headers, in order to provide a proper solution.

ADD REPLY

> AAA64362/A/Japan/305+/1957

> AAA64363/A/RI/5-/1957

> AAA64364/A/Japan/305-/1957

...and so on

ADD REPLY

Ok, so how do you want the files named according to these headers?

ADD REPLY

The full header as the filename, with underscores as separators, would be ideal.

ADD REPLY

There are no underscores in the header. Do you want the slashes (/) replaced with underscores?

ADD REPLY

I agree, using forward slashes in filenames is not the best idea.

ADD REPLY
6.2 years ago

If you extend it a little, you should be safe:

mv seq.fasta "$(head -n 1 seq.fasta | cut -f1 -d ' ' | tr -d '>').fasta"

This also truncates the header at the first space.

And if you want to run it on a bunch of files:

for i in *.fasta; do
    # quote the variables so file names and headers with spaces survive
    mv -- "$i" "$(head -n 1 "$i" | cut -f1 -d ' ' | tr -d '>').fasta"
done
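
For anyone who prefers to preview the renames before touching any files, the loop above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name `rename_by_header` and its `dry_run` flag are made up for illustration, and replacing `/` with `_` is borrowed from the comments above rather than from the shell one-liner.

```python
from pathlib import Path


def rename_by_header(directory=".", pattern="*.fasta", dry_run=True):
    """Rename each matching file after the first word of its first header.

    Hypothetical helper: returns the (old name, new name) pairs it would
    rename, and only renames when dry_run is False.
    """
    planned = []
    claimed = set()
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            first_line = handle.readline().strip()
        if not first_line.startswith(">"):
            continue  # first line is not a FASTA header, leave the file alone
        words = first_line[1:].split()
        if not words:
            continue  # empty header, nothing to name the file after
        # Keep the ID up to the first space; '/' cannot be part of a file name.
        target = path.with_name(words[0].replace("/", "_") + path.suffix)
        if target == path or target.exists() or target in claimed:
            continue  # already named correctly, or the name is taken
        claimed.add(target)
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned


if __name__ == "__main__":
    # Preview only; pass dry_run=False to actually rename.
    for old, new in rename_by_header("."):
        print(old, "->", new)
```

Unlike the `mv` loop, this skips a file instead of overwriting when two files share the same header ID.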
ADD COMMENT

I still wouldn't advise this, because it doesn't catch all the potential special characters. You've got to deal with ampersands, pipe symbols, colons, and all sorts of other things (which is why I didn't bother extending it).

If you know your sequence headers very well, then you might be OK!
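
One way to sidestep the special-character problem is to use an allow-list: keep only characters known to be harmless in file names and replace everything else. A minimal Python sketch, assuming that letters, digits, `.`, `_`, `+` and `-` are the characters worth keeping (the function name `safe_filename` and the `unnamed` fallback are illustrative choices, not an established convention):

```python
import re


def safe_filename(header_line, extension=".fasta"):
    """Build a file name from a FASTA header using an allow-list of characters."""
    # Drop the leading '>' and surrounding whitespace.
    name = header_line.strip().lstrip(">").strip()
    # Replace each run of characters outside the allow-list with one underscore.
    name = re.sub(r"[^A-Za-z0-9._+-]+", "_", name).strip("_")
    # Fall back to a placeholder if nothing usable is left.
    return (name or "unnamed") + extension


print(safe_filename("> AAA64362/A/Japan/305+/1957"))
print(safe_filename(">sp|P12345|ABC_HUMAN some protein & more"))
```

The trade-off is that two different headers can collapse to the same file name, so it is worth checking for clashes before renaming.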

ADD REPLY

true.

But let's assume people are becoming aware of the fact that they should not use any special characters in their FASTA header IDs ;-). OK for the pipe symbol, but then just add

 | cut -f1 -d '|'

to it. I think it will nonetheless be much faster and more efficient than processing the files with Python or similar.

ADD REPLY

This provides the perfect output! Thanks!

I was able to set the special character as '/'.

ADD REPLY

Thanks, this really did it!

Is there an online compendium or collection of useful commands like this, using awk, sed, grep, etc., that would help me become a better bioinformatician? I can of course Google, but any specific suggestion or advice would speed up my search!

ADD REPLY

Not really, but I keep a personal list of reminders here https://github.com/jrjhealey/bioinfo-tools

I steal them from around the internet when I spot them.

ADD REPLY

Thank you so much!!!

ADD REPLY

Same here: I collect them on our lab's wiki page as I come across them.

Nice page of useful one-liners, jrj.healey, thanks.

ADD REPLY

You're welcome, though I can't take that much credit as they're mostly if not entirely stolen from others XD

ADD REPLY
6.2 years ago
Joe 21k

A really easy way to do it (for single-sequence files, not multi-FASTAs, unless you want the multi-FASTA named after its first entry) would be:

mv seq.fasta "$(head -n 1 seq.fasta).fasta"

I wouldn't advise it though: having the > in the file name can cause issues, and any spaces or special characters will be a bit of a nightmare.
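
Since naming a multi-FASTA after its first entry may not be what you want, it can help to check how many records a file holds before renaming it. A small sketch in Python; the function name `count_fasta_records` is made up for illustration, and it simply counts lines that start with `>`:

```python
def count_fasta_records(path):
    """Count header lines (lines starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A file for which this returns 1 is a single-sequence FASTA and can safely take its name from the header; anything larger is a multi-FASTA.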

ADD COMMENT
6.2 years ago
Sej Modha 5.3k

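A Python 3 solution. It assumes that all FASTA files are in the current working directory, and it writes each record to a new file named after its ID rather than renaming the original file.

#!/usr/bin/env python3
import os
from Bio import SeqIO

# Write every record found in the FASTA files of the current directory
# to its own file, named after the record ID.
for file in os.listdir(os.getcwd()):
    # match on the extension rather than on '.fa' anywhere in the name
    if not file.endswith(('.fa', '.fasta')):
        continue
    # read all records first, in case an output name equals the input name
    records = list(SeqIO.parse(file, 'fasta'))
    for record in records:
        header = record.id
        # '/' is not allowed in a file name, so swap it for '_'
        outfasta = header.replace('/', '_') + '.fa'
        # 'with' closes every output file, not only the last one
        with open(outfasta, 'w') as outfile:
            outfile.write('>' + header + '\n' + str(record.seq) + '\n')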
ADD COMMENT
6.2 years ago
Anima Mundi ★ 2.9k

Hello,

Assuming you are working in a UNIX environment, open a terminal window and type nano foo.py

Paste there the following:

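out = None
with open('seq.fasta') as fasta:
    for line in fasta:
        if line.startswith('>'):
            # new record: close the previous output file and open the next one
            if out:
                out.close()
            # '/' cannot appear in a file name, so replace it with '_'
            name = line[1:].strip().replace(' ', '').replace('/', '_')
            out = open(name + '.fasta', 'w')
            out.write(line)  # keep the header in the output file
        elif line.strip() and out:
            out.write(line)  # keep every sequence line, not only the last
if out:
    out.close()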

Press CTRL + O, then Enter to confirm the file name, then CTRL + X

Type pwd in terminal. Copy all your FASTA files and put them in that folder.

Type cat *.fasta >> seq.fasta, and finally python foo.py (the script runs under both Python 2.7 and Python 3)

Hope this helps!

ADD COMMENT
6.2 years ago

You can use pyfaidx for this:

pip install pyfaidx
faidx -x input.fasta

See here for detailed usage.

ADD COMMENT
