Rename FASTA files according to FASTA file header
2.8 years ago
cerulean • 0

In a typical FASTA file, how can the header be used as the filename (i.e., replace the current file name with the header ID)?

I have multiple such FASTA files. I have been scouring the internet for a simple script that I can run on Linux to do this, but to no avail. I am not well versed in programming or any computational language, for that matter, which is why this task is proving to be quite an obstacle for me! Please help!

sequence • 1.3k views

What does cat *.fasta | grep -e '>' | head -n 10 give you? It would be helpful to see the structure of the sequence headers, in order to provide a proper solution.


> AAA64362/A/Japan/305+/1957

> AAA64363/A/RI/5-/1957

> AAA64364/A/Japan/305-/1957

...and so on


Ok, so how do you want the files named according to these headers?


Ideally, the full header would be the filename, with underscores as separators.


There are no underscores in the header. Do you want the slashes (/) replaced with underscores?


I agree, using forward slashes in filenames is not the best idea.

8 months ago
VIB, Ghent, Belgium

If you extend it a little you should be safe:

mv seq.fasta "$(head -1 seq.fasta | cut -f1 -d ' ' | tr -d '>').fasta"

This takes the header only up to the first space.

And if you want to run it on a bunch of files:

for i in *.fasta; do
 mv "$i" "$(head -1 "$i" | cut -f1 -d ' ' | tr -d '>').fasta"
done
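
For anyone who would rather preview the renames before touching any files, here is a minimal Python sketch of the same loop. It is only an illustration, not part of the answer above: the function names (first_header_id, plan_renames, apply_renames) and the dry_run flag are made up for this example, and replacing "/" with "_" is an assumption based on what the asker requested.

```python
from pathlib import Path


def first_header_id(path):
    """Return the first whitespace-delimited token of the first line, minus '>'."""
    with open(path) as handle:
        first_line = handle.readline().strip()
    fields = first_line.lstrip(">").split()
    return fields[0] if fields else ""


def plan_renames(directory="."):
    """Map each *.fasta file to a proposed new path without changing anything."""
    plan = {}
    for path in sorted(Path(directory).glob("*.fasta")):
        # Assumption: '/' cannot appear in a filename, so swap it for '_'.
        seq_id = first_header_id(path).replace("/", "_")
        if seq_id:  # skip files with an empty or missing header
            plan[path] = path.with_name(seq_id + ".fasta")
    return plan


def apply_renames(plan, dry_run=True):
    """Print each planned rename; only rename when dry_run is False."""
    for old, new in plan.items():
        print(f"{old} -> {new}")
        if dry_run:
            continue
        if new.exists() and new != old:
            print(f"  skipped: {new} already exists")
            continue
        old.rename(new)


# Usage: preview first, then rerun with dry_run=False once the list looks right.
# apply_renames(plan_renames("."), dry_run=True)
```

The existence check is there because two files with the same first header would otherwise overwrite each other, which the plain mv loop does silently.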

I still wouldn't advise this, because it doesn't catch all the potential special characters. You've got to deal with ampersands, pipe symbols, colons, and all sorts of other things (which is why I didn't bother extending it).

If you know your sequence headers very well, then you might be OK!
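
If you don't know your headers that well, one conservative option is to keep only a whitelist of characters and turn everything else into an underscore. The sketch below is just one way to do that in Python; the function name safe_filename and the particular whitelist (letters, digits, dot, underscore, plus, hyphen) are assumptions for illustration, so adjust them to your data.

```python
import re

# Assumption: anything outside this whitelist is treated as unsafe.
UNSAFE = re.compile(r"[^A-Za-z0-9._+-]+")


def safe_filename(header, extension=".fasta"):
    """Turn a FASTA header line into a conservative filename."""
    # Drop surrounding whitespace and the leading '>' marker.
    text = header.strip().lstrip(">").strip()
    # Collapse each run of unsafe characters into a single underscore.
    cleaned = UNSAFE.sub("_", text).strip("_")
    if not cleaned:
        raise ValueError(f"no usable characters in header: {header!r}")
    return cleaned + extension
```

Note that two different headers can collapse to the same name this way (for example, headers differing only in a pipe versus a colon), so it is worth checking for duplicates before renaming.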


True.

But let's assume people are becoming aware of the fact that they should not use any special characters in their FASTA header IDs ;-). OK for the pipe symbol, but then just add

 | cut -f1 -d '|'

to it. I think it will nonetheless be much faster and more efficient than processing the files with Python or similar.


This provides the perfect output! Thanks!

I was able to set the special character as '/'.


Thanks, this really did it!

Is there an online compendium or some collection of such useful commands using awk, sed, grep etc that will help me become a better bioinformatician? I can of course Google! But any specific suggestion/advice will expedite my search!


Not really, but I keep a personal list of reminders here https://github.com/jrjhealey/bioinfo-tools

I steal them from around the internet when I spot them.


Thank you so much!!!


Same here. I use a similar approach, collecting them on our lab's wiki page as I come across them.

Nice page of useful one-liners, jrj.healey, thx.


You're welcome, though I can't take that much credit as they're mostly if not entirely stolen from others XD

9 months ago
Joe 12k
United Kingdom

A really easy way to do it would be (for single-sequence files, not multi-FASTAs, unless you want the multi-FASTA named according to the first entry):

mv seq.fasta "$(head -1 seq.fasta).fasta"

I wouldn't advise it though, as having the > in the file name can cause issues, and any spaces or special characters will be a bit of a nightmare.
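
Since the one-liner only makes sense for single-sequence files, it can help to check how many records a file holds before renaming it. A minimal Python sketch follows, assuming that every record starts with a line beginning with ">"; the function names count_records and is_single_fasta are made up for this example.

```python
def count_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def is_single_fasta(path):
    """Return True when the file holds exactly one record."""
    return count_records(path) == 1
```

A file for which is_single_fasta returns False would end up named after its first entry only, so you may prefer to split it rather than rename it.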

8 months ago
Sej Modha 4.2k
Glasgow, UK

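A Python 3 solution. It assumes that all FASTA files are in the present working directory, and it requires Biopython.

#!/usr/bin/env python3
import os
from Bio import SeqIO

# Look for FASTA files in the present working directory.
pwd = os.getcwd()

for file in os.listdir(pwd):
    # Only process files with a FASTA-like extension.
    if file.endswith(('.fa', '.fasta')):
        # Read all records first, in case an output name matches the input file.
        myfastalist = list(SeqIO.parse(file, 'fasta'))
        for record in myfastalist:
            # record.id is the header up to the first whitespace.
            header = record.id
            # '/' is not allowed in a filename, so replace it with '_'.
            outfasta = header.replace('/', '_') + '.fa'
            # Write one file per record; 'with' closes each file when done.
            with open(outfasta, 'w') as outfile:
                outfile.write('>' + header + '\n' + str(record.seq) + '\n')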
18 months ago
Anima Mundi ♦ 2.4k
Italy

Hello,

assuming you are working in a UNIX environment, open a terminal window and type nano foo.py

Paste there the following:

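out = None
for line in open('seq.fasta'):
    if line.startswith('>'):
        # New record: close the previous file and open one named after the header.
        if out:
            out.close()
        # Strip '>', spaces and the newline; '/' is not allowed in a filename.
        filename = line.strip().replace('>', '').replace(' ', '').replace('/', '_') + '.fasta'
        out = open(filename, 'w')
        out.write(line)
    elif line.strip() and out:
        # Sequence line: append it to the current record's file.
        out.write(line)
if out:
    out.close()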

Press CTRL + O and then Enter to save, then CTRL + X to exit.

Type pwd in terminal. Copy all your FASTA files and put them in that folder.

Type cat *.fasta >> seq.fasta, and finally (requires Python 2.7) python foo.py

Hope this helps!

18 months ago
Cambridge, MA

You can use pyfaidx for this:

pip install pyfaidx
faidx -x input.fasta

See here for detailed usage.
