Question

Rename FASTA files according to FASTA file header

0

6.2 years ago

cerulean • 0

In a typical FASTA file, how can the header be used as its filename (i.e., replace the current file name with the header ID)?

I have multiple such FASTA files. I have been scouring the internet for a simple script that I can use on Linux to do this, but to no avail. I am not well-versed in programming or any computational language for that matter, which is why this task is proving to be quite an obstacle for me! Please help!

sequence • 6.0k views

ADD COMMENT • link updated 6.2 years ago by Matt Shirley 10k • written 6.2 years ago by cerulean • 0

2

What does cat *.fasta | grep -e '>' | head -n 10 give you? It would be helpful to see the structure of the sequence headers, in order to provide a proper solution.

ADD REPLY • link 6.2 years ago by st.ph.n ★ 2.7k

0

> AAA64362/A/Japan/305+/1957

> AAA64363/A/RI/5-/1957

> AAA64364/A/Japan/305-/1957

...and so on

ADD REPLY • link 6.2 years ago by cerulean • 0

0

Ok, so how do you want the files named according to these headers?

ADD REPLY • link 6.2 years ago by st.ph.n ★ 2.7k

0

The full header name as the filename with underscore as separator will be ideal.

ADD REPLY • link 6.2 years ago by cerulean • 0

0

There are no underscores in the header. Do you want the slashes (/) to be replaced with underscores?

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

1

I agree, using forward slash in filenames is not the best idea

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

1

6.2 years ago

Joe 21k

A really easy way to do it would be (for single, not multifastas - unless you want the multifasta named according to the first entry):

mv seq.fasta "$(head -1 seq.fasta).fasta"

I wouldn't advise it though, as having the > in the file name can cause issues and any spaces or special characters etc will be a bit of a nightmare.

ADD COMMENT • link 6.2 years ago by Joe 21k

0

6.2 years ago

Sej Modha 5.3k

A Python3 solution: It assumes that all fasta files are in the present working directory.

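#!/usr/bin/env python3
import os
from Bio import SeqIO

# Write every record found in the FASTA files of the current directory
# to its own file, named after the record ID.
for file in os.listdir(os.getcwd()):
    if file.endswith(('.fa', '.fasta')):
        # Read all records first, in case an output name matches the input name.
        for record in list(SeqIO.parse(file, 'fasta')):
            # '/' cannot be part of a filename, so it is replaced with '_'.
            outfasta = record.id.replace('/', '_') + '.fa'
            # 'with' closes each output file as soon as it has been written.
            with open(outfasta, 'w') as outfile:
                outfile.write('>' + record.id + '\n' + str(record.seq) + '\n')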

ADD COMMENT • link 6.2 years ago by Sej Modha 5.3k

0

6.2 years ago

Anima Mundi ★ 2.9k

Hello,

assuming you are working in a UNIX environment, open a terminal window and type nano foo.py

Paste there the following:

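out = None
for line in open('seq.fasta'):
    if line.startswith('>'):
        # New record: close the previous file, then open one named after this header.
        if out:
            out.close()
        # Spaces are dropped and '/' is replaced, since '/' cannot be in a filename.
        name = line[1:].strip().replace(' ', '').replace('/', '_')
        out = open(name + '.fasta', 'w')
        out.write(line)
    elif out and line.strip():
        # Sequence line: keep writing to the file of the current record.
        out.write(line)
if out:
    out.close()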

Press CTRL + O, then CTRL + X

Type pwd in terminal. Copy all your FASTA files and put them in that folder.

Type cat *.fasta >> seq.fasta, and finally (requires Python 2.7) python foo.py

Hope this helps!

ADD COMMENT • link 6.2 years ago by Anima Mundi ★ 2.9k

0

6.2 years ago

Matt Shirley 10k

You can use pyfaidx for this:

pip install pyfaidx
faidx -x input.fasta

See the pyfaidx documentation for detailed usage.

ADD COMMENT • link 6.2 years ago by Matt Shirley 10k

score 3 · Accepted Answer · 2018-02-07

3

6.2 years ago

lieven.sterck 15k

If you extend it a little you should be safe:

mv seq.fasta "$(head -1 seq.fasta | cut -f1 -d ' ' | tr -d '>').fasta"

This will also take the header only up to the first space.

And if you want to run it on a bunch of files:

for i in *.fasta; do
  mv "$i" "$(head -1 "$i" | cut -f1 -d ' ' | tr -d '>').fasta"
done
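For anyone who prefers Python, the loop above can be sketched roughly as follows. This is only an illustration, not a drop-in replacement: the function name `rename_by_header` and its `dry_run` flag are made up for this example. It assumes one record per file, strips whitespace after the `>`, replaces `/` with `_` (a slash cannot be part of a filename), and skips any rename that would overwrite an existing file.

```python
from pathlib import Path


def rename_by_header(directory: str = ".", dry_run: bool = True) -> list:
    """Rename each *.fasta file after the first word of its first header line.

    Hypothetical helper. Returns the (old name, new name) pairs it would
    rename, or did rename when dry_run is False.
    """
    planned = []
    seen = set()
    for path in sorted(Path(directory).glob("*.fasta")):
        with path.open() as handle:
            first_line = handle.readline().strip()
        if not first_line.startswith(">"):
            continue  # no FASTA header on the first line; leave the file alone
        words = first_line[1:].split()
        if not words:
            continue  # empty header
        # '/' cannot appear in a filename, so swap it for an underscore.
        target = path.with_name(words[0].replace("/", "_") + ".fasta")
        if target == path or target.exists() or target in seen:
            continue  # already named correctly, or the new name is taken
        seen.add(target)
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned


if __name__ == "__main__":
    # Dry run on the current directory: only prints what would be renamed.
    for old, new in rename_by_header("."):
        print(old, "->", new)
```

Calling `rename_by_header(".", dry_run=False)` would perform the renames; running the dry run first makes it easy to spot surprising names before anything is moved.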

ADD COMMENT • link 6.2 years ago by lieven.sterck 15k

1

I still wouldn't advise this, because it doesn't catch all the potential special characters. You've got to deal with ampersands, pipe symbols, colons, and all sorts of other things. (Which is why I didn't bother extending).

If you know your sequence headers very well, then you might be OK!
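If the headers are not known that well, one option is to whitelist characters instead of chasing each troublesome one. A minimal sketch in Python, assuming that letters, digits, `.`, `_`, `+` and `-` are the only characters worth keeping; `safe_name` is a made-up helper name, not part of any library.

```python
import re


def safe_name(header: str) -> str:
    """Turn a FASTA header line into a filename-safe string (hypothetical helper)."""
    # Drop the leading '>' and any surrounding whitespace.
    text = header.strip().lstrip(">").strip()
    # Replace every run of characters outside the whitelist with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._+-]+", "_", text)
    # Trim underscores at the ends and never return an empty name.
    return cleaned.strip("_") or "unnamed"


print(safe_name("> AAA64362/A/Japan/305+/1957"))
# AAA64362_A_Japan_305+_1957
print(safe_name(">sp|P12345|NAME_HUMAN some description & more"))
# sp_P12345_NAME_HUMAN_some_description_more
```

Two different headers can collapse to the same cleaned name, so it is worth checking for clashes before renaming anything.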

ADD REPLY • link 6.2 years ago by Joe 21k

1

true.

But let's assume people are becoming aware of the fact that they should not use any special characters in their FASTA header IDs ;-) . OK for the pipe symbol, but then just add

 | cut -f1 -d '|'

to it. I think it will nonetheless be much faster and more efficient than processing the files with Python or something else.
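For comparison, the same "keep everything up to the first delimiter" idea looks roughly like this in Python. It is a sketch only: `first_field` and its `delimiters` parameter are invented for the example, and `delimiters` must not be empty. Unlike the `cut` pipeline, it also strips whitespace directly after the `>`.

```python
import re


def first_field(header: str, delimiters: str = " |") -> str:
    """Return the header text up to the first delimiter (hypothetical helper)."""
    # Drop the leading '>' and any surrounding whitespace.
    text = header.strip().lstrip(">").strip()
    # re.escape protects delimiters that are regex metacharacters, such as '|'.
    pattern = "[" + re.escape(delimiters) + "]"
    return re.split(pattern, text, maxsplit=1)[0]


print(first_field(">seq1 some description"))
# seq1
print(first_field("> AAA64362/A/Japan/305+/1957", "/"))
# AAA64362
```

Note that for headers that start with a database tag, such as `>sp|P12345|NAME_HUMAN`, the first field is only `sp`, so a later field may be the more useful ID there.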

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

0

This provides the perfect output! Thanks!

I was able to set the special character as '/'.

ADD REPLY • link 6.2 years ago by cerulean • 0

0

Thanks, this really did it!

Is there an online compendium or some collection of such useful commands using awk, sed, grep etc that will help me become a better bioinformatician? I can of course Google! But any specific suggestion/advice will expedite my search!

ADD REPLY • link 6.2 years ago by cerulean • 0

2

Not really, but I keep a personal list of reminders here https://github.com/jrjhealey/bioinfo-tools

I steal them from around the internet when I spot them.

ADD REPLY • link 6.2 years ago by Joe 21k

0

Thank you so much!!!

ADD REPLY • link 6.2 years ago by cerulean • 0

0

Same here, I use a similar approach: I collect them on our lab's wiki page as I come across them.

Nice page of useful one-liners, jrj.healey, thx.

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

0

You're welcome, though I can't take that much credit as they're mostly if not entirely stolen from others XD

ADD REPLY • link 6.2 years ago by Joe 21k