Modify FASTA headers
6.4 years ago

Hello! I have a FASTA file and I need a script that reads in the file, changes all the headers to a new format, and writes all the sequences to a new output file. For each sequence, the modified header should contain the species name (with "_" instead of the space), a space character, and the identifier in square brackets with "ALKBH1:" in front of it. Overall, the headers should look like this:

>Homo_sapiens [ALKBH1:NP_001192039]

If any of you can write a script for me with a short description of each line, I would really appreciate it. I am a beginner in this field and I need some help. Thank you all in advance!

FASTA Python script headers • 2.8k views

You could (you should, in fact) start by trying something and ask for help once you get stuck. Then describe what you tried (show the code) and include some relevant input and output. This question has been asked many times here and elsewhere; you can get started by reading these threads:

Question: Renaming Entries In A Fasta File

Question: How To Rename FASTA Headers

Question: Editing header of a fasta file

how to rename fasta file headers using sed

Modifying FASTA headers with Unix command line tools

Any script to parse fasta headers?


An example of the input is needed.


Each sequence in the input file looks like this:

>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
HKHGCLFRDLVRIQGKDLLTPVSRILIGNPGCTYKYLNTRLFTVPWPVKGSNIKHTEAEIAAACETFLKLND
YLQIETIQALEELAAKEKANEDAVPLCMSADFPRVGMGSSYNGQDEVDIKSRAAYNVTLLNFMDPQKMPYLK
EEPYFGMGKMAVSWHHDENLVDRSAVAVYSYSCEGPEEESEDDSHLEGRDPDIWHVGFKISWDIETPGLAIP
LHQGDCYFMLDDLNATHQHCVLAGSQPRFSSTHRVAECSTGTLDYILQRCQLALQNVCDDVDNDDVSLKSFE
PAVLKQGEEIHNEVEFEWLRQFWFQGNRYRKCTDWWCQPMAQLEALWKKMEGVTNAVLHEVKREGLPVEQRN
EILTAILASLTARQNLRREWHARCQSRIARTLPADQKPECRPYWEKDDASMPLPFDLTDIVSELRGQLLEAK
P

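A few assumptions: the species is Homo sapiens (Hsa) in all the sequences (the pattern matches a 4-letter genus and a 7-letter species name), and the protein ID is always 6 characters long (e.g. ALKBH1). It should work with a multi-FASTA file. Code: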

$ sed  '/^>/ s/.*\(NP.*\)|.*protein\s\(.\{6\}\)\s\[\(.\{4\}\)\s\(.\{7\}\).*/>\3_\4 \[\2: \1\]/g' test.fa

output:

>Homo_sapiens [ALKBH1: NP_001073901.1]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL

input:

$ cat test.fa 
>gi|122937263|ref|NP_001073901.1|/1-505 alkB1 DNA repair protein ALKBH1 [Homo sapiens]
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
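
Since the question asked for a Python script, here is a minimal sketch of the same idea that does not depend on the length of the species name. It assumes every header follows the layout of the example above (a `|ref|ACCESSION|` field and the species in square brackets at the end of the line). The function names `reformat_header` and `rename_fasta` and the default `gene="ALKBH1"` are made up for illustration. It also drops the version suffix (`.1`) from the accession, because the requested format shows none; remove that line if you want to keep it.

```python
import re

# Assumed header layout (taken from the example in this thread):
# >gi|122937263|ref|NP_001073901.1|/1-505 description [Genus species]
# Group 1 captures the accession after "|ref|", group 2 the text inside
# the last pair of square brackets on the line.
HEADER_RE = re.compile(r"^>.*\|ref\|([^|]+)\|.*\[([^\]]+)\]\s*$")


def reformat_header(line, gene="ALKBH1"):
    """Return the header rewritten as '>Genus_species [gene:accession]'.

    Headers that do not match the assumed layout are returned unchanged
    (without the trailing newline) so that no record is lost.
    """
    match = HEADER_RE.match(line)
    if match is None:
        return line.rstrip("\n")
    accession, species = match.groups()
    accession = accession.split(".")[0]   # assumption: drop the ".1" version
    species = "_".join(species.split())   # "Homo sapiens" -> "Homo_sapiens"
    return f">{species} [{gene}:{accession}]"


def rename_fasta(in_path, out_path, gene="ALKBH1"):
    """Copy in_path to out_path, rewriting only the lines that start with '>'."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(reformat_header(line, gene) + "\n")
            else:
                fout.write(line)  # sequence lines are copied as they are
```

To use it, call for example `rename_fasta("test.fa", "renamed.fa")`; the output file name here is just a placeholder. Because the species is taken from whatever stands between the last `[` and `]`, this should also cope with species other than Homo sapiens, as long as the header layout is the same.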