6.4 years ago
mpbiology.dna
Hello! I have a FASTA file and I need a script that reads in the file, changes all the headers to a new format, and writes all the sequences to a new output file. The modified headers should contain, for each sequence, the species name (with "_" in place of spaces), a space character, and the identifier in square brackets with "ALKBH1:" in front of it. Overall, the headers should look like this:
> Homo_sapiens [ALKBH1:NP_001192039]
If any of you can generate a script for me with some line descriptions I would really appreciate it. I am a beginner in this field and I need some help. Thank you all in advance!
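For reference, the renaming described above can be sketched in Python roughly as follows. This is only a minimal sketch under assumptions, because the thread does not show the input format: it assumes NCBI-style headers such as `>NP_001192039.1 description [Homo sapiens]`, with the accession first and the species in square brackets at the end. The names `reformat_header` and `rename_fasta` are hypothetical. The version suffix (`.1`) is dropped to match the requested header, and `>` is written directly before the species name.

```python
import re
import sys

# ASSUMPTION: input headers follow the NCBI protein layout, which the
# thread does not show, e.g.
#   >NP_001192039.1 some description [Homo sapiens]
HEADER_RE = re.compile(r"^>(?P<acc>\S+).*\[(?P<species>[^\[\]]+)\]\s*$")


def reformat_header(line, gene="ALKBH1"):
    """Return the header in the form '>Genus_species [GENE:accession]'."""
    match = HEADER_RE.match(line)
    if match is None:
        # Headers that do not fit the assumed layout are left unchanged.
        return line.rstrip("\n")
    # Drop a version suffix such as '.1' from the accession.
    accession = match.group("acc").split(".")[0]
    # Replace whitespace in the species name with underscores.
    species = "_".join(match.group("species").split())
    return f">{species} [{gene}:{accession}]"


def rename_fasta(in_path, out_path, gene="ALKBH1"):
    """Copy a FASTA file, rewriting header lines and keeping sequence lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: write the new header.
                dst.write(reformat_header(line, gene) + "\n")
            else:
                # Sequence line: copy as-is.
                dst.write(line)


if __name__ == "__main__":
    # Usage: python rename_headers.py input.fasta output.fasta
    if len(sys.argv) == 3:
        rename_fasta(sys.argv[1], sys.argv[2])
```

The script name `rename_headers.py` in the usage comment is hypothetical. Headers that do not match the assumed layout are copied unchanged, so they are easy to spot in the output file and the pattern can be adjusted to the real input.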
You could (you should, in fact) start by trying something and ask for help after getting stuck. Then describe what you tried (show the code), and show some relevant input and output. This question has been asked many times here and elsewhere; you can get started by reading these threads:
Question: Renaming Entries In A Fasta File
Question: How To Rename FASTA Headers
Question: Editing header of a fasta file
how to rename fasta file headers using sed
Modifying FASTA headers with Unix command line tools
Any script to parse fasta headers?
An example of the input is needed.
Each sequence in the input file looks like this:
A few assumptions: the species is Hsa in all the sequences and the protein ID is always 6 letters (e.g. ALKBH1). This should work with a multi-FASTA file. Code:
output:
MKRTPTAEEREREAKKLRLLEELEDTWLPYLTPKDDEFYQQWQLKYPKLILREASSVSEELHKEVQEAFLTL
input: