Command or Perl script for changing headers of multiple FASTA files in a specific order listed in a txt file
6.6 years ago

Assalam o alaikum everyone,

I am working with multiple genes. Each gene folder holds 70-75 FASTA files, and each FASTA file contains a single gene sequence, e.g.

AMY2b_Gene_folder

Chimpanzee_AMY2B_CDS.fasta
Human_AMY2B_CDS.fasta
Pygmy_chimpanzee_AMY2B_CDS.fasta
Western_gorrila_AMY2B_CDS.fasta

cat Chimpanzee_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Human_AMY2B_CDS.fasta

>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I want to replace the header of each FASTA file with a new one, taken in order from a list in a text file.

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

The output should look like this:

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

I have tried the Perl scripts given in the following Biostars posts, but they did not work for multiple FASTA files that each hold a single gene sequence.

Renaming Entries In A Fasta File

Renaming fasta headers according to a matching name list

Could you please tell me whether there is a command-line solution for this?

editing FASTA headers text processing command line • 5.2k views

Without a rule mapping the names in Headers.txt to the FASTA files, we can't rename them correctly.

Chimpanzee_AMY2B_CDS.fasta                 MP.C_AMY2B
Human_AMY2B_CDS.fasta                      FP.H_AMY2B
Pygmy_chimpanzee_AMY2B_CDS.fasta           ????
Western_gorrila_AMY2B_CDS.fasta            ?????
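
One way to make the mapping explicit is to write it down as a small two-column file instead of relying on file order. A minimal sketch, assuming a hypothetical tab-separated file map.tsv (FASTA file name, then new header) and a hypothetical function name rename_by_map; neither appears in the thread. Each listed file is copied into an output directory with its first line replaced:

```shell
# rename_by_map MAP OUTDIR
# MAP is a hypothetical two-column, tab-separated file: FASTA file name, new header.
# Every listed FASTA file is copied into OUTDIR with its first line replaced by ">header".
rename_by_map() {
    map=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # the "|| [ -n ... ]" part keeps a final row that lacks a trailing newline
    while IFS="$(printf '\t')" read -r fasta header || [ -n "$fasta" ]; do
        # skip blank rows and rows whose header has not been decided yet
        [ -n "$fasta" ] && [ -n "$header" ] || continue
        if [ ! -f "$fasta" ]; then
            echo "skipping: $fasta not found" >&2
            continue
        fi
        {
            printf '>%s\n' "$header"
            # everything except the original header line
            tail -n +2 "$fasta"
        } > "$outdir/$(basename "$fasta")"
    done < "$map"
}
```

For example, rename_by_map map.tsv renamed would write renamed/Chimpanzee_AMY2B_CDS.fasta and so on, leaving the originals untouched. Rows with an empty header column are skipped, so the files whose new names are still unknown can stay in the list until the mapping is settled.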
6.6 years ago

Run this command:

$ for i in $(sed -n '=' headers.txt); do sed -n "$i"p headers.txt| sed 's/^/>/'; cat $(ls *.fasta| sed -n "$i"p)| sed '1d'; done

output:

>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>FP.H_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Assumptions

  1. The FASTA files (as listed by ls *.fasta) are in the same order as the replacement headers in headers.txt
  2. User default shell is bash

Note: The output is printed to the screen. You can redirect it to another FASTA file.

Input:

$ cat headers.txt 
MP.C_AMY2B
FP.H_AMY2B

$ cat *.fasta
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
>lcl|NM_020978.4_cds_NP_066188.1_1 [gene=AMY2B] [protein=alpha-amylase 2B precursor] [protein_id=NP_066188.1] [location=673..2208]
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC
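
The same position-based pairing can also be done in a single awk pass, without calling ls and sed once per file. This is only a sketch under the assumptions listed above (files sorted by name line up with headers.txt, one record per file); the function name relabel_in_order is hypothetical:

```shell
# relabel_in_order HEADERS FILE...
# Print every FILE with its first line replaced by ">" plus the matching
# line of HEADERS (1st file gets line 1, 2nd file gets line 2, ...).
relabel_in_order() {
    headers=$1
    shift
    awk '
        NR == FNR { name[NR] = $0; next }        # first input: remember the new headers
        FNR == 1  { print ">" name[++i]; next }  # first line of each FASTA file: swap it
        { print }                                # sequence lines pass through unchanged
    ' "$headers" "$@"
}
```

A call such as relabel_in_order headers.txt *.fasta > renamed.fasta would collect everything in one file. The shell expands *.fasta in sorted order, which normally matches what ls prints, but it is worth checking the order by eye before trusting the result.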

Thank you so much, your solution is useful. I want to redirect the output into multiple FASTA files, one per input file.


Create a folder called test in the same folder where the FASTA files are located:

$ mkdir test

execute:

for i in $(seq 1 $(ls *.fasta |wc -l)); do sed -n "$i"p headers.txt| sed 's/^/>/'> test/$(ls *.fasta| sed -n "$i"p); cat $(ls *.fasta| sed -n "$i"p)| sed '1d' >>test/$(ls *.fasta| sed -n "$i"p); done

The files in the test folder will have exactly the same names as the FASTA files in the current directory. Let me know if there are any issues with the script.
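
Because the pairing is purely by position, a mismatch between the number of headers and the number of files goes unnoticed: as far as I can tell, a file beyond the last header line would be written with no header at all. A small pre-flight check can catch that before anything is written. This is a sketch with a hypothetical function name, check_pairing:

```shell
# check_pairing HEADERS FILE...
# Succeed only if HEADERS has one non-empty line per FILE and every FILE
# holds exactly one ">" record, so that position-based renaming is safe.
check_pairing() {
    headers=$1
    shift
    # grep -c prints the count even when it is 0 (and then exits non-zero)
    n_headers=$(grep -c . "$headers" || true)
    if [ "$n_headers" -ne "$#" ]; then
        echo "$n_headers headers but $# FASTA files" >&2
        return 1
    fi
    for fasta in "$@"; do
        n_records=$(grep -c '^>' "$fasta" || true)
        if [ "$n_records" -ne 1 ]; then
            echo "$fasta has $n_records records, expected 1" >&2
            return 1
        fi
    done
}
```

It could be chained in front of the loop, e.g. check_pairing headers.txt *.fasta && followed by the renaming command, so that nothing is written when the counts disagree.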


Thank you so much, it's working. You made my day :)

6.6 years ago
tiago211287 ★ 1.4k

First, create a backup of all files before trying this approach.

Take a list of new headers, namely Headers.txt:

cat Headers.txt
MP.C_AMY2B
FP.H_AMY2B

in the same order as the sorted list of FASTA files:

find . -maxdepth 1 -name "*CDS.fasta" | sort
./Chimpanzee_AMY2B_CDS.fasta
./Human_AMY2B_CDS.fasta

Put the following function in your Linux environment:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

Create two variables: the list of new headers (LIST1) and the list of input files (LIST2):

LIST1=($(cat Headers.txt))
LIST2=($(find /folder/with/fasta/files/ -maxdepth 1 -name "*CDS.fasta" | sort))

parallel --xapply sedinho {1} {2} ::: ${LIST1[@]} ::: ${LIST2[@]}


>MP.C_AMY2B
ATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAG
GACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTT
AGCTCCCAAGGGATTTGGAGGGGTTCAGGTCTCTCCACCAAATGAAAATGTTGCAATTCACAACCCTTTC

Hi tiago211287, thank you for your reply. I tried your solution, but the following error occurred:

zsh:1: command not found: sedinho
zsh:1: command not found: sedinho
zsh:1: command not found: sedinho

To put the function in my Linux environment, I simply copied it into a text file and copied that file into the bin folder, but the same error occurred. Can you tell me how to fix it?


For this to work, you must execute these two lines in your terminal:

function sedinho () { sed -i  "s/^.*\]/>$1/g" $2;}
export -f sedinho

You don't need to put this inside a file.


Yes, I first executed the above lines in my terminal, but the same error occurred. :(


Can you repeat what you're doing and post a screenshot of your screen here?


https://ibb.co/bz2ita

https://ibb.co/dd3v6v

https://ibb.co/e8tnmv


You can't directly attach files to Biostars posts. Use a free image hosting provider (click on the icon next to "101" one in edit window to get some suggestions) to upload your images and then insert those links here.


This is strange to me; it appears that your shell is not exporting the function properly.


This user appears to be using zsh and that may be the difference. Switching to bash may do the trick.
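
The "command not found" lines are consistent with that: export -f is a bash feature, and as far as I know zsh cannot export functions this way. If switching shells is not convenient, a plain loop avoids function export and parallel altogether. The sketch below is an assumption-laden alternative, not the method from the answer: the function name rewrite_headers_inplace is hypothetical, it replaces the first line of each file rather than matching the "]" pattern, and it goes through a temporary file instead of sed -i. Keep the backup recommended in the answer, since the files are overwritten:

```shell
# rewrite_headers_inplace HEADERS FILE...
# Replace the first line of each FILE with ">" plus the matching HEADERS line.
# Only plain POSIX-style constructs are used, so no function has to be exported.
rewrite_headers_inplace() {
    headers=$1
    shift
    for fasta in "$@"; do
        # take the next header line from fd 3; fail if the list is too short
        if ! IFS= read -r name <&3 && [ -z "$name" ]; then
            echo "no header left for $fasta" >&2
            return 1
        fi
        tmp=$(mktemp) || return 1
        # build the relabelled copy first, then overwrite the original with it
        { printf '>%s\n' "$name"; tail -n +2 "$fasta"; } > "$tmp" &&
            cat "$tmp" > "$fasta"
        rm -f "$tmp"
    done 3< "$headers"
}
```

It would be called as rewrite_headers_inplace Headers.txt *CDS.fasta from inside the gene folder, with the same ordering assumption as the answer (sorted file names line up with the lines of Headers.txt).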

6.6 years ago
biolab ★ 1.4k

Please check the following commands. First, run:

$ cd AMY2b_Gene_folder/; mkdir new/

Then, run:

$ for f in *.fasta; do perl -e '$in=$ARGV[0]=~s/(.+?)\.fasta/$1/r; while (<>){if (/>/) {print ">$in\n"} else {print} }' $f > new/$f; done


Hi biolab,

Thank you for your reply. However, my new FASTA headers are in a separate text file, listed in the same order as the FASTA files. The code above replaces each header with the FASTA file name, not with the headers from the headers file. Can you tell me where to use the headers file?

P.S. I'm new to the bioinformatics world and happy to give more information.
