Question

how to save your edited headers in original fasta files?

1

Entering edit mode

2.2 years ago

Priya ▴ 20

hi i'am trying to edit the header of my fasta files using seqkit and i have been able to do it but i'm not able to save it! the command i am using to edit multiple fasta files with respect to their filename and doing it with refseq-

for i in $(find -name \genomid); do 
    seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k proid_unique *.faa; 
done

The directory having all my fasta files is like this-
PATH: ~/PANGENOMICS/DATA1/test

FILES in the directory:

GCF_000016305.1_ASM1630v1_protein.faa
GCF_000220485.1_ASM22048v1_protein.faa
GCF_900635735.1_32875_B01_protein.faa
proid_unique
genomid

i am finding filenames using a csv file list- genomid

GCF_900635735.1_32875_B01_protein.faa:WP_151362402.1:WP_151362402.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362403.1:WP_151362403.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362404.1:WP_151362404.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362405.1:WP_151362405.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362406.1:WP_151362406.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362407.1:WP_151362407.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362408.1:WP_151362408.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362409.1:WP_151362409.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362410.1:WP_151362410.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362411.1:WP_151362411.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362412.1:WP_151362412.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362413.1:WP_151362413.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362414.1:WP_151362414.1#0940

the file (proid_unique) i used as key-value file to edit the fasta headers look like this-

WP_151362399.1  WP_151362399.1#0940
WP_151362400.1  WP_151362400.1#0940
WP_151362401.1  WP_151362401.1#0940
WP_151362402.1  WP_151362402.1#0940
WP_151362409.1  WP_151362409.1#0940
WP_151362410.1  WP_151362410.1#0940
WP_151362411.1  WP_151362411.1#0940
WP_151362412.1  WP_151362412.1#0940
WP_151362413.1  WP_151362413.1#0940
WP_151362414.1  WP_151362414.1#0940
WP_094096600.1  WP_094096600.1#0945
WP_016530940.1  WP_016530940.1#0950
WP_000940121.1  WP_000940121.1#0951
WP_012540940.1  WP_012540940.1#0951

example of input-

>WP_151362411.1 YoaH family protein [Klebsiella pneumoniae]
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1 gamma-glutamylcyclotransferase [Klebsiella pneumoniae]
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1 acid resistance repetitive basic protein Asr [Klebsiella pneumoniae]
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1 ABC transporter permease [Klebsiella pneumoniae]
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK

example of output- expected and correct with above command

>WP_151362411.1#0940
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1#0940
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1#0940
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1#0940
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK

i am getting the required/expected result but this editing is not saving with this command, can someone help me figure out that how to save those editing in the original files bcz when i open those files again they were same as before with no edited header?

Python alternative of the above command used would also be helpful

refseq • 1.3k views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by Priya ▴ 20

1

Entering edit mode

do seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k accesskey_idsvalue1 *.faa is incorrect part of your code. You are not using variable i in loop. Post output from for i in $(find -name \cusid_genomid); do echo $i;done (fist few lines are helpful). First run seqkit command with single input and output (without using >) and then within loop, replace input with variable and output with a appropriate second variable name.

ADD REPLY • link 2.2 years ago by cpad0112 21k

0

Entering edit mode

i don't get it , i need seqkit replace command to do the editing i cannot remove it. it is not just saving the editing in original file that i neede to solve otherwise this whole command is working

ADD REPLY • link 2.2 years ago by Priya ▴ 20

0

Entering edit mode

Your post is not helpful in addressing the issue.

ADD REPLY • link 2.2 years ago by cpad0112 21k

0

Entering edit mode

sorry for ambiguity, now i have edited again, i hope it's more understable now.

ADD REPLY • link 2.2 years ago by Priya ▴ 20

0

Entering edit mode

i did the editing in header of fasta files but the editing is not saved in original files, whenever i open my fasta files there is not edited header that i did with the alternate awk command-

for i in $(find -name \genomid); do echo $i|\ awk ' FNR==NR{ a[$1]=$2 next } ($2 in a) && /^>/{ print ">"a[$2] next } 1 ' proid_unique FS="[> ]" *.faa; done

ADD REPLY • link 2.2 years ago by Priya ▴ 20

0

Entering edit mode

Loop code is incorrect. Variables are not used, Inputs are incorrect and there are no outputs in loop. It seems to me that your seqkit replacement regex is complex. Try this:

for i in *.faa; do echo modifying $i;  seqkit -w 0 --quiet replace -p "^(\w+\.\d+).*" -r "{kv}" -k proid_unique
 $i -o ${i/\.faa/\_out.fa};  done

Assumptions:

All sequence (fasta) files end with .faa extension.
All sequences in input files have similar header/ID pattern
proid_unique contains key-value pair values matching with headers from sequences

New files will have '.fa' as extension and end with "_out.fa".

ADD REPLY • link 2.2 years ago by cpad0112 21k