Parsing gbk file
4.8 years ago
erick_rc93 ▴ 30

I have multiple GenBank (.gbk) files. Each one is a concatenation of several chromosomes and plasmids, and I would like to split each into single-record files. I'm trying the following awk code:

awk -v n=1 '/^\/\//{close("out"n);n++;next} {print > "out"n}' filename.gbk

I'd like the output files to be named after the input file:

filename_1.gbk
filename_2.gbk
filename_3.gbk
shell • 959 views

I would strongly suggest using a proper parser like BioPython for this.

If for some reason you cannot, it should be sufficient to split the files up between the LOCUS and // lines.
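As a rough sketch of that LOCUS-to-// split, the awk below writes each record to a numbered file named after the input, which is the naming asked for in the question. The function name split_gbk is made up for illustration, and the sketch assumes every record starts with a LOCUS line and ends with a // line.

```shell
# Split multi-record GenBank files into <input basename>_<n>.gbk files.
# split_gbk is a hypothetical helper name, not part of any existing tool.
split_gbk() {
    awk '
        FNR == 1 {                          # first line of each input file: reset state
            if (out != "") close(out)
            out = ""; n = 0
            base = FILENAME
            sub("\\.[^./]*$", "", base)     # drop the extension, e.g. ".gbk"
        }
        /^LOCUS/ {                          # a record starts here: pick the next output name
            n++
            out = base "_" n ".gbk"
        }
        out != "" { print > out }           # copy lines that belong to a record
        /^\/\// {                           # "//" ends the record (it is kept in the output)
            if (out != "") close(out)
            out = ""
        }
    ' "$@"
}
```

Calling split_gbk filename.gbk should then produce filename_1.gbk, filename_2.gbk, and so on, next to the input file. Unlike the awk in the question, this keeps the closing // line, which GenBank parsers generally expect.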

4.8 years ago
 wget -O - "ftp://ftp.ncbi.nlm.nih.gov/genbank/gbpln64.seq.gz" | gunzip -c | \
awk 'BEGIN{fname="";} /^LOCUS/ {close(fname);fname=sprintf("%s.gbk",$2);} {if(fname!="") print $0 >> fname; }'

$ ls *.gbk | head
CR354457.gbk
CR354458.gbk
CR354459.gbk
CR354460.gbk
CR354461.gbk
CR354462.gbk
CR354463.gbk
CR354464.gbk
CR354465.gbk
CR354466.gbk

EDIT: if you want the input file name in the output name, build fname with:

sprintf("%s.%s.gbk",FILENAME,$2);
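For a local file, the whole command with that sprintf might look like the sketch below. The function name split_gbk_by_locus is hypothetical. FILENAME only carries a useful name when awk is given the file as an argument rather than reading from a pipe, and the sketch uses > instead of >> so that rerunning it overwrites earlier output instead of appending to it.

```shell
# Write each record to <input name>.<LOCUS name>.gbk.
# split_gbk_by_locus is a hypothetical helper name used only for this example.
split_gbk_by_locus() {
    awk '
        FNR == 1 {                          # new input file: forget the previous output
            if (fname != "") close(fname)
            fname = ""
        }
        /^LOCUS/ {                          # $2 of the LOCUS line is the record name
            if (fname != "") close(fname)
            fname = sprintf("%s.%s.gbk", FILENAME, $2)
        }
        fname != "" { print > fname }       # ">" truncates the file the first time it is opened
    ' "$@"
}
```

Running split_gbk_by_locus filename.gbk on a file whose first record is CR354457 would give filename.gbk.CR354457.gbk, since FILENAME includes the original extension.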