Extract Features From Gbk Via Bash
10.1 years ago
Phil S. ▴ 700

Hi

Is there a way to extract the locus_tag, join, and complement information from a GenBank file with a bash one-liner? My problem is the line break that can occur inside the join or complement field of a GenBank entry.

Thanks in advance.

Phil

P.S. The lines I want to extract look like this:

/locus_tag="XXX_00002"


join(5485..5700,5784..6116,6230..6377,6434..6574,

6683..6819,7047..7314,7428..7938,8031..8307)


complement(join(34726..34766,34850..34866,34975..35287,

35392..35518,35604..35744,35831..35928,36051..36156))
bash genbank feature • 3.2k views

Maybe, but it would be a lot easier to use something like Biopython or BioPerl, which can parse GenBank files.

Historically, this is why Perl and Python exist in the first place: so that we don't have to process text in bash.

Normally I solve things like this with Biopython. I wasn't working on my own laptop, though, so I just needed a neat trick to solve it this one time. Thanks for your suggestions anyway!

10.1 years ago
xb ▴ 420

sed is one option. The commands below rely on GNU sed (\? and \+ in basic regular expressions, and the -r flag).

The sample file:

cat sample.gbk

## output omitted, same as in the question

To drop the empty lines and join the lines that were wrapped after a comma:

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D' 

## output
/locus_tag="XXX_00002"
join(5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307)
complement(join(34726..34766,34850..34866,34975..35287,35392..35518,35604..35744,35831..35928,36051..36156))
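A possible variation, not part of the original answer: the N;...;P;D command above joins one wrapped line per location, which is enough for the sample. For a location that wraps over three or more lines, or whose continuation lines are indented, a small loop is one way to go. This is only a sketch; the function name join_wrapped is made up for illustration, and it assumes that a wrapped line always ends in a comma.

```shell
# Hypothetical helper (name chosen for this example, not from the thread).
# Joins every line that ends in a comma with the line(s) that follow it.
join_wrapped() {
  # Step 1: drop empty or whitespace-only lines.
  # Step 2: while the pattern space ends in a comma and more input exists,
  #         append the next line (N), remove the newline together with any
  #         indentation of the continuation, and loop back to :a.
  sed '/^[[:space:]]*$/d' "$@" | sed '
:a
/,[[:space:]]*$/{
$!N
}
s/,[[:space:]]*\n[[:space:]]*/,/
ta
'
}
```

It can be called as join_wrapped sample.gbk, or used at the end of a pipe without a file argument.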

From there, there are many ways to capture the contents. Here is one way to capture the "join" information:

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D' | sed -n 's/^join(\(.\+\))$/\1/p'

## output
5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307

Or extract all of the information with a one-liner:

sed '/^$/d' sample.gbk | sed 'N;/,\n/s/"\? *\n//;P;D' | tr "\n" ";" | sed -r 's/^\/locus_tag="(.+)".*join\(([^()]+)\).*complement\(join\(([^()]+)\)\).*/\1\n\2\n\3\n/g'

## output
XXX_00002
5485..5700,5784..6116,6230..6377,6434..6574,6683..6819,7047..7314,7428..7938,8031..8307
34726..34766,34850..34866,34975..35287,35392..35518,35604..35744,35831..35928,36051..36156
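Another sketch, again not from the original answer: the tr approach collapses the whole input into a single line, so it fits a file with exactly one locus_tag, one join and one complement(join(...)), in that order. If the file holds several records, a line-by-line extraction may be easier to reason about. The function name extract_fields is hypothetical, and the sketch assumes the wrapped lines have already been joined so that each field sits on one line.

```shell
# Hypothetical helper (name chosen for this example, not from the thread).
# Reads already-joined lines on stdin and prints, one per line, the value of
# each locus_tag and the coordinates inside join(...) or complement(join(...)).
extract_fields() {
  # -n suppresses default output; the p flag prints only lines where a
  # substitution succeeded. Plain POSIX basic regular expressions are used,
  # so ( and ) are literal and \( \) form the capture group.
  sed -n \
    -e 's/^[[:space:]]*\/locus_tag="\(.*\)"$/\1/p' \
    -e 's/^[[:space:]]*complement(join(\(.*\)))$/\1/p' \
    -e 's/^[[:space:]]*join(\(.*\))$/\1/p'
}
```

For example, the joined output shown earlier can be piped straight into extract_fields. Note that the strand information is lost in this form, since complement coordinates are printed the same way as plain join coordinates.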

Thank you, that works really nicely!
