How to identify Gap location
12 months ago
Takuma ▴ 20

To submit assembled genome scaffolds to DDBJ, I have to indicate the regions of sequencing gaps.

For example:

scaffold1_cov134        assembly_gap    1647..1712
                        assembly_gap    9101..9259
scaffold3_cov149        assembly_gap    1173..1187

Doing this by hand seems very time-consuming. Does anyone know how to identify the gap (N) locations in each scaffold?

12 months ago

Try seqkit locate:

$ seqkit locate -P -i -r -G -p 'n+' genome.fa \
    | sed 1d | awk '{print $1"\tassembly_gap\t"$5".."$6}'
scaffold1_cov134        assembly_gap    7..9
scaffold3_cov149        assembly_gap    7..9
scaffold3_cov149        assembly_gap    16..16
scaffold3_cov149        assembly_gap    19..19

To remove the repeated sequence IDs, so that each ID appears only on its first gap line:

$ seqkit locate -P -i -r -G -p 'n+' genome.fa \
    | sed 1d | awk '{print $1"\tassembly_gap\t"$5".."$6}' \
    | awk 'BEGIN{FS=OFS="\t"} {cnt[$1]++; if(cnt[$1]>1)$1=""; print}'
scaffold1_cov134        assembly_gap    7..9
scaffold3_cov149        assembly_gap    7..9
                        assembly_gap    16..16
                        assembly_gap    19..19
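
If seqkit is not available, the same table can be sketched with plain awk. This is only a minimal sketch, not part of seqkit: the function name find_gaps is hypothetical, it assumes a standard FASTA file with Unix line endings, and it treats every run of N or n as one gap. Positions are 1-based and inclusive.

```shell
# find_gaps FILE
# Hypothetical helper: prints "seqID<TAB>assembly_gap<TAB>start..end" for
# every run of N/n in FILE. The ID is printed only on the first gap line
# of each sequence; later lines for that sequence start with an empty field.
find_gaps() {
    awk '
    # Scan one complete sequence and report each run of N/n.
    function scan(id, seq,    offset, first) {
        offset = 0   # number of bases already consumed from the sequence
        first = 1    # true until the first gap of this sequence is printed
        while (match(seq, /[Nn]+/)) {
            printf "%s\tassembly_gap\t%d..%d\n", (first ? id : ""), \
                offset + RSTART, offset + RSTART + RLENGTH - 1
            first = 0
            offset += RSTART + RLENGTH - 1
            seq = substr(seq, RSTART + RLENGTH)
        }
    }
    # Header line: flush the previous record, then start a new one.
    /^>/ {
        if (id != "") scan(id, seq)
        id = substr($1, 2)   # ID is the first word after ">"
        seq = ""
        next
    }
    # Sequence line: join wrapped lines into one string.
    { seq = seq $0 }
    END { if (id != "") scan(id, seq) }
    ' "$1"
}
```

Calling find_gaps genome.fa writes the tab-separated lines to standard output. The sketch holds each scaffold in memory as one string, so for very large scaffolds with many gaps seqkit is likely to be the faster choice.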
Thanks, shenwei356. I did it. I appreciate your helpful comment and seqkit!
