Question

extract same all similar sequences in FASTA based on the header

0

Entering edit mode

7.4 years ago

Eva_Maria ▴ 190

Hello I downloaded more than 10k cds from NCBI and Headers look like this

lcl|DS180867.1_cds_EDM56551.1_2 [protein=hemolysin] [protein_id=EDM56551.1] [location=complement(<1..1012)]

lcl|DS180866.1_cds_EDM56342.1_3 [protein=phosphogluconate dehydratase] [protein_id=EDM56342.1] [location=complement(279..>893)]

lcl|DS180865.1_cds_EDM56120.1_4 [protein=3-hydroxyisobutyrate dehydrogenase] [frame=3] [protein_id=EDM56120.1] [location=complement(162..>559)]

lcl|DS180863.1_cds_EDM56465.1_5 [protein=hemolysin] [protein_id=EDM56465.1] [location=218..>977]

lcl|DS180862.1_cds_EDM56350.1_6 [protein=phosphogluconate dehydratase] [protein_id=EDM56350.1] [location=complement(<1..857)]

i want to extract same all similar sequences (based on the header) into a new fasta file based on protein name (example hemolysin)

please suggest any tool or programme for this

sequence alignment fasta • 3.6k views

ADD COMMENT • link updated 7.4 years ago by shenwei356 8.4k • written 7.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

Have you tried anything?

ADD REPLY • link 7.4 years ago by venu 7.1k

0

Entering edit mode

Yah i tried with AWK but its not working properly

ADD REPLY • link 7.4 years ago by Eva_Maria ▴ 190

2

Entering edit mode

can you post your awk code, so that people here can fix the problem in it. I feel that this would be a good idea instead of directly getting a working solution.

ADD REPLY • link 7.4 years ago by venu 7.1k

0

Entering edit mode

awk -F "|" '/^>/ {F = $3".fasta"} {print > F}' input.fasta>out.fasta

ADD REPLY • link 7.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

How do you plan to define similarity based on the header? By the [protein=...] tag?

ADD REPLY • link 7.4 years ago by cschu181 ★ 2.8k

score 1 · Answer 1 · 2016-11-25

1

Entering edit mode

7.4 years ago

shenwei356 8.4k

Hi @akhilvbioinfo, remember SeqKit in your last post?

In which, we sort FASTA sequences by protein names? Here we can also split FASTA records by protein names again.

Same strategy: use flag --id-regexp to specify the protein name as the sequence identifier.

$ seqkit split --by-id --id-regexp "\[protein=(.+?)\]" seqs.fa
[INFO] split by ID. idRegexp: \[protein=(.+?)\]
[INFO] read sequences ...
[INFO] read 5 sequences
[INFO] write 2 sequences to file: seqs.fa.split/seqs.id_hemolysin.fa
[INFO] write 2 sequences to file: seqs.fa.split/seqs.id_phosphogluconate dehydratase.fa
[INFO] write 1 sequences to file: seqs.fa.split/seqs.id_3-hydroxyisobutyrate dehydrogenase.fa

Very simple, right?

See other 18 sub commands provided by SeqKit, you may need it again :)

ADD COMMENT • link 7.4 years ago by shenwei356 8.4k

0

Entering edit mode

its very simple sir but i got an error like this

[ERRO] open seqs.fna.split/seqs.id_ketopantoate reductase PanE/ApbA family protein.fna: no such file or directory