extract same all similar sequences in FASTA based on the header
1
0
Entering edit mode
7.4 years ago
Eva_Maria ▴ 190

Hello I downloaded more than 10k cds from NCBI and Headers look like this

lcl|DS180867.1_cds_EDM56551.1_2 [protein=hemolysin] [protein_id=EDM56551.1] [location=complement(<1..1012)]

lcl|DS180866.1_cds_EDM56342.1_3 [protein=phosphogluconate dehydratase] [protein_id=EDM56342.1] [location=complement(279..>893)]

lcl|DS180865.1_cds_EDM56120.1_4 [protein=3-hydroxyisobutyrate dehydrogenase] [frame=3] [protein_id=EDM56120.1] [location=complement(162..>559)]

lcl|DS180863.1_cds_EDM56465.1_5 [protein=hemolysin] [protein_id=EDM56465.1] [location=218..>977]

lcl|DS180862.1_cds_EDM56350.1_6 [protein=phosphogluconate dehydratase] [protein_id=EDM56350.1] [location=complement(<1..857)]

i want to extract same all similar sequences (based on the header) into a new fasta file based on protein name (example hemolysin)

please suggest any tool or programme for this

sequence alignment fasta • 3.6k views
ADD COMMENT
0
Entering edit mode

Have you tried anything?

ADD REPLY
0
Entering edit mode

Yah i tried with AWK but its not working properly

ADD REPLY
2
Entering edit mode

can you post your awk code, so that people here can fix the problem in it. I feel that this would be a good idea instead of directly getting a working solution.

ADD REPLY
0
Entering edit mode

awk -F "|" '/^>/ {F = $3".fasta"} {print > F}' input.fasta>out.fasta

ADD REPLY
0
Entering edit mode

How do you plan to define similarity based on the header? By the [protein=...] tag?

ADD REPLY
1
Entering edit mode
7.4 years ago

Hi @akhilvbioinfo, remember SeqKit in your last post?

In which, we sort FASTA sequences by protein names? Here we can also split FASTA records by protein names again.

Same strategy: use flag --id-regexp to specify the protein name as the sequence identifier.

$ seqkit split --by-id --id-regexp "\[protein=(.+?)\]" seqs.fa
[INFO] split by ID. idRegexp: \[protein=(.+?)\]
[INFO] read sequences ...
[INFO] read 5 sequences
[INFO] write 2 sequences to file: seqs.fa.split/seqs.id_hemolysin.fa
[INFO] write 2 sequences to file: seqs.fa.split/seqs.id_phosphogluconate dehydratase.fa
[INFO] write 1 sequences to file: seqs.fa.split/seqs.id_3-hydroxyisobutyrate dehydrogenase.fa

Very simple, right?

See other 18 sub commands provided by SeqKit, you may need it again :)

ADD COMMENT
0
Entering edit mode

its very simple sir but i got an error like this

[ERRO] open seqs.fna.split/seqs.id_ketopantoate reductase PanE/ApbA family protein.fna: no such file or directory

ADD REPLY
0
Entering edit mode

I changed headers using this comment ( Seq tool kit wont accept "/" characters)

sed -i -e 's/\//-/g'seqs.fna

Now its working perfectly

Thank you

ADD REPLY
1
Entering edit mode

it's a bug and already fixed since v0.3.8, please download the latest version.

ADD REPLY

Login before adding your answer.

Traffic: 2039 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6