Question: Editing header of a fasta file

Hello everybody, I am not an expert with sed, and I am sure that someone will do this work faster and better than me.

I would like to edit multiple fasta headers from this format:

>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|

to this one:

>M636

As you can see, "M636" is embedded in the longer header.

Thank you for always helping everybody!

D.

ADD COMMENTlink 2.8 years ago DVR • 20 • updated 2.8 years ago 5heikki 8.4k

Sure. But did you try anything?

Just a comment:

"someone will do this work faster and better than me"

There will always be people ahead of us. But this shouldn't hinder our learning :)

ADD REPLYlink 2.8 years ago
venu
6.2k

Hello @venu

Sure, I am trying with sed (since awk seems to be more complicated). This post from @noirot.celine gave me some ideas of what to do (https://www.biostars.org/p/53212/#188850). But I am having a hard time figuring out the regular expressions.

I understand your concern. I will keep trying... :)

ADD REPLYlink 2.8 years ago
DVR
• 20

Please use ADD COMMENT or ADD REPLY to respond to earlier posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLYlink 2.8 years ago
WouterDeCoster
39k

Thank you all for your solutions,

Seidel's proposal worked beautifully!

Thanks also @novice; maybe I did not explain myself well, but the "M" may change to another letter or be absent.

ADD REPLYlink 2.8 years ago
DVR
• 20

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

ADD REPLYlink 2.8 years ago
WouterDeCoster
39k

If your field of interest is always in the same position, you can try awk with the split() function and do something like the following:

awk '{split($0,a,":"); if(a[10]) print ">"a[10]; else print; }' yourfile.fasta > newfile.fasta

This translates to: split the input line on colons and put the pieces into array a; if the 10th element, a[10], is non-empty, print ">" followed by that element; otherwise print the input line unchanged.
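As a quick check of the idea, here is a minimal sketch that runs the same split() logic on the header from the question. The function name rename_headers and the sequence line ACGTACGT are made up for illustration, and the sketch also assumes that only lines starting with ">" should be rewritten, which is slightly stricter than the one-liner above.

```shell
# rename_headers is a hypothetical wrapper around the awk approach above.
# Assumption: only lines starting with ">" are headers; everything else
# (sequence lines) is passed through untouched.
rename_headers() {
    awk '
        /^>/ {
            split($0, a, ":")          # a[1]..a[n] hold the colon-separated pieces
            if (a[10] != "") {         # 10th piece present: use it as the new header
                print ">" a[10]
                next
            }
        }
        { print }                      # sequence lines and short headers stay as-is
    ' "$@"
}

# Demo: the header from the question plus an invented sequence line.
printf '%s\n' \
    '>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|' \
    'ACGTACGT' | rename_headers
```

On this input the header line should come out as >M636, with the sequence line unchanged. Because the field is picked by position, it does not matter whether the ID starts with "M", another letter, or no letter at all.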

ADD COMMENTlink 2.8 years ago seidel 6.8k

Marked as accepted because OP didn't...

ADD REPLYlink 2.7 years ago
WouterDeCoster
39k

I think the answer above is safe and correct. Here's a sed solution that is less safe (we don't know what is changing between the headers):

$ sed 's/>.*:\(M[0-9]*\):.*/>\1/' old.fasta > new.fasta

Edit: excluded colons from headers.
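Since the original poster mentions that the leading letter may differ or be missing, a variant that does not hard-code "M" might look like the sketch below. It rests on an assumption about the header layout that this thread does not confirm: that the ID is the field right after a read:filter-flag:control-number group (here 1:N:0:) following the first space. The function name extract_sample_id is made up for illustration.

```shell
# extract_sample_id is a hypothetical helper name.
# Assumed header layout (not confirmed in this thread):
#   >INSTRUMENT_PART<space>READ:FLAG:CONTROL:ID:...
# where FLAG is Y or N and ID contains neither ":" nor a space.
extract_sample_id() {
    # Basic regular expression, so no -E or -r flag is needed.
    sed 's/^>[^ ]* [0-9]*:[YN]:[0-9]*:\([^: ]*\).*/>\1/' "$@"
}

# Demo: the header from the question plus an invented sequence line.
printf '%s\n' \
    '>M01380:50:000000000-AV1DH:1:1101:16094:3001 1:N:0:M636:16S_V1V3 TTCTGCCT|0|TAGACCTA|0 CS1_534R_YM3_for|3|27|' \
    'ACGTACGT' | extract_sample_id
```

Lines that do not match the assumed layout, including sequence lines, are left unchanged, so a header in a different format would pass through silently rather than being mangled.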

ADD COMMENTlink 2.8 years ago novice • 920
cut -f1 -d ":" in.fa > out.fa

Edit: never mind, I thought you wanted the first "field".

ADD COMMENTlink 2.8 years ago 5heikki 8.4k
