Extracting accession number from header using sed
2.9 years ago
ToastedGoat • 10

Hello! I'm trying to figure out how to extract the accession numbers from the headers in my file (about 120 headers). I have to use sed and can't seem to figure it out. Here is a sample of what my file looks like:

>Ref.49_cpx.GM.03.N26677.HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

I need the part after the last period in the header. So the "HQ385479" part. Thanks in advance for the help!

cut -s -d '.' -f6 input.txt

I need the part after the last period in the header.

Do you want to keep the rest of the alignments intact? I assume so but please clarify.

Edit: Looks like you want to keep just the accessions based on a response below.

You could do this (if all accession lines start with Ref):

grep "^>Ref" input.txt | sed 's/^.*\.//g' > accession


Ah yes completely forgot I could use grep first. Thanks.


It's not necessary to use grep. Input (I copy/pasted the first sequence and changed the ID at the end to make a second sequence):

>Ref.49_cpx.GM.03.N26677.HQ385479 
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
>Ref.49_cpx.GM.03.N26677.HQ385478
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

output:

$ sed -e  '/>/!d; s/.*\.//g' test.fa 
HQ385479 
HQ385478
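
For anyone reading later, a short note on why this works: `.*` is greedy, so `s/.*\.//` removes everything up to and including the last period on the line, and `/>/!d` deletes every non-header line first. The first sample header above ends with a space, and that space survives into the output. The variant below is only a sketch that also trims trailing whitespace; the `fasta` and `accessions` variable names are made up for illustration, and with a real file you would pass the file name to sed instead of piping.

```shell
# Small two-record FASTA sample held in a variable (hypothetical data layout).
# The first header deliberately ends with a space, as in the sample above.
fasta=$(printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479 ' \
  'ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'G------TGG---------------------------AGAGGGGGTCTC')

# -n suppresses automatic printing, so only lines we explicitly print appear.
# /^>/ selects header lines; inside the braces:
#   s/.*\.//           greedy match removes everything through the last period
#   s/[[:space:]]*$//  trims any trailing whitespace
#   p                  prints the cleaned accession
accessions=$(printf '%s\n' "$fasta" | sed -n '/^>/{s/.*\.//;s/[[:space:]]*$//;p;}')

printf '%s\n' "$accessions"
```

Trimming matters if the list is going to be pasted into a table or used for lookups, where "HQ385479 " and "HQ385479" would not compare equal.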

Thank you so much for this!

5 months ago
Joe 12k
United Kingdom

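Since the OP has specifically requested sed (with GNU sed, -i edits input.txt in place and prints nothing to the terminal):

$ sed -i 's/^.*\.//g' input.txt

After this, input.txt contains: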

HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

Thanks! Is there a way within the sed command to remove all the nucleotide sequences as well so I'm just left with all the accession numbers? This is my first time doing any bioinformatics and I am still learning the whole programming/coding side of it all.


You can use this:

awk -F. 'NF>1{print $NF}' input.txt > output.txt
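
In case it helps to see how the awk one-liner works: `-F.` splits each line on periods, `NF>1` keeps only lines that contain at least one period, and `$NF` is the last field. One caveat, offered as a sketch rather than a correction: some alignment formats use `.` inside sequence lines (for gaps or matches), and such lines would pass the `NF>1` test too. Matching on the leading `>` avoids that. The sample data below, including the sequence line with periods, is invented for illustration, as are the variable names.

```shell
# Hypothetical sample: the second line is a sequence line that contains periods.
fasta=$(printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTGATGG...ATTTGCAAAATTG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'G------TGG-----AGAGGGGGTCTC')

# Original test: any line with more than one period-separated field.
# This also picks up the tail of the dotted sequence line.
loose=$(printf '%s\n' "$fasta" | awk -F. 'NF>1{print $NF}')

# Stricter test: only lines that start with ">" (FASTA headers).
strict=$(printf '%s\n' "$fasta" | awk -F. '/^>/{print $NF}')

printf 'NF>1 version:\n%s\n\n' "$loose"
printf 'header-only version:\n%s\n' "$strict"
```

If the sequence lines never contain periods, as in the sample posted in the question, the two versions give the same result.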

For future reference: highlight the text you want to format as code and then click on the "101" button in the edit window to apply the formatting.


Is your file a FASTA-formatted file (header lines beginning with >)? Or is it exactly as you posted above?


It begins with >. It didn't copy in correctly.


I would just chain it to grep personally, but now the solution is getting a bit less elegant.

cat input.txt | grep ">" | sed 's/^.*\.//g'

A very minor change to the code: use > as the replacement so that the sequence is still in FASTA format.

$ sed -e 's/^.*\./>/g' test1.fa 
>HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
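
If the goal is to keep the file in FASTA format with shortened headers, it may be safer to restrict the substitution to header lines, so a sequence line that happens to contain a period is never touched. The following is a minimal sketch under that assumption; the `fasta` and `renamed` variable names are hypothetical, and with a real file you would give sed the file name directly.

```shell
# One-record sample held in a variable (hypothetical layout).
fasta=$(printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG' \
  'G------TGG---------------------------AGAGGGGGTCTC')

# The /^>/ address applies the substitution to header lines only.
# .*\. greedily matches through the last period and is replaced by ">",
# so the header becomes ">" followed by the accession.
# Sequence lines pass through unchanged.
renamed=$(printf '%s\n' "$fasta" | sed '/^>/s/.*\./>/')

printf '%s\n' "$renamed"
```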

The OP said they don't want the sequence, just the accession itself (I'm assuming they're making a list for a table or similar), so there's no need to sub in the ">".


Okay, I didn't read the OP in full :)

18 months ago
bk11 • 30
 awk -F. 'NF>1{print $NF}' input.txt