Fasta header trimming
4
3
Entering edit mode
7.2 years ago

I have a fasta file with hundreds of sequences and their respective headers. The headers (all of them) are in the format

>ABCD [id_123] (gene_XYZ) [protein_ijk] [protein_id=qqq] [123..899]

.......seqeunce............

>EFGH [id_999] (gene_PQR) [protein_tre] [protein_id=trs] [573..789]
......seqeunce............

and so on.....

For the header every info in parenthesis are continuous and are only separated by a single space each (just as written above). All I want to do is retain "ABCD" (the very first info) in the header corresponding to every sequence . I want to loop through all the headers that are present in the file and return something like this :

>ABCD   
.....sequence.....
>EFGH  
.....sequence.......

and so on......Any help is most appreciated and i am working with BASH and perl.

Thank and regards!

sequence • 12k views
ADD COMMENT
1
Entering edit mode

With reformat.sh from BBMap suite: reformat.sh in=your.fa out=new.fa trd=t

ADD REPLY
0
Entering edit mode

I dont know why the sequence is showing next to the header when i posted this here! Of course it is a fasta file and hence the sequences are directly below the headers.

ADD REPLY
0
Entering edit mode

I have reformatted your post to show the correct format of fasta files.

ADD REPLY
0
Entering edit mode

Okay. Thanks for that...

ADD REPLY
17
Entering edit mode
7.2 years ago
GenoMax 141k

In bash: cut -d ' ' -f1 your_file.fa > new_file.fa

ADD COMMENT
0
Entering edit mode

Worked well.....thanks!

ADD REPLY
0
Entering edit mode

That's Wizard level amazingness there!!! Thanks!

ADD REPLY
1
Entering edit mode
7.2 years ago

Lightning fast solution using seqkit (usage of seqkit seq), just download the binary file for Linux, decompress and immediately run:

./seqkit seq -i seqs.fa > seqs2.fa
ADD COMMENT
1
Entering edit mode
7.2 years ago

If you want a fast and generic/flexible approach to parsing FASTA, use awk:

$ awk 'BEGIN{RS=">";}NR>1{ split($1,a," "); print ">"a[0]"\n"$2; }' in.fasta > out.fasta
ADD COMMENT
0
Entering edit mode

Thanks for all your replies ! However, I was looking to retain the actual sequences also along with the shortened header. Infact I was myself able to shorten the header to the first word before posting this here. BUt the problem was I could not figure out a way to shorten the header and keep the sequences intact. So, I was wondering if there is any way I could also retain the actual sequences along with the headers. In short- I want to retain the entire sequence intact for all the entries and just shorten the name of the headers.

Meaning :

ABCD [id_123] (gene_XYZ) [protein_ijk] [protein_id=qqq] [123..899] .......seqeunce............

will now look like

ABCD .....sequence.....

and I need to do this for the whole fasta file (all headers and sequences)

Thanks again!

ADD REPLY
0
Entering edit mode

So you don't have a standard fasta format sequence file where first line is an identifier >some_id and the sequence follows on the second line? If you had said this yesterday then I would have reset the formatting. My apologies.

Your sequence is present on the same line as the identifier and you want to keep it on that line after shortening the header. Is that correct?

If the length of the extra stuff is always the same in all sequences then see if this works cut -d ' ' -f1,7 your.fa > new.fa

I am leaving this post here since child posts will disappear if I delete this. Content is no longer applicable.

ADD REPLY
0
Entering edit mode

sorry again genomax2, the formatting got disrupted again in my posting today. It is exactly the way it was posted yesterday. (top of my post) the sequence IS on the second line (beneath the header) as it should be in a regular fasta file

Alex,

Here is a snippet from my actual fasta file:

>ABCD_gene_1 [gene=XYZ] [location=1..2231]
AGAGTGTATTGATGAGTCTCCATGGGAATGTGAAATGGCTGAAATGTTTGAAGAAACTGTTTTAGGGGAT
AGGAGACTTGGGAGTATGTATGGGGTGGTTGGATTTGTCACATTGATCTTGTTGGGGACATGCATGTAAA
...............................

>PQRS_gene_2 [gene=PQR] [location=1..2618]
CGCGCGGCCCAGGAAGTGCGCGCTGTTGCCCCGGAAGTGCACGCGCGGGGTCACCGGAAGCGGCGCGTCG
GGAGGATGCCGCTGCCTGTCCAGGTCTTCAACCTGCAGGGTGCAGTGGAGCCCATGCAGATTGATGTGGA
........

The list goes on for another several hundred sequences. The final output that I am looking for is:

>ABCD_gene_1
AGAGTGTATTGATGAGTCTCCATGGGAATGTGAAATGGCTGAAATGTTTGAAGAAACTGTTTTAGGGGAT
AGGAGACTTGGGAGTATGTATGGGGTGGTTGGATTTGTCACATTGATCTTGTTGGGGACATGCATGTAAA
...............................

>PQRS_gene_2
CGCGCGGCCCAGGAAGTGCGCGCTGTTGCCCCGGAAGTGCACGCGCGGGGTCACCGGAAGCGGCGCGTCG
GGAGGATGCCGCTGCCTGTCCAGGTCTTCAACCTGCAGGGTGCAGTGGAGCCCATGCAGATTGATGTGGA
........

and so on for all thee headers and sequences... for the rest of the whole fasta file.

I am hoping the post will come up correctly formatted this time. Otherwise, please know that my file looks just like regular fasta file (as correctly formatted by genomax2 yesterday).Hope this helps!

Thanks !

ADD REPLY
0
Entering edit mode

For future reference, it is safer to use the "code" formatting tool (101010 button) when formatting things like code/file formats. I have done this for you (and also reset the format original question).

So you do have a normal fasta file (since there is no formal spec for fasta this would do).

Edit: @Alex's solution as posted above did not work with the example data posted today (which is not the same as the original post).

ADD REPLY
0
Entering edit mode

Thanks genomax2 ! Infact your posted asnwer (cut command) from yesterday is doing the job fine. I must have erred on something, which gave me a different result last night and I appreciate you pointing out the formatting tool button. I am new to the forum and I will surely take care of these stuff before posting next time.

My issue is solved- and thanks to all who took time in contributing your answers. I learnt several new ways in managing such scripting situations for the future. Thanks everyone!

ADD REPLY
0
Entering edit mode

My awk statement should shorten headers and preserve sequences in FASTA files. I'm unclear what the issue is on your side, but if you want to post a snippet of your file and what results you're getting, I'd be happy to try to help.

ADD REPLY
0
Entering edit mode

Hey Alex,

I have a similar problem, could you please help. I also want to short my fasta file header, which is like below, I listed two sequences.

>lcl|VSMA01000001.1_prot_KAB5584702.1_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198,JGIDB:Conioc1_1165795] [protein=tetrahydroxynaphthalene reductase] [protein_id=KAB5584702.1] [location=join(1826..1931,1988..2458,2736..2863,2927..3064)] [gbkey=CDS]
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
>lcl|VSMA01000001.1_prot_KAB5584703.1_2 [locus_tag=GE09DRAFT_1165796] [db_xref=InterPro:IPR021840,JGIDB:Conioc1_1165796] [protein=hypothetical protein] [protein_id=KAB5584703.1] [location=complement(join(3193..3215,3871..4374,4440..5628,5725..5886,5941..5989,6050..6066,6130..6234,6286..6495,6547..6561,6622..6728,6843..7103,7155..7719))] [gbkey=CDS]
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP
DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD

And I only want to keep the header like this:

>GE09DRAFT_1165795
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM

I would be super greatful for your help.

Thanks. Yanfang

ADD REPLY
0
Entering edit mode

See if this works:

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < your.fa | awk ' {if ($0 ~ /^>/) {split($0,a,"="); split(a[2],b,"]"); print ">"b[1]} {print $9} }' | tr "\t" "\n"  | fold -w 80 
>GE09DRAFT_1165795
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
>GE09DRAFT_1165796
MFHPSRRRAEQTAYEYNIQATEDHEHDHGVVNLSAEKRRRPRGKRPNYKPTALKWPFIVAQILVLVIAMGLIIWAEKAMP
DSDSTAIIDPLPSKGLPERSVKPEFGKHFRRDNTSGVVETATSQLDVQETTLTGGDGLITPGLGSTNGPADNVKTAVTDD
ADD REPLY
0
Entering edit mode
7 months ago
Ankit • 0

python3.9 code

import os

#give input filename on path i.e main file to change

file = open("/home/ankit/RWork/GISAID/hCoV-19_spikenuc0810/spikenuc0810_seqkit_output.fasta",'r')

#give header index size e.g here first 47 characeters of header will be printed
header_index_upto = 47

#default output filename is "spikenuc0810_Final.fasta" but you can replace...
if "spikenuc0810_Final.fasta" not in os.listdir("/home/ankit/RWork/GISAID/hCoV-19_spikenuc0810/"):
    print("[Creating file...]\n[Please wait...]\n")
    #output file name
    create_file = open('/home/ankit/RWork/GISAID/hCoV-19_spikenuc0810/spikenuc0810_FINAL.fasta','a')

    for i in file.readlines():
        # print(i.strip())
        if i[0:1] == '>':
            create_file.write("\n"+i[0:header_index_upto]+"\n")
            # print(i[0:49])
        else:
            create_file.write(i.strip())
            # print(i.strip())
    create_file.close()
    print("[Done...]")

else:
    print("File already existed...\nTerminating the process")

Copy paste this code in your text editor and save the file with .py extension.

then run:

python3.9 saved_code.py
ADD COMMENT

Login before adding your answer.

Traffic: 2937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6