Question

Any Script To Parse Fasta Headers?

4

11.2 years ago

SK ▴ 110

I have a 47GB file to parse. The sequences are in the following format:

>TSCS_00041 gene0EA_12345_rframe2_ORF
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044 gene0EA_12341_rframe2_ORF
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

I simply want to format this file like:

>TSCS_00041
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

Could anyone share a script to do this?

fasta next-gen perl • 10.0k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 11.2 years ago by SK ▴ 110

5

What have you tried? Hint: 'cut'

ADD REPLY • link 11.2 years ago by Pierre Lindenbaum 161k

0

Can this be done with cut alone? The OP seems to want to shorten only the FASTA headers, not the other lines.

ADD REPLY • link 11.2 years ago by Istvan Albert 100k

5

cut -d" " -f 1 will work as long as there are no spaces in the sequence lines.

ADD REPLY • link 11.2 years ago by brentp 24k

0

and that's how homework is done ;)

ADD REPLY • link 11.2 years ago by Jorge Amigo 14k

score 10 · Answer 1 · 2013-02-05

10

11.2 years ago

Frédéric Mahé ★ 3.2k

Assuming that you do not have spaces in your sequences you can try that:

sed -e '/^>/ s/ .*//' mybigfile

Awk should work too (even if your sequences contain spaces):

awk '{print /^>/ ? $1 : $0}' mybigfile

and just for fun, a pure bash version:

while IFS= read -r l ; do printf '%s\n' "${l%% *}" ; done < mybigfile

but a simple cut command would be faster:

cut -d " " -f 1 mybigfile
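To make the difference between these commands concrete, here is a small Python sketch that mimics `cut -d " " -f 1` on one hand and the awk one-liner on the other. The function names and the sample record (whose sequence line contains a space on purpose) are made up for illustration; this is not a replacement for the commands above.

```python
def trim_every_line(lines):
    """Mimic `cut -d " " -f 1`: keep the text before the first space on every line."""
    return [line.split(" ", 1)[0] for line in lines]


def trim_headers_only(lines):
    """Mimic the awk one-liner: shorten header lines, leave other lines alone."""
    return [line.split()[0] if line.startswith(">") else line for line in lines]


# Hypothetical record: the sequence line contains a space on purpose.
record = [">TSCS_00041 gene0EA_12345_rframe2_ORF", "MLAATHYY KFAIRRLF"]

print(trim_every_line(record))    # the sequence is truncated at the space
print(trim_headers_only(record))  # the sequence is left intact
```

If the sequence lines are guaranteed to be free of spaces, both approaches give the same result and the choice comes down to speed.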

ADD COMMENT • link 9.4 years ago by Frédéric Mahé ★ 3.2k

0

nice, can you explain the bash version?

ADD REPLY • link 11.2 years ago by brentp 24k

2

Sure, "${l%% *}" is a parameter expansion. It removes from the variable l the longest suffix matching the pattern " *", that is, the longest suffix that starts with a space. So, for every line in mybigfile, it removes everything from the first space onwards (if there is one) and writes the result to standard output.
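For readers more at home in Python, the longest-suffix behaviour of `%%` (as opposed to the shortest-suffix behaviour of a single `%`) can be sketched roughly as below. The function names and the three-field header are hypothetical, chosen only to show where each variant cuts.

```python
def strip_longest_suffix(line):
    """Like "${l%% *}": cut at the first space, if there is one."""
    return line.partition(" ")[0]


def strip_shortest_suffix(line):
    """Like "${l% *}": cut at the last space, if there is one."""
    head, sep, _ = line.rpartition(" ")
    return head if sep else line


# Hypothetical header with two spaces, to show the difference.
header = ">TSCS_00041 gene0EA_12345_rframe2_ORF extra_field"

print(strip_longest_suffix(header))   # cuts at the first space
print(strip_shortest_suffix(header))  # cuts at the last space
```

Lines without any space come back unchanged from both functions, which is why the bash loop leaves space-free sequence lines alone.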

ADD REPLY • link 11.2 years ago by Frédéric Mahé ★ 3.2k

0

Cheers.

ADD REPLY • link 11.2 years ago by brentp 24k

score 5 · Answer 2 · 2013-02-05

5

11.2 years ago

Kenosis ★ 1.3k

Try the following:

use strict;
use warnings;

while (<>) {
    s/>\S+\K.+//;
    print;
}

Results:

>TSCS_00041
MLAATHYYKFAIRRLFPLLKDTICASYSISIKHHENFMALSNMPKIWEDVEVDGNNMQWTRFQTTPVMPVYFIAAGVFNLSFITNWNTKLLYRKDILPYMTFAYNVAKNIAWFLSHIRKTKITNHI
>TSCS_00044
MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV

Usage: perl script.pl inFile >outFile

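The fasta inFile is read line by line. In the substitution, >\S+ matches the > and the non-whitespace characters up to the first whitespace, and \K tells Perl to keep everything matched so far, so only the rest of the header line (.+) is deleted. Lines that do not contain a > are printed unchanged. The >outFile notation redirects the output to the file outFile.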
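The same keep-the-identifier idea can be sketched in Python. Python's built-in `re` module has no `\K`, so this sketch uses a capture group instead and anchors the pattern at the start of the line; the names `HEADER` and `shorten` are made up for illustration.

```python
import re

# Anchored at the start of the line; group 1 holds '>' plus the identifier.
HEADER = re.compile(r"^(>\S+).*")


def shorten(line):
    """Return the line with everything after the identifier removed."""
    # '.' does not match a newline, so a trailing newline is preserved.
    return HEADER.sub(r"\1", line)


print(shorten(">TSCS_00041 gene0EA_12345_rframe2_ORF\n"), end="")
print(shorten("MTICASYSISIKHHENFMAIKHHENFMALSNMPKIWEDV\n"), end="")
```

Sequence lines never match the anchored pattern, so they pass through untouched.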

Hope this helps!

ADD COMMENT • link 11.2 years ago by Kenosis ★ 1.3k

6

the equivalent perl-one-liner:

  perl -pe 's/>\S+\K.+//' < file.fasta > new_file.fasta

ADD REPLY • link 11.2 years ago by JC 13k

2

Actually I am not sure why you always use -plane. -p and -n contradict each other, while -a enables autosplit, which causes overhead and is not necessary in this case. -l is useless here, too. The proper one should be perl -pe 's/^>(\S+).*/>$1/' or something similar. Sed also works.

ADD REPLY • link 11.2 years ago by lh3 33k

0

You (as usual) are right, I removed the unnecessary flags. Thanks.

ADD REPLY • link 11.2 years ago by JC 13k

score 3 · Answer 3 · 2013-02-05

3

11.2 years ago

KCC ★ 4.1k

Put the following code in a file called parse.py

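import sys

# Read the file line by line so the whole input never has to fit in memory.
with open(sys.argv[1]) as fasta:
    for line in fasta:
        if line.startswith(">"):
            # Header: keep only the first whitespace-delimited field.
            print(line.split()[0])
        else:
            # Sequence: strip surrounding whitespace; print() adds a newline back.
            print(line.strip())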

Then assuming your file is named "myfile.fa", you type:

python parse.py myfile.fa

ADD COMMENT • link 11.2 years ago by KCC ★ 4.1k

1

This is good!

You can also use a substitution with a capture group here:

import sys, re

for line in open(sys.argv[1]):
    # re.sub() returns the line unchanged when the pattern does not match
    print(re.sub(r'(>\S+)\s+.+', r'\1', line), end='')

Sorry, George, about deleting my earlier posting. I didn't know what I was doing...

ADD REPLY • link 11.2 years ago by Kenosis ★ 1.3k

0

if you remove "if line.startswith(">"):" from the code, the script will still work.

ADD REPLY • link 11.2 years ago by Geparada ★ 1.5k

1

Fair enough. That is a little more slick than I was going for. I would argue that the program as written is more robust, more explicit, and easier to tweak if you need something slightly different done later.

ADD REPLY • link 11.2 years ago by KCC ★ 4.1k

0

If you keep the "if line.startswith('>'):" test, you can remove the first strip() call. You do not need to remove the newline character (or any trailing spaces), since split() already discards them and you only output the first item.

ADD REPLY • link 11.2 years ago by Frédéric Mahé ★ 3.2k

0

Thanks. I modified it.

ADD REPLY • link 11.2 years ago by KCC ★ 4.1k

score 0 · Answer 4 · 2013-02-06

0

11.2 years ago

Martin A Hansen 3.0k

The Biopieces (www.biopieces.org) solution:

read_fasta -i in.fna | split_vals -k SEQ_NAME -d ' ' | rename_keys -k SEQ_NAME_0,SEQ_NAME | write_fasta -o out.fna -x

ADD COMMENT • link 11.2 years ago by Martin A Hansen 3.0k