Question

headers from multifasta files

0

5.9 years ago

ulises.rodriguez • 0

I have a folder with multifasta files and I would like to extract the headers from each one of them. I've used the following command in the shell:

 grep -e ">" *.fasta > prueba_nc.txt

The output looks like this:

Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
Adenoviridae_genomas.fasta:>AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
Adenoviridae_genomas.fasta:>AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

...

I would like to extract only the part after the ">".

sequence • 2.5k views

ADD COMMENT • link updated 5.9 years ago by lakhujanivijay 5.8k • written 5.9 years ago by ulises.rodriguez • 0

1

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k

1

Did you upvote just my answer? Please validate the other answers and provide feedback, if at all possible. Many thanks.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

You are absolutely right Kevin, I have upvoted it ;-)

ADD REPLY • link 5.9 years ago by Hugo ▴ 380

0

Okay, thanks. Did you look at the solutions posted by the others?

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Hi Ulises, although this can be easily achieved as other people have already explained, if you are working with FASTA files you may be interested in SEDA (http://www.sing-group.org/seda/). Please take a look and feel free to contact us if you need some assistance using it. Regards.

ADD REPLY • link 5.9 years ago by Hugo ▴ 380

score 2 · Answer 1 · 2018-05-14

Proof of success:

cat and cut

cat prueba_nc.txt | cut -f2 -d'>'
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

cut

   cut -f2 -d'>' prueba_nc.txt
   AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
   AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
   AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

AWK

awk '{print $2}' FS=">" prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.
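
One caveat with the three variants above: they keep only the second '>'-delimited field, so a header that itself contains a '>' would be cut short. A minimal sketch, assuming the same "file:>header" layout as prueba_nc.txt (the second sample line is invented for illustration), uses -f2- to keep everything after the first '>':

```shell
# Hypothetical sample in the same "file:>header" layout as prueba_nc.txt.
# The second line is made up to show a header containing an extra '>'.
workdir=$(mktemp -d)
printf '%s\n' \
  'Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.' \
  'demo.fasta:>seq1 promoter->gene region' > "$workdir/prueba_nc.txt"

# -f2- keeps field 2 and every field after it, so later '>' characters survive.
headers=$(cut -d'>' -f2- "$workdir/prueba_nc.txt")
printf '%s\n' "$headers"
```

With plain -f2 the second header would stop at "seq1 promoter-".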

score 2 · Answer 2 · 2018-05-14

2

5.9 years ago

WouterDeCoster 47k

So you want to get rid of the filename?

In that case, use the -h/--no-filename option of grep.

Or do you also want to get rid of the >? You could pipe the grep output to sed, e.g.:

grep -he ">" *.fasta | sed 's/^>//' > prueba_nc.txt
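
For a quick check, here is a minimal sketch on two made-up FASTA files (the file names and sequences are invented for illustration). Anchoring the pattern as '^>' is an addition that is not in the command above; it restricts the match to lines that start with '>':

```shell
# Two tiny, invented FASTA files in a scratch directory.
workdir=$(mktemp -d)
printf '>AC_000001 Ovine adenovirus A\nACGT\n' > "$workdir/a.fasta"
printf '>AC_000002 Bovine adenovirus B\nTTGA\n' > "$workdir/b.fasta"

# -h suppresses the "filename:" prefix; sed strips the leading '>'.
grep -h '^>' "$workdir"/*.fasta | sed 's/^>//' > "$workdir/prueba_nc.txt"
cat "$workdir/prueba_nc.txt"
```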

ADD COMMENT • link 5.9 years ago by WouterDeCoster 47k

0

$ sed -n '/>/ s/>//p' *.fa

also works.

ADD REPLY • link 5.9 years ago by cpad0112 21k

score 2 · Answer 3 · 2018-05-14

for file in *.fasta; do grep -e '>' "$file" | cut -f 2 -d '>' > "$(basename "$file" .fasta).headers.txt"; done

This takes each file with a .fasta extension in your current working directory, greps for the header lines, keeps everything after the '>', and writes the result to a file named after the original FASTA file, with the extension .headers.txt in place of .fasta.
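
A minimal sketch of the same per-file idea on two invented FASTA files (names and sequences are made up). It uses the shell's ${file%.fasta} expansion instead of basename, which is an alternative and not what the command above uses:

```shell
# Two tiny, invented FASTA files in a scratch directory.
workdir=$(mktemp -d)
printf '>AC_000001 Ovine adenovirus A\nACGT\n' > "$workdir/Adenoviridae_genomas.fasta"
printf '>NC_000001 Example virus\nTTGA\n' > "$workdir/other.fasta"

for file in "$workdir"/*.fasta; do
  # ${file%.fasta} drops the extension but keeps the directory part,
  # so each .headers.txt file lands next to its source file.
  grep '^>' "$file" | cut -d'>' -f2- > "${file%.fasta}.headers.txt"
done
ls "$workdir"
```

Unlike basename, the expansion keeps the directory, so the output is written beside the input rather than into the current directory.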

score 1 · Answer 4 · 2018-05-15

1

5.9 years ago

lakhujanivijay 5.8k

Using seqkit

seqkit fx2tab -in *.fa > headers.txt

ADD COMMENT • link 5.9 years ago by lakhujanivijay 5.8k

0

$ seqkit seq -n test.fa

also works

ADD REPLY • link 5.9 years ago by cpad0112 21k