headers from multifasta files
5.9 years ago

I have a folder of multi-FASTA files and I would like to extract the headers from each of them. I used the following shell command:

 grep -e ">" *.fasta > prueba_nc.txt

The output looks like this:

Adenoviridae_genomas.fasta:>AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
Adenoviridae_genomas.fasta:>AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
Adenoviridae_genomas.fasta:>AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

...

but I would like to keep only the part of each line after the ">".


I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.


Did you upvote just my answer? Please validate the other answers and provide feedback, if at all possible. Many thanks.


You are absolutely right Kevin, I have upvoted it ;-)


Okay, thanks. Did you look at the solutions posted by the others?


Hi Ulises, although this can be easily achieved as other people have already explained, if you are working with FASTA files you may be interested in SEDA (http://www.sing-group.org/seda/). Please take a look and feel free to contact us if you need assistance using it. Regards.

5.9 years ago

Proof of success:

cat and cut

cat prueba_nc.txt | cut -f2 -d'>'
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

cut

cut -f2 -d'>' prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.

AWK

awk '{print $2}' FS=">" prueba_nc.txt
AC_000001 [AC_000001] Ovine adenovirus A, complete genome.
AC_000002 [AC_000002] Bovine adenovirus B, complete genome.
AC_000003 [AC_000003] Canine adenovirus 1, complete genome.
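One caveat with the field-splitting commands in this answer: if a header itself contains a second ">", selecting only field 2 drops everything after it. A small sketch of the difference, using a made-up header line (the file name and header text here are hypothetical, not taken from the question):

```shell
# Hypothetical line in the same "filename:>header" layout as prueba_nc.txt,
# but with an extra '>' inside the header text.
line='file.fasta:>id1 protein A -> B'

# Field 2 only: stops at the second '>'.
printf '%s\n' "$line" | cut -d'>' -f2     # prints: id1 protein A -

# Field 2 to the end of the line: keeps the whole header.
printf '%s\n' "$line" | cut -d'>' -f2-    # prints: id1 protein A -> B
```

For headers like the ones shown in the question, which contain no further ">", both forms give the same result.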
5.9 years ago

So you want to get rid of the filename?

In that case, use the -h/--no-filename option of grep.

Or do you also want to get rid of the >? You could pipe the grep output to sed, e.g.:

grep -he ">" *.fasta | sed 's/^>//' > prueba_nc.txt
$ sed -n '/>/ s/>//p' *.fa

also works.
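Both commands above match ">" anywhere on a line. Since FASTA header lines start with ">", anchoring the pattern with "^" is a slightly stricter variant. The following is only a sketch on throwaway files; the temporary directory, file names, and records are invented for the demonstration:

```shell
# Hypothetical demo files, created in a temporary directory for illustration only.
demo_dir=$(mktemp -d)
printf '>seq1 first record\nACGT\n>seq2 second record\nGGCC\n' > "$demo_dir/a.fasta"
printf '>seq3 third record\nTTAA\n' > "$demo_dir/b.fasta"

# -h suppresses the file-name prefix; '^>' matches only lines starting with '>'.
# sed then strips that leading '>' from each header line.
grep -h '^>' "$demo_dir"/*.fasta | sed 's/^>//' > "$demo_dir/headers.txt"

cat "$demo_dir/headers.txt"
```

This should list the three headers without the file name or the leading ">".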

5.9 years ago
st.ph.n ★ 2.7k
for file in *.fasta; do grep -e '>' "$file" | cut -f 2 -d '>' > "$(basename "$file" .fasta).headers.txt"; done

This takes each file with a .fasta extension in your current working directory, greps for the headers, keeps everything after the '>', and writes the result to a file named after the original FASTA file, with .fasta replaced by the extension .headers.txt.
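As a variation on the same per-file idea, the extension can be stripped with shell parameter expansion instead of basename, which keeps the output next to the input file. This is a sketch on made-up files; the temporary directory and record names are assumptions for the demo:

```shell
# Hypothetical demo files, created in a temporary directory for illustration only.
demo_dir=$(mktemp -d)
printf '>AC_1 example one\nACGT\n' > "$demo_dir/x.fasta"
printf '>AC_2 example two\nGGCC\n>AC_3 example three\nTT\n' > "$demo_dir/y.fasta"

for file in "$demo_dir"/*.fasta; do
    # ${file%.fasta} removes the trailing .fasta but keeps the directory part,
    # so x.fasta produces x.headers.txt in the same directory.
    grep '^>' "$file" | sed 's/^>//' > "${file%.fasta}.headers.txt"
done

cat "$demo_dir/y.headers.txt"
```

Quoting "$file" keeps the loop working for file names that contain spaces.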

5.9 years ago

Using seqkit

seqkit fx2tab -in *.fa > headers.txt
$ seqkit seq -n test.fa

also works
