Question

Processing FASTA files

0

Entering edit mode

6.7 years ago

s_bio ▴ 10

Hi All,

I have a file with lots of FASTA sequences that I downloaded from NCBI. I wish to confirm the number of sequences that are present in the file. For the same, I tried counting ">" for obvious reasons using following command

fgrep -o > sequence.fasta | wc -l

However, I am not getting the desired result. Can somebody suggest me the way to do so.

Secondly, though I wish to work with Human sequences, however, my file contains some other sequences of some other organisms as well. Can somebody share command where we can just retain sequences with title Human and delete the remaining.

Thanks a lot!

R genome sequence • 1.9k views

ADD COMMENT • link 6.7 years ago by s_bio ▴ 10

1

Entering edit mode

fgrep -o > sequence.fasta | wc -l

This doesn't work because the > is a special character on the command-line; it redirects standard output into a file. You have to use single or double quotations as proposed below.

ADD REPLY • link 6.7 years ago by novice ★ 1.1k

0

Entering edit mode

Thanks a lot Wouter,

Your answer solved my query. However, For my second query, I tried googling Filtering FASTA, but was unable to get the answer, hence I am re posting my query with example sequences.

ADD REPLY • link 6.7 years ago by s_bio ▴ 10

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If my answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

Welcome to biostars. Interesting guidelines for posting can be found in the following posts:

ADD REPLY • link 6.7 years ago by WouterDeCoster 47k

0

Entering edit mode

seqkit seq input.fa -n | wc -l or grep \> input.fa | wc -l

ADD REPLY • link 6.7 years ago by cpad0112 21k

score 2 · Answer 1 · 2017-07-25

2

Entering edit mode

6.7 years ago

Matt Shirley 10k

For filtering you can do:

$ pip install pyfaidx
$ faidx -g '*Human*' input.fasta > output.fasta

ADD COMMENT • link 6.7 years ago by Matt Shirley 10k

score 0 · Answer 2 · 2017-07-25

I'm not sure why you tagged 'R', but I would solve your first problem using

grep -c '^>' sequence.fasta

However, I am not getting the desired result.

Please be more specific with regard to the result that you get, and how that's not desired.

For your second question, can you use the fasta identifiers for filtering? I don't know what they look like. But there are tons of solutions on biostars for that, so googling a bit should get you some options.

score 0 · Answer 3 · 2017-07-25

0

Entering edit mode

6.7 years ago

Alex Reynolds 35k

To count:

$ awk '/^>/' in.fa | wc -l

This is equivalent to:

$ awk '{ if ($0~/^>/) { print $0; }}' in.fa | wc -l

In other words, if the line starts with a >, print that line to standard output. The wc -l command counts the number of lines that awk spits out.

ADD COMMENT • link 6.7 years ago by Alex Reynolds 35k

score 0 · Answer 4 · 2017-07-25

To filter FASTA based on a sequence header pattern, you can use awk:

$ awk 'BEGIN{ RS=">"; FS="\n"; }{ if ($1~/Human/) { print RS$0; } }' in.fa > out.fa

Replace Human with your desired regular expression pattern. This approach should scale well to large inputs, I'd think, and be agnostic to variations in FASTA input formats.

score 0 · Answer 5 · 2017-07-26

0

Entering edit mode

6.7 years ago

s_bio ▴ 10

Dear All,

Thank you so much for valuable code chunks that has helped me solve my query.

Best! s_bio

ADD COMMENT • link 6.7 years ago by s_bio ▴ 10