Processing FASTA files
5
0
Entering edit mode
6.7 years ago
s_bio ▴ 10

Hi All,

I have a file with lots of FASTA sequences that I downloaded from NCBI. I wish to confirm the number of sequences that are present in the file. For the same, I tried counting ">" for obvious reasons using following command

fgrep -o > sequence.fasta | wc -l

However, I am not getting the desired result. Can somebody suggest me the way to do so.

Secondly, though I wish to work with Human sequences, however, my file contains some other sequences of some other organisms as well. Can somebody share command where we can just retain sequences with title Human and delete the remaining.

Thanks a lot!

R genome sequence • 1.9k views
ADD COMMENT
1
Entering edit mode

fgrep -o > sequence.fasta | wc -l

This doesn't work because the > is a special character on the command-line; it redirects standard output into a file. You have to use single or double quotations as proposed below.

ADD REPLY
0
Entering edit mode

Thanks a lot Wouter,

Your answer solved my query. However, For my second query, I tried googling Filtering FASTA, but was unable to get the answer, hence I am re posting my query with example sequences.

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your post but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

If my answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

Welcome to biostars. Interesting guidelines for posting can be found in the following posts:

ADD REPLY
0
Entering edit mode

seqkit seq input.fa -n | wc -l or grep \> input.fa | wc -l

ADD REPLY
2
Entering edit mode
6.7 years ago

For filtering you can do:

$ pip install pyfaidx
$ faidx -g '*Human*' input.fasta > output.fasta
ADD COMMENT
0
Entering edit mode
6.7 years ago

I'm not sure why you tagged 'R', but I would solve your first problem using

grep -c '^>' sequence.fasta

However, I am not getting the desired result.

Please be more specific with regard to the result that you get, and how that's not desired.

For your second question, can you use the fasta identifiers for filtering? I don't know what they look like. But there are tons of solutions on biostars for that, so googling a bit should get you some options.

ADD COMMENT
0
Entering edit mode
6.7 years ago

To count:

$ awk '/^>/' in.fa | wc -l

This is equivalent to:

$ awk '{ if ($0~/^>/) { print $0; }}' in.fa | wc -l

In other words, if the line starts with a >, print that line to standard output. The wc -l command counts the number of lines that awk spits out.

ADD COMMENT
0
Entering edit mode
6.7 years ago

To filter FASTA based on a sequence header pattern, you can use awk:

$ awk 'BEGIN{ RS=">"; FS="\n"; }{ if ($1~/Human/) { print RS$0; } }' in.fa > out.fa

Replace Human with your desired regular expression pattern. This approach should scale well to large inputs, I'd think, and be agnostic to variations in FASTA input formats.

ADD COMMENT
0
Entering edit mode
6.7 years ago
s_bio ▴ 10

Dear All,

Thank you so much for valuable code chunks that has helped me solve my query.

Best! s_bio

ADD COMMENT

Login before adding your answer.

Traffic: 2671 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6