How to retrieve single protein fasta file for multiple species?
6.2 years ago
arsilan324 ▴ 90

Hi all,

We are trying to build a protein database covering multiple organisms, e.g. E. coli, T. ferrooxidans, B. subtilis, etc. We want to use it to match our Orbitrap output, restricted to the species we found through Illumina sequencing, which span approximately 400+ genera. Can you suggest a smart way of doing this? For example, could I provide the names of the organisms and retrieve a single FASTA file?

Thank you very much!


You can use @5heikki's script here.

Concatenating the individual protein FASTA files into one large file afterwards should be a simple task.

Note: See the new answer/comment below.


Running this code didn't generate any FASTA file, although both the species list (species.txt) and assembly_summary.txt are in the same folder. Am I missing something?

6.2 years ago
GenoMax 141k

Try this if you need RefSeq (modified version of @5heikki's code):

$ more species.txt 
Bifidobacterium adolescentis

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary_refseq.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

Otherwise, for GenBank:

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

This method matches every assembly whose organism name starts with the species name, so you will get many strains. If you need very specific strains, you could run awk -F'\t' '{print $8,$9,$10}' assembly_summary.txt > species and take only those that you need.
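If one assembly per species is enough, the strain list can be narrowed before downloading. The following is a minimal sketch, not part of the original answer: the function name `list_protein_urls` is hypothetical, and it assumes the column layout given in the header of the assembly summary file (column 5 = refseq_category, column 8 = organism_name, column 11 = version_status, column 20 = ftp_path). It only prints URLs, so the result can be inspected before anything is fetched.

```shell
# list_protein_urls SPECIES_FILE SUMMARY_FILE   (hypothetical helper)
# Prints one protein FASTA URL per matching assembly; downloads nothing.
# Assumed columns: 5 refseq_category, 8 organism_name,
#                  11 version_status, 20 ftp_path.
list_protein_urls() {
    awk -F'\t' '
        # First file: remember each non-empty species name.
        NR == FNR { if ($0 != "") { wanted[++n] = $0 }; next }
        # Second file: skip header comments and rows without a path.
        /^#/ { next }
        $20 == "na" { next }
        # Keep only the latest reference or representative assemblies.
        $11 != "latest" { next }
        $5 != "reference genome" && $5 != "representative genome" { next }
        {
            for (i = 1; i <= n; i++) {
                # Literal prefix match on the organism name.
                if (index($8, wanted[i]) == 1) {
                    k = split($20, part, "/")
                    print $20 "/" part[k] "_protein.faa.gz"
                    break
                }
            }
        }
    ' "$1" "$2"
}
```

Usage could look like list_protein_urls species.txt assembly_summary_refseq.txt > urls.txt, followed by wget -i urls.txt once the list looks right. Species with no reference or representative assembly will produce no URL, so dropping the refseq_category test may be necessary for those.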


Thanks!! This worked perfectly. I now have a list of files such as GCF_000164035.1_ASM16403v1_protein.faa.gz, and the next step is to combine them. Can you guide me there as well? Thanks a lot!!! :)


If you want the final data file uncompressed: zcat G*.gz > final.faa
If you want to keep the final data compressed: cat G*.gz > final.faa.gz
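Both commands work because a gzip stream may consist of several concatenated members. As a quick sanity check after combining, it can help to count the FASTA headers in the result. A minimal sketch, with the hypothetical function name `combine_faa`; it uses gzip -dc instead of zcat only because zcat behaves differently on some systems:

```shell
# combine_faa OUT FILE.gz...   (hypothetical helper)
# Decompresses all inputs into OUT and prints the number of sequences,
# counted as lines that start with ">".
combine_faa() {
    out=$1
    shift
    gzip -dc "$@" > "$out" || return 1
    grep -c '^>' "$out"
}
```

For example, combine_faa final.faa G*.gz writes final.faa and prints how many protein records it contains, which should equal the sum over the individual files.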


I have prepared another list, of archaea this time, but this command is not working. Is there a separate assembly summary for archaea?


Post examples of names that are not working.


Here are two examples:
1. Halodesulfurarchaeum formicicum
2. Methanosphaera cuniculi

The whole list can be seen here:

https://gold.jgi.doe.gov/organisms?Organism.Domain=ARCHAEAL&Organism.Type%20Strain=Yes&Organism.Active=Yes


First one should work: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/886/955/GCF_001886955.1_ASM188695v1/GCF_001886955.1_ASM188695v1_protein.faa.gz

The second one does not have a RefSeq genome. You may have to try the second option (the GenBank genomes), although these sometimes contain only genomic sequence. https://www.ncbi.nlm.nih.gov/protein/?term=txid1077256[Organism:noexp]
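With a list of 400+ names, it may be easier to find the ones without any match up front rather than noticing missing files afterwards. A minimal sketch under the same column assumption as the commands above (column 8 holds the organism name); `report_missing` is a hypothetical name:

```shell
# report_missing SPECIES_FILE SUMMARY_FILE   (hypothetical helper)
# Prints every species name that matches no organism name (column 8)
# in the assembly summary, using a literal prefix match.
report_missing() {
    awk -F'\t' '
        # First file: remember each non-empty species name.
        NR == FNR { if ($0 != "") { wanted[++n] = $0 }; next }
        /^#/ { next }
        {
            for (i = 1; i <= n; i++)
                if (index($8, wanted[i]) == 1) hits[i]++
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(i in hits)) print wanted[i]
        }
    ' "$1" "$2"
}
```

Running report_missing species.txt assembly_summary_refseq.txt lists the names to retry against the GenBank summary or to check for spelling differences.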

6.2 years ago

If you are working with UniProt, you can retrieve the data programmatically as described here (with code examples):
https://www.uniprot.org/help/api_downloading
https://www.uniprot.org/help/api_queries
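As a rough illustration of what such a programmatic download could look like, here is a sketch that builds a FASTA download URL for one NCBI taxonomy ID. The endpoint (rest.uniprot.org/uniprotkb/stream) and the organism_id query field are assumptions not taken from this thread, so check them against the help pages above; `uniprot_fasta_url` is a hypothetical name.

```shell
# uniprot_fasta_url TAXID   (hypothetical helper)
# Prints a URL that should stream all UniProtKB entries for the given
# NCBI taxonomy ID as FASTA. Endpoint and query field are assumptions.
uniprot_fasta_url() {
    printf 'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:%s\n' "$1"
}

# Download example (needs network access); 83333 is E. coli K-12:
# curl -fsS "$(uniprot_fasta_url 83333)" > ecoli_k12.faa
```

Looping this over a file of taxonomy IDs and appending each result to one file would give a single combined FASTA, similar to the NCBI approach above.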
