Question

Searching and filtering uniclust databases

4.9 years ago

max_19 ▴ 170

Hi there,

Does anyone have experience with searching or filtering uniclust databases: http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/

For example, how would I search for a particular organism, or keep only eukaryotes (in the uniclust30 database)?

I tried doing this with the supplied mapping file (uniclust_uniprot_mapping.tsv.gz), which has a UniProt accession and a uniclust ID for each protein. However, I'm not sure how to use that ID to search the actual uniclust database, or how to filter for particular organisms.

Thanks for your help and ideas!

uniclust protein databases • 1.7k views

updated 4.9 years ago by AK ★ 2.2k • written 4.9 years ago by max_19 ▴ 170

score 2 · Answer 1 · 2019-05-27

Hi max_19,

I think you can first get the list of UniProt IDs for the organisms of interest and then search for those IDs in the headers of uniclust30_2018_08/uniclust30_2018_08_consensus.fasta. A header looks like this (the Members field contains the information you need):

uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Twk-43 (Fragment)|TWiK family of potassium channels protein 9|Twk-9|Protein CBR-TWK-9|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5,H3EBY7,A0A061AD18,A0A182E8X8,A0A2A2JAF3,A0A0B2UVL7,A0A2A6CBY6,A0A016U7K0,A8Y2T1,A0A1I8AN73,A0A2A2LWD0,A0A0D6LSX8,A0A0C2GMZ3,A0A1I8AAQ9,E3N9Z7,A0A0C2CU15,A0A2P4W1B0,A0A016U896,A0A2P4W1B3,A0A0B1TTC8,A0A016U8H3,H3F3P7,Q23435,A0A2K6W7A5,A0A2H2IN74,A0A0R3S4C4

For instance:

# From https://www.uniprot.org/taxonomy/2759 we know that the "Taxon identifier" is 2759 for Eukaryota
# Here we take the first 10 as an example
curl -s "https://www.uniprot.org/uniprot/?query=taxonomy:2759&format=tab&columns=id" \
  | grep -v '^Entry' \
  | head \
  > eukaryota_head10.txt

# Get the whole list of IDs from uniclust30
seqkit fx2tab --name uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_IDs.txt

# Search for whole-word, fixed-string matches of the desired IDs (here the IDs from Eukaryota) and extract the matching cluster IDs
grep -w -F -f eukaryota_head10.txt uniclust30_2018_08_consensus_IDs.txt \
  | cut -d" " -f1 \
  | sort -u \
  > uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt

# Subset uniclust30 using the list
seqkit grep --delete-matched -f uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_eukaryota_head10.fasta
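
The grep step above matches an accession anywhere in the header line. If you would rather match only against the Members field, something like the following might work. It is a rough sketch, not tested on the full database: the function names `members_table` and `clusters_with_members` are made up for illustration, the toy headers are invented, and it assumes every header carries a comma-separated `Members=` field without spaces, as in the example header above.

```shell
# Hypothetical helper: read Uniclust-style header lines on stdin and print
# one "cluster_id<TAB>member_accession" pair per line.
# The cluster ID is the first whitespace-delimited field, which is what
# seqkit grep matches against by default.
members_table() {
  awk '{
    id = $1
    sub(/^>/, "", id)                          # tolerate raw FASTA header lines
    if (match($0, /Members=[^ \t]+/)) {
      # "Members=" is 8 characters long; keep only the accession list
      n = split(substr($0, RSTART + 8, RLENGTH - 8), acc, ",")
      for (i = 1; i <= n; i++) print id "\t" acc[i]
    }
  }'
}

# Hypothetical helper: print each cluster ID (once) that has at least one
# member listed in the accession file given as $1 (one accession per line).
clusters_with_members() {
  members_table | awk -F'\t' '
    NR == FNR { want[$1]; next }               # first input: wanted accessions
    ($2 in want) && !seen[$1]++ { print $1 }   # second input: id/member pairs
  ' "$1" -
}

# Toy demonstration with invented headers: the second header mentions Q23435
# only in its description, so it is not reported.
wanted=$(mktemp)
printf '%s\n' 'Q23435' 'P99999' > "$wanted"
printf '%s\n' \
  'uc30-toy-1|Representative=A0A0D6LSX8 n=2 Descriptions=[Ion channel] Members=A0A0D6LSX8,Q23435' \
  'uc30-toy-2|Representative=X00001 n=1 Descriptions=[Mentions Q23435 only here] Members=X00001' \
  | clusters_with_members "$wanted"
rm -f "$wanted"
```

On the real data, `clusters_with_members eukaryota_head10.txt < uniclust30_2018_08_consensus_IDs.txt` should produce a list that can be passed to `seqkit grep -f` in the same way as in the last step.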

Hope it helps.