Question: Where can I download GO terms and their associated E. coli genes?

I'm trying to download a flat file that has the following info:

  1. GOTERM
  2. GOTERM DESCRIPTION
  3. GOTERM SET (biological process, molecular function, cellular component)
  4. GENE LIST, as either EcoCyc (e.g. EG10894), UniProt (e.g. P0A8V2), or Blattner (e.g. b3987) identifiers

Preferably this would be a flat file that I can download from a website, but I'm open to Python or R as well.

I have access to the EcoCyc flat files but I can't find anything about GO terms in there, although they are shown on the website.

Does anyone know where/how I can do this?

O.rka, 9 months ago (updated 9 months ago, EagleEye)

Thanks for this. GeneSCF looks good, but the formatting is extremely unusual. For example: GO:0000049tyajQ,tsaC,trmA,selB,truA,trmO,rlmN,dusC,tmcA,truB,arfA,thiI,rplP,epmA,lysS,lysU,tmolecular_function~ There seem to be multiple delimiters, like t and ,. I could write a parser for this, but I don't want to introduce errors without knowing all of the rules (this is only a single case). Is there a way to get this into a more consistent format that I could load into a dataframe?

O.rka, 9 months ago

Hi,

Glad that GeneSCF was helpful. The file downloaded by 'prepare_database' follows this format:

GOID1~GONAME1<TAB>Gene1,Gene2
GOID2~GONAME2<TAB>Gene1,Gene2
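
Since a dataframe-friendly layout was asked for, here is a minimal Python sketch of a parser for the format described above. It assumes the separator really is a tab character and that the first `~` divides the GO ID from its name; the function name `parse_go_lines` and the sample lines are made up for illustration and are not actual GeneSCF output.

```python
def parse_go_lines(lines):
    """Yield (go_id, go_name, gene) tuples from 'GOID~GONAME<TAB>Gene1,Gene2' lines."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if "\t" not in line:
            # No real tab, e.g. a file where the tab was written as a literal 't'.
            # Refuse to guess rather than split on an ambiguous character.
            raise ValueError(f"line {lineno}: no tab separator found")
        term, genes = line.split("\t", 1)
        go_id, _, go_name = term.partition("~")
        for gene in genes.split(","):
            gene = gene.strip()
            if gene:  # tolerate trailing commas
                yield go_id, go_name, gene


# Illustrative input only (hypothetical lines in the documented layout).
sample = [
    "GO:0000049~tRNA binding\tyajQ,tsaC,trmA\n",
    "GO:0003677~DNA binding\tgeneA,\n",
]
rows = list(parse_go_lines(sample))
for row in rows:
    print("\t".join(row))
```

Each output row holds one GO term and one gene, which is the "long" shape that loads directly into a dataframe. Raising an error on lines without a tab is a deliberate choice: a literal t cannot be told apart from a t inside a gene name.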

EagleEye, 9 months ago

There is a t instead of tab character on my download but I think it should be ok. Are GO characters always 7 digits?

O.rka, 9 months ago

Are GO characters always 7 digits? Yes: every GO ID is "GO:" followed by exactly seven digits, e.g. GO:0000049.
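
Because the ID has a fixed width, it can be validated, or peeled off the front of a line, with a regular expression. The following Python sketch shows one way to do that; the helper names `is_go_id` and `split_go_prefix` are hypothetical and not part of GeneSCF.

```python
import re

# "GO:" followed by exactly seven ASCII digits.
GO_ID = re.compile(r"GO:[0-9]{7}")


def is_go_id(text):
    """Return True if text is exactly one well-formed GO ID."""
    return GO_ID.fullmatch(text) is not None


def split_go_prefix(line):
    """Split a line into (go_id, rest); go_id is None if the line has no GO ID prefix."""
    match = GO_ID.match(line)
    if match is None:
        return None, line
    return match.group(0), line[match.end():]


print(is_go_id("GO:0000049"))
print(is_go_id("GO:49"))
print(split_go_prefix("GO:0000049tyajQ,tsaC"))
```

The last call shows that the fixed width recovers the ID even when no separator follows it, although what comes after the ID still has to be interpreted separately.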

There is a t instead of tab character on my download but I think it should be ok.

Warning: Make sure to check the system requirements before running GeneSCF. GeneSCF only works on Linux; it has been successfully tested on Ubuntu, Mint, CentOS and Windows 10 bash (version 1607 and above). Other Linux distributions might work as well.

I just downloaded a fresh copy of GeneSCF to check for the problem you mentioned, and I am not able to reproduce the error or the formatting issue. I am attaching a screenshot of sample results downloaded by GeneSCF for ecocyc.

./prepare_database -db=GO_all -org=ecocyc
Downloading GO database....
Extracting ecocyc information...
Updating gene information...
Do not panic. The processing is going on...
Database retreived..You are now ready to use geneSCF with organism ecocyc from --database GO
Done....Mon May 27 22:52:47 CEST 2019


EagleEye, 9 months ago

Thanks again, this is extremely helpful. I ran it on OSX, but I will run it again on my Linux machine when I get to the lab tomorrow.

O.rka, 9 months ago

Yes, that will solve the issue.

EagleEye, 9 months ago

A slightly roundabout way, but it works well, using R:

> library(tidyverse)
> library(GO.db)
> uniprot_id2go <-
+   read_tsv("https://www.uniprot.org/uniprot/?query=organism:83333&format=tab&columns=id,go-id") %>%
+   separate_rows(., `Gene ontology IDs`, sep = "; ") %>%
+   as.data.frame()
Parsed with column specification:
cols(
  Entry = col_character(),
  `Gene ontology IDs` = col_character()
)
> uniprot_id2go$Desc <- Term(uniprot_id2go$`Gene ontology IDs`)
> uniprot_id2go$Ontology <- Ontology(uniprot_id2go$`Gene ontology IDs`)
> str(uniprot_id2go)
'data.frame':   21436 obs. of  4 variables:
 $ Entry            : chr  "P07813" "P07813" "P07813" "P07813" ...
 $ Gene ontology IDs: chr  "GO:0002161" "GO:0004823" "GO:0005524" "GO:0005829" ...
 $ Desc             : chr  "aminoacyl-tRNA editing activity" "leucine-tRNA ligase activity" "ATP binding" "cytosol" ...
 $ Ontology         : chr  "MF" "MF" "MF" "CC" ...
> uniprot_id2go %>% filter(Entry == "P0A8V2")
   Entry Gene ontology IDs                                       Desc Ontology
1 P0A8V2        GO:0003677                                DNA binding       MF
2 P0A8V2        GO:0003899 DNA-directed 5'-3' RNA polymerase activity       MF
3 P0A8V2        GO:0005737                                  cytoplasm       CC
4 P0A8V2        GO:0005829                                    cytosol       CC
5 P0A8V2        GO:0006351               transcription, DNA-templated       BP
6 P0A8V2        GO:0016020                                   membrane       CC
7 P0A8V2        GO:0032549                     ribonucleoside binding       MF
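
For readers who prefer Python, the `separate_rows` step above can be approximated with the standard library alone. This is only a sketch: it parses a small inline table with the same two column names instead of downloading anything, the function name `separate_go_rows` is hypothetical, and the description and ontology lookups that GO.db provides in R are not reproduced.

```python
import csv
import io


def separate_go_rows(tsv_text, id_col="Entry", go_col="Gene ontology IDs"):
    """Return one (entry, go_id) pair per GO ID from a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = []
    for record in reader:
        # GO IDs arrive as a single "; "-separated cell; the cell may be empty.
        for go_id in (record[go_col] or "").split(";"):
            go_id = go_id.strip()
            if go_id:
                pairs.append((record[id_col], go_id))
    return pairs


# Illustrative subset of the table shown above, not a full download.
sample_tsv = (
    "Entry\tGene ontology IDs\n"
    "P0A8V2\tGO:0003677; GO:0003899; GO:0005737\n"
    "P07813\tGO:0002161; GO:0004823\n"
)
pairs = separate_go_rows(sample_tsv)
for entry, go_id in pairs:
    print(entry, go_id, sep="\t")
```

The resulting pairs have the same long layout as the first two columns of the R data frame, one row per protein and GO ID.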

Hope it helps.

SMK, 9 months ago
