My goal is to identify genes encoding transmembrane proteins that have at least one domain extending into the extracellular matrix. My approach is to use the COMPARTMENTS database. I downloaded the knowledgebase from COMPARTMENTS; it has the following format:
ensembl_peptide_id  hgnc_symbol  GO          GO_Type               Source     Evidence_Code
ENSP00000000233     ARF5         GO:0005576  Extracellular region  UniProtKB  IDA
ENSP00000000442     ESRRA        GO:0005576  Extracellular region  HPA        IDA
ENSP00000001008     FKBP4        GO:0005576  Extracellular region  UniProtKB  IDA
ENSP00000002125     NDUFAF7      GO:0005576  Extracellular region  UniProtKB  IDA
ENSP00000002165     FUCA2        GO:0005576  Extracellular region  UniProtKB  IDA
ENSP00000002829     SEMA3F       GO:0005576  Extracellular region  ProtInc    TAS
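One practical wrinkle with this layout is that the GO_Type value can itself contain a space ("Extracellular region"), so a naive whitespace split miscounts the columns. Below is a minimal Python sketch of one way to read such rows, assuming the Source and Evidence_Code values never contain spaces; `parse_row` is a hypothetical helper name, not part of any COMPARTMENTS tooling. If the downloaded file turns out to be tab-delimited, splitting on tabs would be simpler and more robust.

```python
def parse_row(line):
    """Split one whitespace-separated row into the six columns shown above.

    Assumption: the first three and the last two columns contain no spaces,
    so everything in between belongs to GO_Type.
    """
    tokens = line.split()
    if len(tokens) < 6:
        raise ValueError(f"expected at least 6 tokens, got {len(tokens)}: {line!r}")
    return {
        "ensembl_peptide_id": tokens[0],
        "hgnc_symbol": tokens[1],
        "GO": tokens[2],
        "GO_Type": " ".join(tokens[3:-2]),  # re-join multi-word compartment names
        "Source": tokens[-2],
        "Evidence_Code": tokens[-1],
    }


row = parse_row("ENSP00000000233 ARF5 GO:0005576 Extracellular region UniProtKB IDA")
print(row["GO_Type"])  # Extracellular region
```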
My approach is simple: filter the list to rows whose GO_Type is Plasma membrane, Cell surface, Extracellular region, or Extracellular matrix (these are just a few of many possibilities). Then filter by score >= 3, or score >= 4 to be more stringent. A score greater than 4 means the annotation is curated; the lower the score, the lower the confidence. However, this approach seems too simplistic to me. I was also thinking of passing the resulting gene list to a domain finder. I tried the web API of SMART, but its output is not very data-mining friendly.
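The two-step filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `score` key is assumed to be available per record (the excerpt shown earlier has no score column), `filter_records` and `WANTED` are hypothetical names, and the records are made-up placeholders rather than real annotations.

```python
# Compartments of interest, taken from the list in the text (compared case-insensitively).
WANTED = {"plasma membrane", "cell surface", "extracellular region", "extracellular matrix"}


def filter_records(records, min_score=3):
    """Return sorted gene symbols whose GO_Type is wanted and whose score >= min_score."""
    return sorted({
        r["hgnc_symbol"]
        for r in records
        if r["GO_Type"].lower() in WANTED and r["score"] >= min_score
    })


# Made-up records for illustration only; gene names and scores are not real data.
records = [
    {"hgnc_symbol": "GENE_A", "GO_Type": "Plasma membrane", "score": 5},
    {"hgnc_symbol": "GENE_B", "GO_Type": "Extracellular region", "score": 3},
    {"hgnc_symbol": "GENE_C", "GO_Type": "Nucleus", "score": 5},
    {"hgnc_symbol": "GENE_D", "GO_Type": "Cell surface", "score": 2},
]

print(filter_records(records))               # ['GENE_A', 'GENE_B']
print(filter_records(records, min_score=4))  # ['GENE_A']
```

Note that a filter like this selects on localization labels only; it does not check whether a protein is actually transmembrane or which part of it faces outward.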
Is there a better tool or approach for identifying genes with domains in the extracellular matrix, ideally with some confidence value attached?
Any thoughts would be much appreciated.