Determine Whether A Gene Product Is A Transcription Factor
13.7 years ago
Mike Dewar ★ 1.6k

I fear this question may be terribly basic. I've been asked for a list of transcription factors that are differentially expressed in an experiment. Finding differentially expressed genes is fine, but deciding which are TFs is proving a tad elusive.

I generated my list by looking up the GO biological process "regulation of transcription" for each gene using BioMart. If this phrase appeared anywhere in a gene's annotations I kept the gene, and if it didn't I filtered it out.

I got back an email saying "no way are some of these things TFs", which was frustrating. So now I'm looking for the GO molecular function "transcription factor activity" but suddenly have two questions:

  1. Is this any more likely to correspond to what a biologist is looking for as a TF?
  2. If a gene has a term that is more specific than "transcription factor activity", is there any way to see if its parent term is "transcription factor activity"?

If these questions, which I know are really basic, can be answered by a handy function in R like `is.TF(genesymbol)`, that would be awesome.
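For question 2, one approach is to follow the parent links of the GO graph upward from each annotated term and check whether "transcription factor activity" is reached. Below is a minimal Python sketch, assuming the parent links have already been loaded into a dictionary. GO:0003700 is the ID for "transcription factor activity" quoted elsewhere in this thread; the `TOY:` term IDs are made-up placeholders, and `ancestors` and `is_tf` are hypothetical helper names, not functions from an existing package.

```python
from collections import deque

TF_ACTIVITY = "GO:0003700"  # "transcription factor activity"

# Toy graph: child term -> set of parent terms.
# The "TOY:" IDs are placeholders for illustration, not real GO terms.
PARENTS = {
    "TOY:0000001": {TF_ACTIVITY},    # a more specific kind of TF activity
    "TOY:0000002": {"TOY:0000001"},  # more specific still
    "TOY:0000003": {"TOY:0000004"},  # an unrelated branch
    TF_ACTIVITY: {"TOY:0000005"},    # some more general parent
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_tf(gene_terms, parents, target=TF_ACTIVITY):
    """True if any annotated term is the target itself or a descendant of it."""
    return any(t == target or target in ancestors(t, parents) for t in gene_terms)
```

With a real ontology, the dictionary would be filled from the GO release you are using instead of the toy entries above.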


Mike: You may remove the beginner tag; IMHO this is not a beginner-level question. It is a real use case for an integrated bioinformatics data-mining approach.


I would change the title of this question to "Determine Whether A Gene is A Transcription Factor". Do you agree?


@giovanni: you are a moderator, so IMHO you can go ahead and make questions clearer. See the "Other people can edit my stuff?!" section of the SO FAQ: http://stackoverflow.com/faq


Hi Michael, thank you, but I was just asking for confirmation, since I wasn't sure I had understood the question correctly.


How did you produce your list? I've just looked up Stat1 in EnsEMBL, in biomart, at MGI and on the GO website. All of them have it annotated as a transcription factor.


@giovanni - thanks for the edit! Always good to have the question cleaned up!


@Keith - you're right. I think I'm going to remove the list as it's now distracting rather than clarifying

13.7 years ago

To improve your results, I would keep only the genes that are also associated with the term 'DNA binding activity', because transcription factors bind DNA by definition. A gene that is associated with "Regulation of transcription factor activity" is not necessarily a transcription factor itself: it may interact with other TFs and regulate their activity without being a TF.

The problem you have pointed out with STAT1 is due to the annotation in GeneOntology, and it is not something you can solve yourself. There is an old topic where we discussed how much one can trust GeneOntology's annotations. The annotations are good but incomplete: there are a lot of false negatives and some false positives. The only thing you can do when you find a gene that should be associated with a term but is not is to go to GO's bug tracker on SourceForge and report the case to the maintainers. They answer very quickly, and in one or two days (maybe more, since it is August) they will explain to you why STAT1 is not associated with that term.


Mouse Stat1 is annotated directly to GO:0003700 (transcription factor activity) at both MGI and EnsEMBL (the source of the biomart annotation). I don't think the problem is with the GO.


@Keith yep - my mistake!


@giovanni - as Keith has pointed out the thing about STAT1 is a mistake on my part, not GO's. Though I think your point is still valid! So are you saying that I should look for "DNA binding activity" AND "Regulation of transcription factor activity"?


It is always good to have a list of genes to use as controls or test cases for your analysis. You should identify some genes that you are sure are TFs and some that you are sure are not, and use them to evaluate the effectiveness of your pipeline. In any case, yes, I think you could try "DNA binding activity" AND "Regulation of transcription factor activity", and maybe mark as 'possible TF' the genes that are associated with only one of these terms.
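The control-gene idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names (`geneA` to `geneD`) and their annotations are invented, `classify` and `evaluate` are hypothetical helper names, and the two term labels are simply the ones used in this thread.

```python
DNA_BINDING = "DNA binding activity"
TF_REGULATION = "Regulation of transcription factor activity"


def classify(terms):
    """Label a gene from the set of GO term names attached to it."""
    has_dna = DNA_BINDING in terms
    has_reg = TF_REGULATION in terms
    if has_dna and has_reg:
        return "TF"
    if has_dna or has_reg:
        return "possible TF"
    return "not TF"


def evaluate(annotations, known_tfs, known_non_tfs):
    """Report how often control genes are labelled 'TF' by classify()."""
    true_pos = sum(classify(annotations.get(g, set())) == "TF" for g in known_tfs)
    false_pos = sum(classify(annotations.get(g, set())) == "TF" for g in known_non_tfs)
    return {
        "sensitivity": true_pos / len(known_tfs),
        "false_positive_rate": false_pos / len(known_non_tfs),
    }


# Invented example data: gene -> set of GO term names.
ANNOTATIONS = {
    "geneA": {DNA_BINDING, TF_REGULATION},
    "geneB": {DNA_BINDING},
    "geneC": {TF_REGULATION},
    "geneD": set(),
}

RESULT = evaluate(ANNOTATIONS, known_tfs=["geneA", "geneB"], known_non_tfs=["geneC", "geneD"])
```

Low sensitivity on the positive controls would suggest the filter is too strict; a high false positive rate on the negative controls would suggest it is too loose.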

13.7 years ago
brentp 24k

I would not rely on GO annotations. Take a look at, for example, this paper in Nature Reviews Genetics (2009), found with a quick Google search, which says:

"Further analysis using the GO database (Fig. 1b) showed that most human TFs are unannotated, indicating that they remain uncharacterized"

They also provide a list of semi-curated loci that encode TFs. I don't work with human data, but there may be other sources worth looking into.


Blimey. So my immediate problem isn't that there are lots of TFs that I don't catch because they're unannotated, it's picking those that are annotated from all non-TF genes. This lack of annotation will undoubtedly become important though....


Brent: Thanks a lot for sharing this interesting paper. Mike: You may have noticed that this article defines TFs based on InterPro domains. I strongly recommend getting a library of domains for mouse, running an hmmpfam/interproscan search on your sequences, and treating GO as an additional level of annotation.

13.7 years ago

I am afraid there is no single tool or method that can tell you whether a given gene is a TF. One option is to consult a database of transcription factors, for example DBD (mouse TFs are available here), or to use a learning algorithm that can predict it from sequence (I am not sure whether such an algorithm exists).

If the number of genes is small, I would recommend a two-step integrated search using Pfam domain architecture and GO, rather than relying on GO terms alone. First, I would look at the protein domain architecture of the hits. For example, for STAT1 (Wikipedia, Uniprot) I can get the Pfam page here. This protein contains distinct domains that belong to transcription factor domain families (STAT_int, STAT_alpha and STAT_bind). As Pfam domains are assigned based on sequence properties, these predictions are reliable. You may also consult the GeneRIFs of your genes (GeneRIF for STAT1), which are short pieces of gene-related information taken from the literature.

  1. Get the gene ID
  2. Map to Uniprot
  3. Search in Pfam
  4. Get the Pfam-based protein domain architecture
  5. Check whether any of the Pfam-A domains assigned to the sequence belongs to a family (or families) of transcription factors
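The final check in the steps above can be sketched in Python as follows. It assumes the ID mapping and Pfam lookup have already been done and their results stored in a dictionary, so no database is queried here. The TF domain list contains only the three STAT domains named in this answer and would need to be extended; the domain architectures shown, the names `geneX` and `Toy_domain`, and the function names are illustrative assumptions.

```python
# Domains treated as TF-associated; only the three named above, extend as needed.
TF_DOMAINS = {"STAT_int", "STAT_alpha", "STAT_bind"}

# Illustrative result of the mapping steps: gene -> list of Pfam-A domain names.
GENE_TO_DOMAINS = {
    "STAT1": ["STAT_int", "STAT_alpha", "STAT_bind"],
    "geneX": ["Toy_domain"],  # placeholder gene and domain, not real names
}


def tf_domain_hits(domains, tf_domains=TF_DOMAINS):
    """Return the TF-associated domains present in a domain architecture."""
    return sorted(set(domains) & set(tf_domains))


def flag_candidates(gene_to_domains, tf_domains=TF_DOMAINS):
    """Map each gene with at least one TF-associated domain to its hits."""
    flagged = {}
    for gene, domains in gene_to_domains.items():
        hits = tf_domain_hits(domains, tf_domains)
        if hits:
            flagged[gene] = hits
    return flagged


CANDIDATES = flag_candidates(GENE_TO_DOMAINS)
```

Keeping the matched domains, not just a yes/no flag, makes it easier to review each call by hand.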

As DBD was last updated in 2008, I would recommend this approach to get up-to-date results. Using Perl (or any language of your choice) you can automate the entire workflow via ID mapping.

Giovanni gave a good overview of why a gene product annotated with "Regulation of transcription factor activity" is not necessarily a TF. He also pointed out some of the current issues with GO, but IMHO, despite these drawbacks, GO is the best resource for understanding the function of a group of genes. The GO approach is much better than reading gene descriptions in manuscripts and deciding on the potential function of the genes from those.

13.7 years ago

Try Biobase TRANSFAC.

Their database is manually curated, meaning that there are people who read papers and enter information related to transcription factors.

Biobase is a company, so you probably need to pay for a subscription to access the information, but you should be able to find out whether you can get such a list before paying. I guess that if you write to them, they will tell you what to do and what it costs. Of course it is not you who has to pay ;)

Otherwise, ask the "biologist" how she/he would decide whether a gene is a transcription factor. If she/he realizes it is not possible (because the information is not out there), she/he will be happy with a second-best approximation. Otherwise she/he will point you to the right solution (or what she/he judges to be right).


I would have considered Biobase for such an analysis if they had a free, full academic version. AFAIK, their academic version is pretty old, which is a major limitation for this type of analysis. Functional annotation in biology changes quickly these days.


I second Khader's opinion. The free academic version of Transfac is too old to be of much value these days.

13.7 years ago

I would probably rely on a combination of two GO terms to extract a reasonably confident set of transcription factors, namely GO:0003677 "DNA binding" and GO:0006355 "regulation of transcription, DNA-dependent".

After doing back-tracking of the explicitly annotated terms in the GO directed acyclic graph (DAG), I find about 2500 genes annotated with each of the two terms and about 1700 annotated with both terms. That should be a pretty good starting point.
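The back-tracking and intersection described here might look roughly like the Python sketch below. The two GO IDs are the ones given in this answer; the `TOY:` child terms, the gene names, and the helper names `propagate` and `genes_with` are invented for illustration, and a real run would load the parent links and annotations from GO itself.

```python
DNA_BINDING = "GO:0003677"       # "DNA binding"
REG_TRANSCRIPTION = "GO:0006355" # "regulation of transcription, DNA-dependent"

# Toy graph: child term -> parent terms. "TOY:" IDs are placeholders.
PARENTS = {
    "TOY:0000010": {DNA_BINDING},
    "TOY:0000020": {REG_TRANSCRIPTION},
}

# Invented explicit annotations: gene -> directly annotated terms.
GENE_TO_TERMS = {
    "gene1": {"TOY:0000010", "TOY:0000020"},
    "gene2": {"TOY:0000010"},
    "gene3": {REG_TRANSCRIPTION},
}


def propagate(direct_terms, parents):
    """Return the direct terms plus all their ancestors in the graph."""
    closed = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def genes_with(term, gene_to_terms, parents):
    """Genes annotated with `term` directly or through a more specific term."""
    return {g for g, terms in gene_to_terms.items() if term in propagate(terms, parents)}


dna_binders = genes_with(DNA_BINDING, GENE_TO_TERMS, PARENTS)
regulators = genes_with(REG_TRANSCRIPTION, GENE_TO_TERMS, PARENTS)
likely_tfs = dna_binders & regulators  # annotated with both terms
```

Without the propagation step, a gene annotated only with a more specific child term would be missed by a lookup of the two parent IDs.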

If I wanted to improve it further, I would follow up with a domain analysis using SMART or Pfam. I would then compile a list of domains typically found in TFs, and use that to identify likely false positives on the list as well as possible false negatives in the rest of the genome. I am not sure it would be worth the extra effort, though.


That is precisely why I combine the two terms - what I'm saying is that something is a TF if it is both DNA binding AND involved in regulation of transcription. Neither term alone is sufficient.


IMHO, using the GO term "DNA binding" may not give exact results. DNA binding is not exclusive to TFs: it can include histones and several enzymes that may not play a role in transcription as such (ref. http://en.wikipedia.org/wiki/DNA-binding_protein). Please let me know if I am missing something.


Lars, Thanks for sharing your thoughts.

13.7 years ago
hurfdurf ▴ 490

Another reference that might be of use is ORegANNO. The databases are free and downloadable, with lots of cross-references. The last updates were in late 2009 according to their front page, though. I haven't checked the mailing list to see how active it is, but I believe their goal was to be an LGPL open version of TRANSFAC.


PAZAR integrates data from ORegANNO.

12.1 years ago

An important resource that seems to be missing from the answers so far is TFCat. It comes from the same group responsible for PAZAR. From their front page: "TFCat is a curated catalog of mouse and human transcription factors (TF) based on a reliable core collection of annotations obtained by expert review of the scientific literature." Using their Data Download tool you can quickly get a list of high-quality annotated TFs with corresponding PubMed IDs.

13.7 years ago
Gareth Palidwor ★ 1.6k

I've had good results using proteins annotated with GO:0003677 (DNA binding) and GO:0003700 (Transcription Factor activity). Looking at domains can be helpful but may not provide any additional information as GO annotations may be derived from the domains.


I would respectfully disagree with using the GO term "DNA binding" if Mike is only interested in TFs. DNA binding is not exclusive to TFs: it can include histones, several enzymes, etc.


That's why I used DNA binding and TF activity together.

13.7 years ago

link to PAZAR
