Question

Statistical Analysis Of Protein Sequence Properties

2

13.4 years ago

User 0063 ▴ 240

Hi all,

I have 80 homologous sequences. All the sequences have a structural domain. This domain in same cases have a different length. I'd like to perform a statistical analysis to find correlations between domain length, sequence length, polar amino acid percentage, basic amino acid percentage, and hydrophobic amino acid percentage.

Which statistical test could I use? Could you suggest how to perform this analysis?

Thanks in advance

statistics sequence analysis protein domain • 4.6k views

updated 13.4 years ago by Alastair Kerr 5.3k • written 13.4 years ago by User 0063 ▴ 240

1

"This domain in same cases have a different length" makes no sense. Do you mean "some cases"?

13.4 years ago by Neilfws 49k

0

Please change the title of this question. A title should let other people understand what you are asking without having to open and read the whole question. I can change the title for you, but I would prefer that you do it yourself.

13.4 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-12-10

Simple things to get you started:

Start by plotting the data to look for outliers and trends. Most of us would recommend R, the statistical computing environment. To get started with R (and statistics in general), I suggest Introductory Statistics with R by Peter Dalgaard. See also this thread.
As for specific tests, consider Spearman rank correlation to test for a relationship between two variables. In R, look at the cor.test() function. Alternatively, consider fitting a linear model and running an ANOVA on it. E.g. in R:
```
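# Scatter plot to check for outliers and trends
plot(seqlen, pa.percent)
# Spearman rank correlation between the two variables
cor.test(seqlen, pa.percent, method = "spearman")
# ANOVA table for a linear model of one variable on the other
anova(lm(seqlen ~ pa.percent))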
```

Dalgaard's book will help you interpret the results of these tests, though a proper grounding in the fundamentals of statistics is more important than the particular tool you use.
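
The same checks can be run over all five variables at once. The following is only a sketch: the data frame `props` and its column names are made up for illustration and filled with random numbers, and the Holm adjustment for multiple comparisons is an addition that the answer above does not mention.

```r
# Synthetic placeholder data: 80 proteins, random values, purely illustrative
set.seed(1)
n <- 80
props <- data.frame(
  seqlen      = round(runif(n, 200, 600)),
  domlen      = round(runif(n, 50, 150)),
  polar       = runif(n, 20, 40),
  basic       = runif(n, 5, 20),
  hydrophobic = runif(n, 30, 55)
)

# Scatter-plot matrix: one panel per pair of variables
pairs(props)

# Spearman rank correlation for every pair of columns
rho <- cor(props, method = "spearman")
print(round(rho, 2))

# Test each of the 10 pairs and collect the raw p-values
# (exact = FALSE avoids warnings when the data contain ties)
vars <- combn(names(props), 2)
pvals <- apply(vars, 2, function(v) {
  cor.test(props[[v[1]]], props[[v[2]]],
           method = "spearman", exact = FALSE)$p.value
})
names(pvals) <- paste(vars[1, ], vars[2, ], sep = " ~ ")

# Adjust for the number of comparisons (Holm's method)
padj <- p.adjust(pvals, method = "holm")
print(round(padj, 3))
```

With real data, `props` would be read from a file instead of simulated, and the scatter-plot matrix is worth inspecting before trusting any of the p-values.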

Ram · Answer 2 · 2010-12-10

As Alastair says, this is a multivariate problem (80 observations x at least 5 variables) and you need to do some exploratory data analysis first.

I'll assume that you are able to calculate the parameters that you described and output a simple data file in e.g. CSV format.
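
For readers who have not got that far, here is a minimal base R sketch of the calculation. The residue classes, the function name `class_percent`, the toy sequences and the output file name are all assumptions made for illustration; definitions of "polar" and "hydrophobic" vary between sources, and domain length is left out because it needs domain boundaries from an annotation tool.

```r
# Assumed residue classes; definitions vary, so adjust them to your own scheme
polar       <- c("S", "T", "N", "Q", "Y", "C")
basic       <- c("K", "R", "H")
hydrophobic <- c("A", "V", "L", "I", "M", "F", "W", "P", "G")

# Percentage of residues in `seq` (one-letter codes) that belong to `aa_set`
class_percent <- function(seq, aa_set) {
  residues <- strsplit(toupper(seq), "")[[1]]
  100 * mean(residues %in% aa_set)
}

# Toy sequences for illustration only, not real proteins
seqs <- c(p1 = "MKTAYIAKQR", p2 = "GSHMLLVVKK")

# One row per sequence, one column per property
props <- data.frame(
  id          = names(seqs),
  seqlen      = unname(nchar(seqs)),
  polar       = unname(sapply(seqs, class_percent, aa_set = polar)),
  basic       = unname(sapply(seqs, class_percent, aa_set = basic)),
  hydrophobic = unname(sapply(seqs, class_percent, aa_set = hydrophobic))
)
print(props)

# Write the table as CSV; the file name is a placeholder
out_file <- file.path(tempdir(), "protein_properties.csv")
write.csv(props, out_file, row.names = FALSE)
```

The seqinR package offers ready-made helpers for reading FASTA files, so the toy `seqs` vector would normally be replaced by sequences loaded from disk.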

Principal components analysis is a good starting point; it will tell you which variables contribute most to the observed variance. Using R, you'd simply read the CSV file into a data frame and use either prcomp() or princomp(). You could then draw, for example, a biplot and see how the observations cluster. In R, I'd also recommend the seqinR package, which contains many functions for sequence analysis.
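
A minimal sketch of that workflow, assuming a table with one row per protein and one numeric column per property. The column names and the file name in the comment are hypothetical, and the data here are simulated so that the example runs on its own.

```r
# Synthetic placeholder data standing in for the real CSV; values are random
set.seed(42)
n <- 80
props <- data.frame(
  seqlen      = round(runif(n, 200, 600)),
  polar       = runif(n, 20, 40),
  basic       = runif(n, 5, 20),
  hydrophobic = runif(n, 30, 55)
)
# Make domain length loosely track sequence length so there is some structure
props$domlen <- round(0.25 * props$seqlen + rnorm(n, sd = 10))

# With real data this step would be something like:
# props <- read.csv("protein_properties.csv")   # hypothetical file name

# Centre and scale each variable, since lengths and percentages differ in units
pca <- prcomp(props, center = TRUE, scale. = TRUE)

print(summary(pca))   # proportion of variance explained by each component
print(pca$rotation)   # loadings: weight of each variable in each component

# Observations and variable loadings plotted on the first two components
biplot(pca)
```

Scaling is a deliberate choice here: without it, sequence length would dominate the first component simply because its numeric range is the largest.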

From there, you'll need to develop some hypotheses that you can test. Would you expect certain factors to correlate, given what you know about protein properties, and why?

If any of this is not familiar to you - and I suspect from the question that it is not - you should seek advice from a statistician and/or teach yourself some basic statistics. Blindly applying methods that you don't understand is not the way to go.

Neilfws · Answer 3 · 2010-12-10

3

13.4 years ago

Alastair Kerr 5.3k

Given that you have multiple values per gene, Principal Component Analysis [PCA] or correspondence analysis would be a good bet. Just comparing domain length to each of the other values runs the risk of finding a secondary correlation that results from a primary trend among your other variables rather than from domain length itself.
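
The risk can be illustrated with simulated data. The sketch below uses a partial correlation computed from regression residuals, which is a different technique from the PCA suggested above and is shown only to make the point concrete; all variable names and numbers are invented.

```r
# Simulated data: sequence length drives both other variables,
# which have no direct link to each other
set.seed(7)
n <- 80
seqlen      <- runif(n, 200, 600)
domlen      <- 0.25 * seqlen + rnorm(n, sd = 10)
hydrophobic <- 30 + 0.03 * seqlen + rnorm(n, sd = 2)

# Naive pairwise correlation between domain length and hydrophobic percentage
raw <- cor(domlen, hydrophobic)

# Partial correlation: remove the sequence-length trend from both variables,
# then correlate what is left
res_dom <- resid(lm(domlen ~ seqlen))
res_hyd <- resid(lm(hydrophobic ~ seqlen))
partial <- cor(res_dom, res_hyd)

print(c(raw = raw, partial = partial))
```

In this setup the raw correlation reflects the shared dependence on sequence length, so it shrinks once that trend is removed.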

updated 13.4 years ago by Neilfws 49k • written 13.4 years ago by Alastair Kerr 5.3k