Statistical Methods

Probability theory and Population Distribution Curves:
Normal distribution:
- 1. continuous symmetrical distribution;
- 2. mean lies at highest point of curve (as do the mode
& median if it is not skewed);
- 3. shape of curve approx. bell-shaped;
- 4. y = [1/sqr.root(2.pi)] exp(-x^2/2) (the standard form, with mean 0 & variance 1)
- 5. it cannot be integrated in closed form (as is the case
for most distribution curves in statistics), thus there is no
simple formula for the probability of a random variable
lying between given limits. These probabilities (areas
under the curve) are obtained from tables.
- the mean (average):
- sum of all observations divided by the no. of
observations.
- the variance (variability):
- = sum of squares about the mean / degrees of
freedom
- = sum((xi - mean)^2) / (n-1)
- = (sum(xi^2) - (sum(xi))^2 / n) / (n-1)
- as the variance is not in the units of the variable being
observed (nor of the mean), the std.dev. is used.
- the standard deviation
- describes the dispersion or spread of data in
same units as the variable and the mean.
- = sqr.root(variance)
- If the data follows a "normal
distribution", then 95% of the group's
measurements are found within 2 st.dev. above and
below the mean.
- However, even if not normal distrib., at least
75% of all values will lie within 2 s.d. of mean,
& at least 88% will lie within 3 s.d. of the
mean.
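These summary statistics can be computed directly; a minimal Python sketch (the data values are hypothetical):

```python
from math import sqrt

def summary(xs):
    n = len(xs)
    mean = sum(xs) / n                                 # sum / no. of observations
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sum of squares about mean / d.f.
    return mean, var, sqrt(var)                        # std.dev. = sqr.root(variance)

mean, var, sd = summary([4.1, 5.0, 5.8, 6.2, 5.3])
print(mean, var, sd)
```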
Percentage points of the normal distribution:

    One-sided         Two-sided
    P (%)     x       P (%)     x
    50        0.00    -         -
    25        0.67    50        0.67
    10        1.28    -         -
    5         1.64    10        1.64
    2.5       1.96    5         1.96
    1         2.33    -         -
    0.5       2.58    1         2.58
    0.1       3.09    -         -
    0.05      3.29    0.1       3.29

    (P = tail-area probability in %; x = standardised normal deviate)
Binomial Distribution:
- is the distribution
followed by the number of successes in n independent
trials when the probability of one trial being a success
is p
- it is a discrete
distribution
- Prob(r successes) = [n! / (r!(n-r)!)] p^r (1-p)^(n-r)
(worked through in the sketch below)
- eg. if the probability of
surviving a disease is 0.90, & we have a sample of 20
patients, the number who survive will follow a binomial
distribution with p = 0.9, n = 20. The probability that all
survive (ie. r=20) is 0.12 (NB. 0! = 1).
- mean = np
- variance = np(1-p)
- for practical purposes,
this (and the Poisson distribution) generally approximates the
Normal distribution if BOTH np & n(1-p) are greater
than 5.
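A quick check of the binomial formula against the survival example above; a minimal Python sketch:

```python
from math import comb

def binom_prob(r, n, p):
    # Prob(r successes) = [n! / (r!(n-r)!)] p^r (1-p)^(n-r)
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# probability that all 20 patients survive when p = 0.9
print(round(binom_prob(20, 20, 0.9), 2))  # -> 0.12
```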
Poisson Distribution:
- like the binomial, it is a
discrete distribution
- for events that happen
randomly & independently in time at a constant
rate, the number of events which happen in a fixed
time interval follows the Poisson distribution
- mean = rate events happen
- variance = rate events
happen
- Prob(r events occurring in
unit time with rate m) = e^(-m) m^r /
r!, where e = 2.718...;
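A matching sketch for the Poisson probability (the rate m = 2.0 here is an arbitrary illustration, not from the notes):

```python
from math import exp, factorial

def poisson_prob(r, m):
    # Prob(r events in unit time with rate m) = e^(-m) m^r / r!
    return exp(-m) * m ** r / factorial(r)

# eg. probability of exactly 3 events when the rate is 2 per interval
print(round(poisson_prob(3, 2.0), 3))  # -> 0.18
```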
Chi-squared Distribution:
- chi-squared = sum(Ui^2),
where
- each Ui is a Standard
Normal variable with mean of 0 and variance of 1.
- sqr.root(chi-squared) is approx. Normal with:
- mean =
sqr.root(n - 1/2), where n is the degrees of freedom
- variance = ~1/2
Student's t Distribution:
Sampling:
When one samples a population to measure a parameter (eg.
height), one needs to determine:
- the mode:
- the most frequently occurring value
- the median:
- the central value of the distribution
- ie. the value below which 50% of the population lies =
the 50th percentile
- the mean (average) of the sample
- the variance (variability) of the sample
- the standard deviation of the sample
- the error range of the sample's mean in
estimating the population's mean:
- ie. if the sample's mean is 5.8, how does one estimate
the range of values within which the population's
mean will fall?
- this is the standard error of
the estimate of the population mean = standard
dev. of the sample mean
- the population mean will have:
- 95% probability of lying in range
of sample mean +/- 1.96 x
std.error
- this range is called the 95%
confidence interval
- 99% probability of lying in range
of sample mean +/- 2.58 x
std.error
- this range is called the 99%
confidence interval
- the standard error can be determined by:
- to determine for a normally
distributed population:
- eg. what is the mean value of a
parameter for the whole
population
- std.error = population std.dev. /
sq.root(sample size)
- mostly we do not know the
pop. std.dev. but its sample
estimate can be used.
- to determine for a proportion of
a population:
- eg. what is the proportion of the
population with a condition
- this is a binomial distribution
- std. error = sqr.root {p(1-p)/n},
- where p = the proportion
of individuals in
population & n =
sample size
- we can estimate this by
replacing p with r/n,
where r = the number of
individuals in the sample
with the condition,
but this normal
distribution approximation
holds ONLY if both np &
n(1-p) are greater than
5 (see the sketch below).
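The confidence-interval and standard-error calculations above, as a Python sketch (all figures hypothetical):

```python
from math import sqrt

def mean_ci(xs, z=1.96):              # use z = 2.58 for a 99% interval
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    se = sqrt(var / n)                # std.error of the sample mean
    return mean - z * se, mean + z * se

def proportion_se(r, n):
    # r = number in the sample with the condition, n = sample size
    p = r / n
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("normal approximation not valid")
    return sqrt(p * (1 - p) / n)

print(mean_ci([5.1, 5.8, 6.0, 5.5, 6.3, 5.9]))
print(proportion_se(30, 100))         # se of an observed proportion of 0.30
```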
Differences between groups:
Assessing size of differences:
What sample sizes do you need to
detect a significant difference?
- Power of a test:
- the probability that a test will produce a
significant difference at a given significance
level
- a good test has a power approaching 1.
- = 1 - P(x), where:
- P(x) is the area under the
normal curve to the left of x, and,
- x = 1.96 - (µ1 - µ2)/se,
where
- 1.96 is the two-sided
5% point of the normal
distribution, ie. 95% of
the area under the curve
lies between -1.96 &
+1.96, giving a
significance level of p =
0.05 (for p=0.01, the
figure to use is 2.58)
- µ1 & µ2
are the population means &,
- se is the standard error of the
difference of means
- = sqr.root {(s1^2/n1)+(s2^2/n2)},
where
- s = std.dev. of each
sample of size n
- Sample size for a comparison:
- just use the above equation for the power to
solve for n
- eg. if we wish to have a probability of 90% of
finding a difference (ie. power = 0.90) with a
significance level of p=0.05, and a pilot study
indicated that the std.dev. of the population is
40units and our hypothesis is that the difference
will be at least half a std.dev. (ie. 20units)
between the 2 groups, then we have:
- P(x) = 0.10 & from normal
distribution table, x = -1.28, thus,
- -1.28 = 1.96 - 20 / sqr.root{(40^2/n)+(40^2/n)}
- => round(n) = 84
patients needed in each
group (see the sketch below)
- NB. using a significance level of p=0.05 means
that 1 in 20 independent repetitions of the same study will
show a spuriously "significant" result!!
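The worked example can be reproduced by rearranging the power equation for n; a sketch using the figures from the example:

```python
from math import ceil

z_alpha = 1.96    # two-sided significance level p = 0.05
z_power = 1.28    # from P(x) = 0.10, ie. power = 0.90
sd = 40.0         # std.dev. from the pilot study
diff = 20.0       # hypothesised difference between the groups

# from -z_power = z_alpha - diff/se with se = sqr.root(2.sd^2/n),
# solving for n gives:
n = (z_alpha + z_power) ** 2 * 2 * sd ** 2 / diff ** 2
print(ceil(n))    # -> 84 patients needed in each group
```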
Statistical significance of differences:
- Considerations:
- Comparisons
- the no. of comparisons possible increase
as the no. of groups increase.
- ie. 2 groups = 1 comparison
(pair-wise);
- 3 groups = 3 comparisons;
- 4 groups = 6 comparisons;
- Thus, the more groups, the greater the
chance of finding a statistically
significant difference.
- Matching (Pairing)
- are the individuals in the control and
study groups paired to make the groups
more uniform.
- One-tailed or two-tailed:
- a one-tailed test is used if one is only
interested in results on one side of the
mean, whereas, two-tailed test is used if
both sides of the mean are relevant.
- Data type:
- interval scales (continuous)
- the interval or distance between
points on the scale has precise
meaning, & a change of one
unit at one scale point is the
same as a change of one unit at
another
- such scales are also ordinal
- eg. temperature, time
- ordinal scales
- limited no. of categories with an
inherent ordering
of categories from lowest to
highest.
- eg. much improved, mildly improved,
same, mildly worse, much worse
- ordered nominal scales
- grouping subjects into several
ordered categories
- eg. group subjects by an ordinal
scale value (eg.
"improved" vs
"worse", etc)
- nominal scales
- limited no. of categories but no
inherent ordering of the
categories.
- eg. eye colour
- dichotomous scales
- subjects are grouped into 2
categories (a special case of
nominal scales)
- eg. died vs survived
Specific tests for statistical significance of differences:
- Interval Data:
- > 50 in each sample:
- Normal Distribution for means:
- Std error of the
difference between 2 means:
- eg. mean PEFR for
children without
nocturnal cough vs with
nocturnal cough
- = sqr.root {(s1^2/n1)+(s2^2/n2)}
- the difference of the
means of each group then
has a 95% confidence
interval of being within
(mean1 - mean2)
+/- 1.96 x std.error
- if this confidence
interval DOES NOT include
zero then there IS a
difference between the 2
groups.
- Std error of the
difference between 2
proportions:
- eg. comparing the
proportions of 2 pop.ns
with a condition
- eg. people with PH
bronchiolitis vs no PH
bronchiolitis &
determine the proportion
in each group who have
asthma
- = sqr. root {(p1(1-p1)/n1)+(p2(1-p2)/n2)}
- where p = the
proportion of individuals
in population & n =
sample size
- we can estimate this
by replacing p with r/n,
where r = the number of
individuals in the sample
with the condition,
but this normal
distribution approximation
holds ONLY if both np &
n(1-p) are greater than
5.
- the difference of the
proportions of each group then
has a 95% confidence
interval of being within
(p1 - p2)
+/- 1.96 x std.error
- if this confidence
interval DOES NOT include
zero then there IS a
difference between the 2
groups.
- < 50 in each sample, with normal
distribution:
- t Distribution for means:
- < 50 in each sample, non-Normal
distribution:
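A sketch of the confidence-interval approach described above for large samples; the difference-of-means case is shown (all figures hypothetical), and the proportions case is identical using the proportion std.error:

```python
from math import sqrt

def diff_of_means_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    # std.error of the difference between the 2 means
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    d = m1 - m2
    return d - z * se, d + z * se

lo, hi = diff_of_means_ci(310.0, 48.0, 55, 294.0, 52.0, 60)
# if the interval does NOT include zero, there IS a difference
print(lo, hi, not (lo <= 0.0 <= hi))
```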
- Nominal Data:
- Chi-square
- the most commonly used test for
nominal data;
- Can be used for one comparison or
modified for more than one
comparison;
- Uses a crosstabulation table and
calculates the differences between
the observed and expected values in
each cell;
- Should NOT be used if expected
numbers are small, thus:
- cells with expected counts < 5
should be < 20% of cells
- min. expected count = 1.
- Fisher's exact test
- for use with small numbers, unmatched
and only 1 comparison (2 groups).
- McNemar's test
- a modification of the Chi-square for
use with matched samples and large
numbers.
- Sign test
- used when numbers are too small for
McNemar's.
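For nominal data, the Chi-square test on a crosstabulation table is a one-liner if scipy is available (table counts hypothetical):

```python
from scipy.stats import chi2_contingency

# 2x2 crosstabulation: rows = groups, columns = outcomes
observed = [[30, 70],
            [45, 55]]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
# check the small-numbers rules above before trusting the result:
# expected counts < 5 in < 20% of cells, minimum expected count >= 1
print(expected)
```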
- Ordinal Data:
- Nonparametric tests (with 1- & 2-way
modifications of all tests):
- Mann-Whitney U or Median test:
- two groups, unmatched
samples;
- Wilcoxon matched pairs signed
ranks test:
- two groups with matched
samples;
- Kruskal-Wallis 1-way analysis of
variance:
- more than 2 groups,
unmatched.
- Friedman 2-way analysis of
variance:
- more than 2 groups with
matched samples.
- Continuous Data:
- t-test:
- 2 groups, unmatched samples.
- Matched t-test
- 2 groups with matched samples.
- F test for analysis of variance:
- more than 2 groups, unmatched.
- F test for analysis of var. with blocking
or analysis of covariance:
- more than 2 groups with matched
samples.
- Approx. t-test:
- for samples > 30 with mean & s.d.
available, one can rapidly assess
the significance by seeing if the
mean of one group is outside the 95%
confidence limits of the other group
- 95% conf.limits = sample mean +/- 2 x
st.error of mean
- st.error of mean = st.dev. / sq.root
(sample size)
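The approximate t-test amounts to a simple comparison; a sketch with hypothetical group summaries:

```python
from math import sqrt

def approx_t_check(mean1, mean2, sd2, n2):
    # is the mean of group 1 outside the 95% confidence
    # limits of group 2?  (only for samples > 30)
    se2 = sd2 / sqrt(n2)                # st.error of the mean
    lo, hi = mean2 - 2 * se2, mean2 + 2 * se2
    return not (lo <= mean1 <= hi)

print(approx_t_check(108.0, 100.0, 14.0, 45))  # -> True (likely significant)
```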
Associations:
- Aspects:
- 1. What is the degree or strength of the
association?
- 2. Is the association found statistically
significant?
- 3. How much of the variation in the outcome in
the study and control groups is explained by the
association?
- Nominal data:
- A. Prospective or experimental data:
- 1. Relative risk:
- measures the strength of assoc.;
- Rel.Risk = (risk if factor
present) / (risk if absent)
- 2. Stat. significance of rel. risk:
- 3. Attributable risk:
- measures how much risk is
attributable to that factor ->
a measure of benefit in removing
that factor.
- Attrib.risk = [(risk with
factor)-(risk without factor)] /
(risk without factor)
- B. Retrospective or cross-sectional data:
- Because the researcher chooses a certain
number of subjects with and without a
disease, the numbers do not reflect
natural incidence, and thus absolute and
relative risks cannot be calculated
although approximations can be
calculated.
- 1. Odds ratio:
- an approx. of rel.risk.
- odds of the risk factor among those
with the disease =
(dis. & factor)/(dis. & no factor) = Y
- odds of the risk factor among those
without the disease =
(no dis. & factor)/(no dis. & no factor) = Z
- odds ratio = Y / Z (see the sketch below)
- 2. Approx. attrib. risk:
- also need to measure the
prevalence of risk factor in the
population;
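A sketch computing relative risk, attributable risk and the odds ratio from a 2x2 table (cell counts hypothetical):

```python
def risk_measures(a, b, c, d):
    # 2x2 table:                 disease    no disease
    #   factor present              a           b
    #   factor absent               c           d
    risk_with = a / (a + b)
    risk_without = c / (c + d)
    rel_risk = risk_with / risk_without
    attrib_risk = (risk_with - risk_without) / risk_without
    odds_ratio = (a / c) / (b / d)      # = Y / Z in the notes above
    return rel_risk, attrib_risk, odds_ratio

print(risk_measures(20, 80, 10, 90))    # -> (2.0, 1.0, 2.25)
```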
- C. Chi-square based measures of association:
- Each of these measures attempts to modify
Chi-sq. to minimise the influence of
sample size (N), and degrees of
freedom, as well as restrict the range of
values of the measure to between 0 and 1.
- Without such adjustments, comparisons of Chi-sq.
values from tables of varying dimensions
and sample sizes are meaningless.
- However, these measures are hard to
interpret, although, when properly
standardized, they can be used to compare
the strength of association in several
tables.
- Phi coefficient:
- For 2x2 tables; Phi need not lie
between 0 and 1 as Chi-sq.
may be greater than N. To obtain
a measure between 0 and 1,
Pearson suggested the use of C.
- Phi = sqr.root[ Chi-sq. / N ]
- Contingency coefficient (C):
- Value is always between 0 and 1,
but cannot generally attain the value
of 1. The max. value possible depends
on the no. of rows and columns.
- eg. 4x4 table, max. value
C = 0.87
- C = sqr.root[ Chi-sq. / (Chi-sq. + N) ]
- Cramer's V:
- This can attain a maximum value
of 1.
- V = sqr.root[ Chi-sq. / (N(k-1)) ],
- where k is the smaller
of the no. of rows &
the no. of columns.
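All three Chi-square based measures follow directly from the formulas above; a sketch (inputs hypothetical):

```python
from math import sqrt

def chi2_measures(chi2, n, n_rows, n_cols):
    phi = sqrt(chi2 / n)                 # Phi (2x2 tables)
    c = sqrt(chi2 / (chi2 + n))          # contingency coefficient
    k = min(n_rows, n_cols)
    v = sqrt(chi2 / (n * (k - 1)))       # Cramer's V
    return phi, c, v

print(chi2_measures(12.5, 200, 2, 2))
```

Note that for a 2x2 table k - 1 = 1, so V reduces to Phi.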
- D. Measures based on proportional reduction of
error (PRE):
- The meaning of the association is clearer
than Chi-sq. based measures.
- These measures are all essentially ratios
comparing a measure of error in predicting the
values of one variable from that
variable's distribution alone with the same
measure of error for predictions
based on knowledge of an additional
variable.
- Goodman & Kruskal Lambda:
- Lambda x 100 = % reduction in
error when that variable is used
to predict the outcome of the
dependent variable.
- Lambda always between 0 and 1.
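A sketch of Goodman & Kruskal's lambda for predicting the column (dependent) variable from the row variable (table counts hypothetical):

```python
def gk_lambda(table):
    # table: list of rows of cell counts; columns = dependent variable
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    e1 = n - max(col_totals)                 # errors predicting from the mode alone
    e2 = n - sum(max(row) for row in table)  # errors knowing the row variable
    return (e1 - e2) / e1                    # proportional reduction in error

print(gk_lambda([[40, 10],
                 [15, 35]]))                 # -> 0.444...
```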
- Ordinal & continuous data:
- The techniques are the same for all 4 types of
study designs: prospective, retrospective,
cross-sectional & experimental.
- The fundamental technique with these types of
data is known as correlation.
- Pearson's Correlation Techniques:
- Used if both measurements being correlated
consist of continuous data & a
linear relationship exists between the
variables being correlated.
- 1. Pearson's Correl. Coeff. (r) :
- the degree of association;
- range of r is -1 to +1, 0 = no
predictable change.
- 2. Stat. signif. of r:
- 3. Pearson's coeff. of determination
(r^2):
- measures the extent of
association;
- Nonparametric correlation:
- Used if the data from any of the
variables is ordinal or when a linear
relationship is not suspected.
- eg. consider 2 ordered variables
a, b, and 2 cases 1,2, taken from
the sample:
- values a1, a2, b1, b2
- if a1 & b1 are both
greater (or smaller) than
a2 & b2 respectively,
then the pair of cases is
called concordant.
- if a1 > a2 and b1 <
b2, then the pair of
cases is called discordant.
- if a1 = a2, and b1 not =
b2, then it is tied
on a but not
tied on b.
- if a1 = a2, and b1 = b2,
then they are tied on
both variables.
- if there is a preponderance of concordant
pairs, then the association is said
to be positive.
- if there is a preponderance of discordant
pairs, then the association is negative.
- if no. concordant pairs = no. discordant
pairs, then it is said there is no
association.
- A. Spearman's rho:
- B. Kendall's tau a, b & c:
- range -1 to +1 for tau-c;
- C. Goodman & Kruskal's gamma:
- = Pr(concordant) - Pr(discordant), assuming no ties;
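Both rank correlations are available in scipy, if installed; a sketch with arbitrary ordinal data:

```python
from scipy.stats import spearmanr, kendalltau

a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]

rho, p_rho = spearmanr(a, b)       # Spearman's rho & its significance
tau, p_tau = kendalltau(a, b)      # Kendall's tau-b & its significance
print(rho, p_rho, tau, p_tau)
```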
Regression:
- A series of techniques that are useful for studying the
association between one dependent variable and many
independent variables.
- These techniques are capable of measuring:
- the strength of an association,
- the statistical significance of the association,
and,
- the extent of the variation in the dependent
variable that can be explained by the independent
variable.
- Assumptions:
- 1. For any fixed value of an independent variable
X, the distribution of the dependent variable Y
is Normal, with mean u(Y|X)
(the mean of Y for a given X) and a constant
variance sigma^2. The subpopulations may have
different means, but the same variance.
- 2. The dependent variable values are
statistically independent of each other.
- 3. The mean values u(Y|X)
all lie on a straight line, which is the population
regression line.
- Single Independent Variable:
- Yi = b0 + b1.Xi
+ ei, where b0 = intercept,
b1 = slope,
- ei = error or disturbance
- = diff. b/n the observed Yi and
the subpop. mean at point Xi
- b0 and b1 are unknown
population parameters, and must be estimated from
the sample as B0 & B1
using least-squares methods (see the sketch below).
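A least-squares sketch for the single-variable model, including the t statistic for the slope described below (data hypothetical):

```python
from math import sqrt

def simple_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx                    # slope estimate B1
    b0 = my - b1 * mx                 # intercept estimate B0
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)   # residual variance
    t = b1 / sqrt(s2 / sxx)           # t = B1 / (std.error of B1), N-2 d.f.
    return b0, b1, t

print(simple_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))
```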
- Testing Hypotheses:
- That there is no linear relationship b/n X
& Y:
- ie. that the slope of the pop. regression
line = 0.
- t = B1 / (std.error of B1);
t should follow Student's t distrib. with
N-2 d.f.
- That the intercept = 0:
- t = B0 / (std.error of B0);
t should follow Student's t distrib. with
N-2 d.f.
- 95% Confidence interval of b1:
- 95% confidence means that, if repeated
samples are drawn from a population under
the same conditions, & 95% confidence
intervals are calculated, 95% of the
intervals will contain the unknown
parameter b1. Since the
parameter value is unknown, it is not
possible to determine whether or not a
particular interval contains it.
- Goodness of Fit:
- How well the model actually fits the
data.
- The R coefficient:
- R^2 is
sometimes called the coefficient
of determination.
- R = Pearson correl. coeff.
b/n predicted Y & observed Y
- R^2 = R x
R; Multiple R = sq.root(R^2).
- Adjusted R^2
is a correction to more closely
reflect goodness of fit.
- If all obs. fall on the regression
line, R^2 = 1;
- If there is no linear
relationship b/n the dependent
& independent variables, then
R^2 = 0.
- R^2 = 0 does not
necessarily mean that there is no
association, but that there is no
linear relationship.
- Analysis of Variance:
- To test the hypothesis of no
linear relationship b/n X &
Y, several equivalent statistics
can be computed.
- If a single independent variable:
- the hypothesis R^2(pop.)
= 0 is identical to the
hypothesis that the population
slope = 0;
- If the probability (signif.
of F) associated with
the F statistic is
small, the hypothesis
that R^2 = 0 is
rejected.
- Searching for violations of
assumptions:
- 1. Residuals:
- a residual is what is left after
the model is fit: the diff. b/n
an observed value & the value
predicted by the model.
- If the model is appropriate, the
observed residuals Ei,
which are estimates of the true
errors ei,
should have similar
characteristics, ie. a normal dist.
with mean of 0 and a constant
variance.
- 2. Linearity:
- 3. Equality of variance:
- 4. Independence of error:
- 5. Normality of residuals: