Introduction
Probability theory and Population Distribution Curves:
Normal distribution
1. continuous symmetrical distribution;
2. mean lies at the highest point of the curve, as do the mode & median, since the curve is symmetrical (not skewed);
3. shape of curve approx. bell-shaped;
4. $y = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$ (standard form, with mean 0 & standard deviation 1)
5. its integral has no simple closed form (as is the case for most distribution curves in statistics), thus there is no simple formula for the probability of a random variable lying between given limits. These probabilities (areas under the curve) are obtained from tables or computed numerically.
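As a rough illustration of points 4 and 5, the sketch below evaluates the standard Normal curve and computes the area between two limits numerically instead of looking it up in a table. It assumes the standard form (mean 0, standard deviation 1); the function names are made up for this example, and the area uses the error function `math.erf` from Python's standard library.

```python
import math

def normal_pdf(x):
    """Height of the standard Normal curve at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    """Area under the standard Normal curve to the left of x."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def prob_between(lo, hi):
    """Probability that a standard Normal variable lies between lo and hi."""
    return normal_cdf(hi) - normal_cdf(lo)

print(round(normal_pdf(0), 4))              # height of the curve at the mean
print(round(prob_between(-1.96, 1.96), 3))  # central area, close to 0.95
```

For a Normal variable with another mean and standard deviation, one would first convert the limits to standard units, (limit - mean) / st. dev.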
Percentage points of the normal distribution:
| One-sided P (%) | x | Two-sided P (%) | x |
| --- | --- | --- | --- |
| 50 | 0.00 | | |
| 25 | 0.67 | 50 | 0.67 |
| 10 | 1.28 | | |
| 5 | 1.64 | 10 | 1.64 |
| 2.5 | 1.96 | 5 | 1.96 |
| 1 | 2.33 | | |
| 0.5 | 2.58 | 1 | 2.58 |
| 0.1 | 3.09 | | |
| 0.05 | 3.29 | 0.1 | 3.29 |
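The table above can be reproduced, at least approximately, with the inverse of the Normal cumulative distribution. This is only a sketch using `statistics.NormalDist` from Python's standard library; the helper names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def one_sided_point(percent):
    """x such that `percent` % of the area lies above x."""
    return std_normal.inv_cdf(1 - percent / 100)

def two_sided_point(percent):
    """x such that `percent` % of the area lies outside (-x, +x)."""
    return std_normal.inv_cdf(1 - percent / 200)

# Print the one-sided column of the table, rounded to 2 decimal places.
for p in (50, 25, 10, 5, 2.5, 1, 0.5, 0.1, 0.05):
    print(p, round(one_sided_point(p), 2))
```

The two-sided point for P is the one-sided point for P/2, which is why each x value appears in both halves of the table.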
Binomial Distribution
is the distribution followed by the number of successes in n independent trials when the probability of one trial being a success is p
it is a discrete distribution
Prob(r successes) = $\frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$
eg. if the probability of surviving a disease is 0.90, & we have a sample of 20 patients, the number who survive follows a binomial distribution with p = 0.9, n = 20. The probability that all survive (ie. r = 20) is $0.9^{20} \approx 0.12$. (NB. 0! = 1)
mean = np
variance = np(1-p)
for practical purposes, the binomial distribution is well approximated by the Normal distribution if BOTH np & n(1-p) are greater than 5 (the Poisson distribution is likewise approximately Normal when its mean is large).
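The survival example can be checked with a few lines of code. This is a minimal sketch, with a hypothetical helper name, that evaluates the binomial formula using `math.comb` and then applies the mean, variance, and rule-of-thumb statements above to n = 20, p = 0.9.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 20, 0.9
p_all_survive = binomial_pmf(20, n, p)     # same as 0.9 ** 20
mean = n * p
variance = n * p * (1 - p)
normal_ok = n * p > 5 and n * (1 - p) > 5  # rule of thumb from the text

print(round(p_all_survive, 2), mean, round(variance, 2), normal_ok)
```

In this example n(1-p) is only 2, so by the rule of thumb the Normal approximation would not be relied on.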
Poisson Distribution:
like the binomial, it is a discrete distribution
for events that happen randomly & independently in time at a constant rate, the number of events that happen in a fixed time interval follows the Poisson distribution
mean = m, the rate at which events happen (expected number per interval)
variance = m, the same as the mean
Prob(r events occurring in unit time with rate m) = $e^{-m} m^r / r!$, where e ≈ 2.718;
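A short numerical check of the Poisson statements above, assuming a hypothetical rate of 2 events per unit time. The sums are truncated at r = 49, where the remaining probability is negligible for this rate.

```python
import math

def poisson_pmf(r, m):
    """Probability of exactly r events in unit time when the rate is m."""
    return math.exp(-m) * m**r / math.factorial(r)

m = 2.0  # hypothetical rate: 2 events per unit time
probs = [poisson_pmf(r, m) for r in range(50)]

# Mean and variance computed directly from the probabilities.
mean = sum(r * pr for r, pr in enumerate(probs))
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(probs))

print(round(probs[0], 3), round(mean, 3), round(variance, 3))
```

Both the computed mean and the computed variance come out equal to the rate m, as stated above.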
Chi-squared Distribution:
Student's t Distribution:
Sampling
When one samples a population according to a parameter (eg. height), one needs to determine:
the mode:
the median:
the mean (average) of the sample
the variance (variability) of the sample
the standard deviation of the sample
the error range of the sample's mean in estimating the population's mean:
ie. if the sample's mean is 5.8, how does one estimate the range of values within which the population's mean will fall?
this is given by the standard error of the estimate of the population mean = standard dev. of the sample mean over repeated samples
the standard error can be determined by:
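The quantities listed in this section can be computed as in the sketch below. The data are invented (chosen so that the sample mean is 5.8), and the standard error uses the usual formula, sample standard deviation divided by the square root of the sample size, which is an assumption here since the notes do not spell the formula out.

```python
import math
import statistics

heights = [5.6, 5.9, 5.8, 6.1, 5.8, 5.5, 5.9, 5.8]  # hypothetical sample

mode = statistics.mode(heights)
median = statistics.median(heights)
mean = statistics.mean(heights)
variance = statistics.variance(heights)  # sample variance, divisor n - 1
std_dev = statistics.stdev(heights)
std_error = std_dev / math.sqrt(len(heights))  # assumed formula: s / sqrt(n)

# Approximate 95% range for the population mean, using the Normal value 1.96.
# For a sample this small, a Student's t value would give a wider range.
low, high = mean - 1.96 * std_error, mean + 1.96 * std_error

print(mode, median, round(mean, 2), round(std_error, 3))
print(round(low, 2), round(high, 2))
```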
Differences between groups:
Assessing size of differences:
Mean (average)
arithmetic mean = sum of all values divided by number of values
geometric mean of two values = sqrt(value1 x value2)
harmonic mean of two values = 2 x (value1 x value2) / (value1 + value2)
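The three means can be compared on a pair of made-up values. The sketch applies the formulas above directly; Python's `statistics` module also offers `geometric_mean` and `harmonic_mean`, which should agree with them.

```python
import math
import statistics

a, b = 4.0, 9.0  # hypothetical pair of values

arithmetic = (a + b) / 2
geometric = math.sqrt(a * b)
harmonic = 2 * (a * b) / (a + b)

print(arithmetic, geometric, round(harmonic, 3))

# Cross-check against the standard library versions.
print(statistics.geometric_mean([a, b]), statistics.harmonic_mean([a, b]))
```

For positive values the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean.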
What size sample populations do you need to get a significant difference?
Statistical significance of differences:
Specific tests for statistical significance of differences
Interval Data:
> 50 in each sample:
< 50 in each sample, with Normal distribution:
< 50 in each sample, non-Normal distribution:
Nominal Data:
Chi-square
the most commonly used test for nominal data;
Can be used for one comparison or modified for more than one comparison;
Uses a crosstabulation table and calculates the differences between the observed and expected values in each cell;
Should NOT be used if expected numbers are small, thus:
Fisher's exact test
McNemar's test
Sign test
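Returning to the chi-square test described above, the calculation on a crosstabulation table might look like the sketch below. The counts are invented, no continuity correction is applied, and the cut-off of 5 for "small" expected numbers is a common rule of thumb assumed here, not a figure taken from these notes.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for a crosstab."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count in each cell = row total x column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected

observed = [[30, 20], [20, 30]]  # hypothetical counts
statistic, dof, expected = chi_square(observed)

# Assumed rule of thumb: flag the test as unreliable if any expected count < 5.
small_expected = any(e < 5 for row in expected for e in row)

print(statistic, dof, small_expected)
```

The statistic is then compared with the chi-squared distribution for the given degrees of freedom.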
Ordinal Data:
Continuous Data:
Associations:
Regression:
A series of techniques that are useful for studying the association between one dependent variable and one or more independent variables.
These techniques are capable of measuring:
the strength of an association,
the statistical significance of the association, and,
the extent of the variation in the dependent variable that can be explained by the independent variable.
Assumptions:
1. For any fixed value of an independent variable X, the distribution of the dependent variable Y is Normal, with mean $\mu_{Y|X}$ (the mean of Y for a given X) and a constant variance $\sigma^2$. The distributions at different X values may have different means, but share the same variance.
2. The dependent variable values are statistically independent of each other.
3. The mean values $\mu_{Y|X}$ all lie on a straight line, which is the population regression line.
Single Independent Variable:
$Y_i = b_0 + b_1 X_i + e_i$, where $b_0$ = intercept, $b_1$ = slope, $e_i$ = error or disturbance
$b_0$ and $b_1$ are unknown population parameters, and must be estimated by the sample values $B_0$ & $B_1$ using the least-squares method.
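A minimal sketch of the least-squares estimates for a single independent variable, using the standard closed-form expressions (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of X). The data and the function name are made up for illustration.

```python
def least_squares(xs, ys):
    """Return (B0, B1), the least-squares intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]             # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
B0, B1 = least_squares(xs, ys)

print(round(B0, 2), round(B1, 2))
```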
Testing Hypotheses:
That there is no linear relationship between X & Y:
ie. that the slope of the pop. regression line = 0.
$t = B_1 / \mathrm{SE}(B_1)$, where SE(B1) is the standard error (estimated st. dev.) of B1; if the slope is 0, t follows Student's t distribution with N-2 d.f.
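The test of zero slope could be carried out along the following lines. This is a sketch on invented data: the standard error of the slope is taken as sqrt(residual mean square / Sxx), and the critical value 3.182 is the two-sided 5% point of Student's t with 3 d.f., read from standard tables.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (slope, standard error of slope, t, degrees of freedom)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares about the fitted line.
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, se_slope, slope / se_slope, n - 2

xs = [1, 2, 3, 4, 5]             # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, se_slope, t_value, dof = slope_t_statistic(xs, ys)

T_CRIT_3_DF = 3.182  # two-sided 5% point, Student's t with 3 d.f. (from tables)
reject_zero_slope = abs(t_value) > T_CRIT_3_DF

print(round(t_value, 1), dof, reject_zero_slope)
```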
That the intercept = 0:
95% Confidence interval for b1:
95% confidence means that, if repeated samples are drawn from a population under the same conditions, & 95% confidence intervals are calculated, 95% of the intervals will contain the unknown parameter b1. Since the parameter value is unknown, it is not possible to determine whether or not a particular interval contains it.
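The repeated-sampling interpretation can be illustrated by simulation. In the sketch below the "population" slope is known because it is chosen by us (all parameter values are hypothetical), so we can count how often the calculated interval contains it. The interval is taken as the estimated slope plus or minus t x standard error, with t = 2.306, the two-sided 5% point of Student's t with 8 d.f. from standard tables.

```python
import math
import random

T_CRIT_8_DF = 2.306  # two-sided 5% point, Student's t with 8 d.f. (from tables)

def slope_interval(xs, ys, t_crit):
    """Return (low, high), a confidence interval for the regression slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope - t_crit * se, slope + t_crit * se

def coverage(true_b0=1.0, true_b1=2.0, sigma=1.0, repeats=2000, seed=0):
    """Fraction of simulated samples whose interval contains the true slope."""
    rng = random.Random(seed)
    xs = list(range(10))  # N = 10 points, so N - 2 = 8 d.f.
    hits = 0
    for _ in range(repeats):
        ys = [true_b0 + true_b1 * x + rng.gauss(0, sigma) for x in xs]
        low, high = slope_interval(xs, ys, T_CRIT_8_DF)
        hits += low <= true_b1 <= high
    return hits / repeats

print(coverage())  # expected to be close to 0.95
```

In real data the true slope is not known, so only the long-run proportion, not the status of any single interval, can be stated.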
Goodness of Fit:
Searching for violations of assumptions: