Statistical methods for calculating confidence intervals, tolerance intervals, and capability analysis to reduce out-of-specification situations.
This article is the third in a four-part series on essential statistical techniques for any scientist or engineer working in the biotechnology field. This installment deals with statistical methods for calculating confidence intervals, tolerance intervals, and capability analysis. The difference between the confidence interval and tolerance interval is explained.
Steven Walfish
In the last installment, the concept of hypothesis testing was presented. In hypothesis testing, we must state the assumed value of the population parameter. In constructing intervals, on the other hand, we determine the range with a certain confidence for the true population parameter. This confidence level is usually set at 95%, though you might also see intervals that are 99% or 90%. A confidence interval is used to estimate the population mean or standard deviation. A tolerance interval is used to estimate the distribution of the individual values in the population. This distinction will become more apparent in the discussion of capability analysis.
The number of standard errors (s/√n) that the observed mean lies from the hypothesized population mean can be determined by hypothesis testing. The conclusion is either that the mean is statistically different or that there is insufficient evidence to reject the null hypothesis. A similar analysis can be performed using confidence intervals. If the hypothesized mean falls within the confidence interval, the sample mean is not statistically different from the hypothesized mean. We will concentrate on the confidence interval when the population variance is unknown. In that case, the t-distribution, which takes into account the uncertainty in estimating the population variance from the sample, is used. The t-distribution is tabled by confidence level and degrees of freedom. The degrees of freedom are the number of observations used to estimate the sample standard deviation minus one. The formula for the confidence interval is as follows:

$$\bar{X} \pm t_{1-\alpha/2;\,n-1}\,\frac{s}{\sqrt{n}}$$
in which $\bar{X}$ is the sample mean, $t_{1-\alpha/2;\,n-1}$ is the t-value from the t-table with a confidence level of 1-α and n-1 degrees of freedom, s is the sample standard deviation, and n is the sample size used to estimate the mean and standard deviation. Using the protein concentration data from Part 2 of this article series (BioPharm International, June 2008), a confidence interval can be calculated. Table 1 shows the data and calculations for the confidence interval. If the theoretical concentration is thought to be 30, the 95% confidence interval contains 30; therefore, the mean of 31.70 is not statistically different from 30.
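As an illustration, the interval can be sketched in Python from summary statistics alone. This is a minimal sketch, not the article's own calculation: the function name is hypothetical, n = 6 is inferred from the 5 degrees of freedom mentioned with Table 1, s = 3.07 is the standard deviation quoted later in the article, and 2.571 is the tabled two-sided 95% t-value for 5 degrees of freedom. Because the inputs are rounded, the limits may differ from Table 1 in the last digit.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Summary statistics quoted in the article (rounded values).
# Assumptions: n = 6 (inferred from 5 degrees of freedom) and
# t_crit = 2.571 (tabled t-value, 95% two-sided, 5 degrees of freedom).
lower, upper = t_confidence_interval(mean=31.70, sd=3.07, n=6, t_crit=2.571)

# The hypothesized value is "not statistically different" if it lies inside.
contains_30 = lower <= 30 <= upper
print(f"95% CI: {lower:.2f} to {upper:.2f}; contains 30: {contains_30}")
```

Passing the t-value in as an argument mirrors the table lookup described above; statistical software would compute it from the confidence level and degrees of freedom instead.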
Another way to look at the relationship between a confidence interval and hypothesis testing is to test whether the observed mean of 31.70 is statistically different from 28.47 (the lower confidence limit). With the hypothesized mean $\mu_0$ set to 28.47, the formula for the t-test would be:

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{31.70 - 28.47}{s/\sqrt{n}}$$
which gives the same value as the tabled t-value for 95% confidence and 5 degrees of freedom (t0.95;5 in Table 1).
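The equivalence can be checked numerically with a short sketch. The function name is hypothetical, and the inputs are the rounded values quoted in the article (s = 3.07, with n = 6 inferred from the 5 degrees of freedom), so the result should agree with the tabled value of about 2.571 only up to rounding.

```python
import math

def t_statistic(sample_mean, hypothesized_mean, sd, n):
    """Number of standard errors between the sample and hypothesized means."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Test the observed mean against the lower 95% confidence limit.
# Assumptions: sd = 3.07 and n = 6, taken from the rounded summary values.
t_at_lower_limit = t_statistic(31.70, 28.47, sd=3.07, n=6)
print(f"t at the lower confidence limit: {t_at_lower_limit:.3f}")
```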
Table 1. Data and calculations for the confidence interval
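The confidence interval for the two-sample case is an interval for the difference between the two means; if it contains zero, the means are not statistically different. The approach is similar to the one used in the above formula. Some textbooks use a pooled standard deviation (sp) when the population standard deviations are unknown but assumed to be equal, and the sample sizes are small (under 30). With subscripts 1 and 2 denoting the two samples:

$$(\bar{X}_1 - \bar{X}_2) \pm t_{1-\alpha/2;\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$$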
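A minimal Python sketch of the pooled two-sample interval follows, using the standard textbook form (difference of means plus or minus t times sp times the square root of 1/n1 + 1/n2, with n1 + n2 - 2 degrees of freedom). The function names and the example numbers are hypothetical and are not taken from the article's data; 2.228 is the tabled two-sided 95% t-value for 10 degrees of freedom.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, assuming equal population variances."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)

def two_sample_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Return (lower, upper) for the difference mean1 - mean2."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    half_width = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff - half_width, diff + half_width

# Hypothetical summary statistics for two lots of six units each.
# t_crit = 2.228 is the tabled value for 95% two-sided, 6 + 6 - 2 = 10 df.
lower, upper = two_sample_ci(31.7, 3.0, 6, 29.5, 3.4, 6, t_crit=2.228)

# If the interval contains zero, the means are not statistically different.
means_differ = not (lower <= 0 <= upper)
print(f"95% CI for the difference: {lower:.2f} to {upper:.2f}")
print(f"means statistically different: {means_differ}")
```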
A confidence interval covers a population parameter with a stated confidence, that is, a certain proportion of the time. There is also a way to cover a fixed proportion of the population with a stated confidence. Such an interval is called a tolerance interval. Statistical tolerance intervals are limits within which we expect a stated proportion of the individual values of the population to lie. For a sample mean ($\bar{X}$) and sample standard deviation (s), the general formula for a tolerance interval is:

$$\bar{X} \pm k\,s$$
The value of k is based on the sample size (n), the confidence level (1-α), and the proportion of the population to be covered. The k factor can be obtained from the appropriate table in ISO 16269-6 (Statistical interpretation of data, Part 6: Determination of statistical tolerance intervals) or calculated by statistical software.
Revisiting the data from Table 1, the tolerance interval calculations for 95% confidence and 95% coverage can be found in Table 2.
Table 2. Tolerance interval calculations for 95% confidence and 95% coverage
The tolerance interval means we are 95% confident that 95% of the individual values in the population, whose mean and standard deviation are estimated from the sample as 31.7 and 3.07, will fall in the range of 18.12 to 45.28.
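The calculation itself is a single line once k is known, as the following sketch shows. The function name is hypothetical, and the factor k = 4.414 is an assumption: it is the commonly tabulated two-sided factor for n = 6 at 95% confidence and 95% coverage, with n = 6 inferred from the 5 degrees of freedom used earlier. Since the mean and standard deviation are rounded, the limits may differ slightly from those quoted above.

```python
def tolerance_interval(mean, sd, k):
    """Return (lower, upper) for mean +/- k * sd."""
    return mean - k * sd, mean + k * sd

# Assumption: k = 4.414 is read from a two-sided tolerance-factor table
# for n = 6, 95% confidence, 95% coverage; it is not computed here.
lower, upper = tolerance_interval(mean=31.70, sd=3.07, k=4.414)
print(f"95%/95% tolerance interval: {lower:.2f} to {upper:.2f}")
```

The tolerance interval is much wider than the confidence interval for the mean from the same data because it describes where individual values fall, not where the mean lies.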
The assumption for confidence intervals and tolerance intervals is that the data are an independent random sample from a single population that is normally distributed. Typically, the true mean and standard deviation from the population are not known, and are therefore estimated from the sample.
Process capability compares the output of a process to the specification limits. The comparison is made by forming the ratio of the spread between the process specifications (the specification "width") to the spread of the process values, as measured by process standard deviation. A capable process is one in which almost all the measurements fall inside the specification limits. Using confidence intervals or tolerance intervals, the percent of the values that would fall outside the specification can be calculated. This is best explained using an example.
Assume we have a process whose specification is that the mean of 5 samples must be between 40 and 50. If we sampled from the process and observed a mean of 46 and a standard deviation of 2 based on a sample of 5 units, the 95% confidence interval would be 42.9 to 49.1. The 99% confidence interval would be 41.0 to 51.0. Because the upper 99% confidence limit (51.0) lies above the upper specification (50), and the probability of exceeding the upper confidence limit alone is 0.5% (half of the total 1%), we would expect the mean to fall above the upper specification more than 0.5% of the time.
The exact percentage of means of samples of size 5 that would fall outside the specification, based on the observed mean and standard deviation, can be calculated. This is accomplished by setting each confidence limit equal to the corresponding specification limit and solving the confidence interval equation for the t-value. Because the mean is not centered in the specification, it is necessary to break the problem into two parts, solving for the lower and upper limits individually and then adding the probabilities together. Table 3 shows the results of these calculations.
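The two-part calculation can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not a reproduction of Table 3: the function names are hypothetical, the tail area of the t-distribution is obtained by numerical integration (statistical software would normally supply it directly), and the inputs are the mean, standard deviation, sample size, and specification limits as stated in the example. The figures it prints depend on those inputs and may not match the published table exactly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def fraction_outside(mean, sd, n, lower_spec, upper_spec):
    """Return (p_below, p_above, p_total); assumes the mean lies between the limits."""
    standard_error = sd / math.sqrt(n)
    df = n - 1
    # Solve the interval equation for t at each specification limit.
    t_low = (mean - lower_spec) / standard_error
    t_high = (upper_spec - mean) / standard_error
    p_below = t_upper_tail(t_low, df)
    p_above = t_upper_tail(t_high, df)
    return p_below, p_above, p_below + p_above

# Inputs as stated in the example: mean 46, sd 2, n = 5, limits 40 and 50.
p_below, p_above, p_total = fraction_outside(46, 2, 5, 40, 50)
print(f"below lower specification: {100 * p_below:.2f}%")
print(f"above upper specification: {100 * p_above:.2f}%")
print(f"total outside specification: {100 * p_total:.2f}%")
```

Treating the two tails separately matters here because the observed mean sits closer to the upper limit, so most of the out-of-specification probability comes from that side.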
Table 3. Analysis of 99% confidence interval
Confidence limits are limits in which we expect a given population parameter to fall. Statistical tolerance limits are limits in which we expect a stated proportion of the population to lie. Using the confidence interval and tolerance interval, specifications can be set that minimize the number of out-of-specification situations. A combination of tolerance intervals and confidence limits defines the overall process parameter (mean) and the distribution of the individuals.
Steven Walfish is the president of Statistical Outsourcing Services, Olney, MD, 301.325.3129, steven@statisticaloutsourcingservices.com