The key to a good graphical presentation is to select the method that best fits the data.
This article is the first in a four-part series on essential statistical techniques for any scientist or engineer working in the biotechnology field. Though many of the techniques seem trivial, they are often misused or misunderstood in practice. Future topics will include hypothesis testing, confidence intervals, Design of Experiments, and analysis of variance. This column deals with the ways data can be presented to maximize effectiveness, including methods to summarize data sets.
Steven Walfish
There are two main types of data: quantitative and qualitative. Quantitative, or numerical, data are continuous data sets with an infinite number of possible values. For example, protein concentration is considered continuous data because its value is limited only by the sensitivity of the measurement device. Qualitative, or categorical, data have a finite number of possible values. For example, the number of defective vials in a lot is considered qualitative because its values range from zero to the number of vials in the lot, in whole-unit increments.
Each of the data types has different statistical methods used for summarizing and reporting.
Qualitative data are usually presented in tabular format or as a percentage. Graphically, a bar graph can be used to present a single variable or to compare two or more variables. Figure 1 and Table 1 present the reasons for lot failures.
Figure 1. An example of a bar graph
It is not enough just to plot the data to compare the frequencies by year. If the number of lots each year is different, it is preferred to plot the percentages to make a better comparison. Figure 2 shows the preferred graph.
Table 1. The reasons for a lot failure
The comparative graph in Figure 1 does not show the vast improvement from 2006 to 2007 that can be seen in Figure 2. Additional statistical tests, such as the chi-squared test, can be used to show that the percent defective in 2007 was significantly lower than in 2006 (p = 0.005).
Figure 2. A preferred bar graph for comparing data from two years. The table below the bar graph makes it clear that the number of lots was different.
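The comparison of percentages and the chi-squared test described above can be sketched in a few lines. This is only an illustration: the lot counts below are hypothetical and are not the values from Table 1, the statistic is the plain Pearson chi-squared for a 2 x 2 table without a continuity correction, and the function names are invented for this example.

```python
import math

# Hypothetical counts (NOT the article's Table 1): lots made and lots failed.
lots = {"2006": 120, "2007": 200}
failed = {"2006": 30, "2007": 24}


def percent_failed(year):
    """Percentage of lots that failed in the given year."""
    return 100.0 * failed[year] / lots[year]


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction)
    for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows are years; columns are failed and passed lots.
chi2 = chi_squared_2x2(
    failed["2006"], lots["2006"] - failed["2006"],
    failed["2007"], lots["2007"] - failed["2007"],
)
p = p_value_1df(chi2)

for year in lots:
    print(year, f"{percent_failed(year):.1f}% failed")
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

In practice, most statistical packages provide this test directly; the hand calculation is shown only to make the arithmetic visible.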
Quantitative, or continuous, data are preferred because of their ability to estimate and predict the true population values. Though qualitative data can be used to estimate and predict the true population values, they typically require larger sample sizes to accomplish the task.
There are several summary statistics that are used with quantitative data. The most common is the mean, or average, of the data. Another estimate of central tendency is the median, or 50th percentile of the data. Although the mean is the most widely used measure, it is not appropriate for highly skewed distributions and is less efficient than other measures of central tendency when extreme scores are possible. The median is useful because its meaning is clear and it is more efficient than the mean in highly skewed distributions. Another good estimate of central tendency is the geometric mean, provided all the values are positive and the distribution has a positive skew. The geometric mean is computed by taking the average of the logarithms of all the values and raising the base of the logarithm to the resulting average. If the distribution is skewed positively, the mean will be larger than the median; if it is skewed negatively, the mean will be smaller than the median. When a distribution is symmetrical, the mean and the median are equal.
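As a small illustration of these three measures, the sketch below uses a hypothetical, positively skewed data set (the values are invented for this example). The geometric mean is computed exactly as described: average the natural logarithms, then raise e to that average.

```python
import math
import statistics

# Hypothetical, positively skewed measurements; all values are positive.
values = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 40.0]

mean = statistics.mean(values)
median = statistics.median(values)

# Geometric mean: average the logarithms, then raise the base (e) to that average.
log_mean = sum(math.log(v) for v in values) / len(values)
geometric_mean = math.exp(log_mean)

print(f"mean = {mean:.2f}, median = {median:.2f}, geometric mean = {geometric_mean:.2f}")
```

With one extreme value in the data, the mean is pulled well above the median, while the geometric mean stays closer to the bulk of the values.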
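The standard deviation, which is the square root of the variance, is by far the most widely used measure of spread. The variance is the average squared deviation of the data from their mean. A key point to remember is that variances can be averaged (for example, to pool the variability of several equally sized groups), but standard deviations cannot. To combine standard deviations, average the variances and then take the square root.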
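The point about averaging can be checked numerically. The sketch below assumes two hypothetical lots of equal size and uses the sample variance (n - 1 denominator), which is an assumption of this example rather than something the text specifies.

```python
import math
import statistics

# Hypothetical results from two lots of equal size.
lot_a = [9.8, 10.0, 10.2, 10.1, 9.9]
lot_b = [9.0, 10.0, 11.0, 10.5, 9.5]

var_a = statistics.variance(lot_a)  # sample variance (n - 1 denominator)
var_b = statistics.variance(lot_b)

# Average the variances first, then take the square root.
pooled_sd = math.sqrt((var_a + var_b) / 2)

# Averaging the standard deviations directly gives a different, smaller number.
naive_sd = (math.sqrt(var_a) + math.sqrt(var_b)) / 2

print(f"pooled SD = {pooled_sd:.4f}, average of SDs = {naive_sd:.4f}")
```

When group sizes differ, the variances would need to be weighted by their degrees of freedom; the equal-size case is used here to keep the arithmetic simple.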
The range is another estimate of the dispersion of the data, but it takes into account only two scores: the maximum and minimum values. A very handy method for comparing variability is the coefficient of variation (CV), sometimes called the relative standard deviation (RSD). The coefficient of variation is the standard deviation divided by the mean, usually expressed as a percentage. It measures variability in relation to the mean and is used to compare the relative dispersion in one type of data with the relative dispersion in another type of data. The data to be compared may be in the same units, in different units, with the same mean, or with different means.
There are several methods to graphically display quantitative data. The most common methods include the line plot, the box-and-whisker plot, and the histogram.
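A minimal sketch of the range and the coefficient of variation follows. The two data sets are hypothetical and deliberately use different units and scales; the function name is invented for this example.

```python
import statistics


def coefficient_of_variation(data):
    """Return the CV (RSD) as a percentage: 100 * sample SD / mean."""
    return 100.0 * statistics.stdev(data) / statistics.mean(data)


# Hypothetical data in different units and on different scales.
concentration_mg_ml = [48.0, 50.0, 52.0, 49.0, 51.0]
ph = [7.00, 7.02, 6.98, 7.01, 6.99]

# The range uses only the two extreme values.
range_conc = max(concentration_mg_ml) - min(concentration_mg_ml)

cv_conc = coefficient_of_variation(concentration_mg_ml)
cv_ph = coefficient_of_variation(ph)

print(f"range = {range_conc:.1f} mg/mL, CV = {cv_conc:.2f}%")
print(f"pH CV = {cv_ph:.2f}%")
```

Because the CV is unitless, the two results can be compared directly even though the raw standard deviations cannot.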
Graphing data makes it easier to see patterns in the data and to confirm assumptions about the distribution of the results. A line plot is a two-dimensional plot of data, usually over time, used to detect trends in the data. Line plots are used in conjunction with other statistical techniques such as control charts for process control. A control chart is a line plot with statistical limits set at ±3 standard deviations from the mean. Based on the normal distribution, about 99.7% of the data should fall within these limits. Figure 3 is an example of a line plot. The horizontal line is the mean of the 30 lots.
Figure 3. An example of a line plot. The horizontal line is the mean of the 30 lots.
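The ±3 standard deviation limits described above can be sketched as follows. The lot results are hypothetical, and the limits are computed from a set of baseline lots and then applied to newer lots, which is an assumption of this example. Note also that dedicated control-chart software often estimates the standard deviation differently (for instance, from moving ranges); the plain sample standard deviation is used here because that is what the text describes.

```python
import statistics

# Hypothetical baseline results used to set the limits.
baseline = [100.2, 99.8, 100.5, 99.6, 100.1, 100.0,
            99.9, 100.3, 99.7, 100.4, 100.0, 99.5]

center = statistics.mean(baseline)
sd = statistics.stdev(baseline)

upper_limit = center + 3 * sd
lower_limit = center - 3 * sd

# Hypothetical new lots checked against the baseline limits.
new_lots = [100.2, 99.9, 101.3]
flagged = [
    (number, result)
    for number, result in enumerate(new_lots, start=1)
    if result < lower_limit or result > upper_limit
]

print(f"center = {center:.2f}, limits = ({lower_limit:.2f}, {upper_limit:.2f})")
print("lots outside the limits:", flagged)
```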
A box-and-whisker plot, or simply a box plot, is a graphical representation of the dispersion of the data. The box plot in Figure 4 shows the lower quartile (Q1), upper quartile (Q3), and median. The box spans the scores falling into the middle 50% of the distribution. The whiskers (i.e., the vertical lines extending from the box) usually reach no farther than 1.5 times the interquartile range (Q3–Q1) beyond the box. Points that lie outside the whiskers are usually candidates for outlier analysis. The box plot also can be used to compare different lots or batches. A t-test would be used to compare two lots statistically. If you have more than two lots to compare, a one-way analysis of variance (ANOVA) would be used.
Figure 4. An example of a box plot
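The quantities drawn in a box plot can be computed directly. The sketch below uses hypothetical data and Python's `statistics.quantiles` with its default method; statistical packages differ slightly in how they interpolate quartiles, so the exact Q1 and Q3 values may not match other software.

```python
import statistics

# Hypothetical measurements with one unusually high value.
data = [7.0, 7.1, 7.2, 7.2, 7.3, 7.4, 7.5, 7.6, 9.0]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences at 1.5 times the interquartile range beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)

# Points beyond the fences are candidates for outlier analysis.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}; outlier candidates: {outliers}")
```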
The histogram plots the frequency of events where categories are ordered by x-values. If the data are continuous, they can be "binned" to create the cut points for the histogram. There is no best number of bins, and different bin sizes can reveal different features of the data. Sometimes experimentation with different bin sizes can highlight the salient points of the data. A histogram shows the shape of the data distribution, which is useful for checking the assumption of normality of the data. A popular method for testing normality is the Anderson-Darling test, which can be found in most statistical software. Figure 5 shows how two data sets can be compared. The top graphic shows the two separate histograms; in the bottom graph, the two distributions have been overlaid.
Figure 5. Two examples showing different ways to compare data sets. In the bottom graph, the two distributions have been overlaid.
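To illustrate how the choice of bins changes what a histogram reveals, the sketch below counts the same hypothetical data with two and with six equal-width bins. The `histogram` helper is written for this example only and assumes the data are not all identical.

```python
def histogram(data, bins):
    """Count values in `bins` equal-width bins spanning min(data) to max(data).

    Assumes max(data) > min(data), so the bin width is not zero.
    """
    low, high = min(data), max(data)
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        # The maximum value is placed in the last bin rather than one past it.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data made up of two separate clusters.
data = [4.8, 4.9, 5.0, 5.0, 5.1, 5.1, 5.2, 5.3, 6.8, 6.9, 7.0, 7.1]

coarse = histogram(data, bins=2)
fine = histogram(data, bins=6)

print("2 bins:", coarse)
print("6 bins:", fine)
```

With two bins the data look like a single lopsided group; with six bins the empty bins in the middle show that there are two distinct clusters.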
Continuous data are often converted to a qualitative or binary response. For example, when the specification for pH is 6.9 to 7.1, values within the limits are coded as pass and those outside the limits as fail. Though this makes for an easier disposition of the lot, it does not allow for extensive data analysis during an investigation. Another common problem with data analysis occurs when the measurement system is inadequate, so the data are binned into a few unique values. For example, a weight is recorded to the nearest 0.1 mg, although the measurement device can measure to the nearest 0.01 mg. This is commonly done to adhere to the significant figures in a specification, but it can lead to poor data analysis.
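The loss of information from pass/fail coding can be shown with a small sketch. The specification limits come from the pH example above; the two lots and the function name are hypothetical.

```python
import statistics

# Specification limits for pH, taken from the example in the text.
LOWER_LIMIT, UPPER_LIMIT = 6.9, 7.1


def disposition(ph):
    """Code a pH result as 'pass' or 'fail' against the specification."""
    return "pass" if LOWER_LIMIT <= ph <= UPPER_LIMIT else "fail"


# Hypothetical pH results for two lots.
lot_a = [7.00, 7.01, 6.99, 7.00, 7.02]
lot_b = [7.08, 7.09, 7.07, 7.09, 7.08]

coded_a = [disposition(x) for x in lot_a]
coded_b = [disposition(x) for x in lot_b]

# Once coded, the two lots are indistinguishable.
print(coded_a == coded_b)

# The original values show that lot B runs close to the upper limit.
print(f"lot A mean = {statistics.mean(lot_a):.3f}, lot B mean = {statistics.mean(lot_b):.3f}")
```

Both lots pass every test, yet only the continuous values reveal that the second lot sits near the edge of the specification.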
Data presentation should be designed to ensure the correct conclusions. Though graphical methods give an overview of the data, more rigorous statistical methods help to separate normal variability from special cause variability. The key to a good graphical presentation is to select the method that best fits the data.
Steven Walfish is the president of Statistical Outsourcing Services, Olney, MD, 301.325.3129, steven@statisticaloutsourcingservices.com.