A better method for trend analysis than CUSUM and control charts.
Change-point analysis is an effective and powerful statistical tool for determining if and when a change in a data set has occurred. The tool provides a confidence level that indicates the likelihood of the change. Change-point analysis can be used in three distinct applications: 1) determining if improvements or process changes may have led to a shift in an output, 2) problem solving, and 3) trend analysis. This paper describes how the tool can be used in the pharmaceutical industry for the three applications. Two case studies are presented to show how change-point analysis was used to verify the potential effect of process changes.
One question that is commonly asked during the analysis of time-ordered data is "Has there been a shift in the mean of the data?" One technique for assessing if and when a shift has occurred is a cumulative sum chart (CUSUM chart). CUSUM charts rely on a visual assessment of whether there is a change in the slope of the CUSUM plot. This technique works well with large changes because they produce an obvious change in slope. Subtle changes in slope are more difficult to detect and may be missed. Also, it is difficult to determine if a less pronounced change in slope represents a significant change. Potential change points identified with CUSUM charts can be confirmed by running a t-test of the data before and after the suspected change point. However, a t-test should only be used if the data is normally distributed and the change point is known. Bandurek provides a review of the use of CUSUM charts.1
Another statistical tool that uses CUSUM charts to look for shifts in data is change-point analysis. This approach goes one step further by assigning a confidence level for each change detected. The confidence level is determined using a technique known as bootstrapping, which takes the subjectivity out of a CUSUM analysis.
This paper describes how to conduct a change-point analysis and discusses three applications for the tool in pharmaceutical process monitoring and control: demonstrating improvements, problem solving, and trend analysis. The use of change-point analysis to demonstrate an improvement is highlighted in two examples.
Change-point analysis uses cumulative sum and bootstrapping techniques to identify changes and assign a confidence level to the change.2 First, a CUSUM chart is generated, which displays the cumulative sum of the differences between individual data values and the mean. If there is no shift in the mean of the data, the chart will be relatively flat with no pronounced changes in slope. Also, the range (the difference between the highest and lowest data points) will be small. A data set with a shift in the mean will have a slope change at the data point where the change occurred, and the range will be relatively large.
Figure 1. Example of a CUSUM chart. Each point plotted represents the difference between an individual data point and the mean, which is added to or subtracted from the previous point on the graph (depending on whether the difference between the individual data point and the mean is positive or negative). The data for this CUSUM chart are shown in Figure 2.
Figure 1 shows an example of a CUSUM chart. Data for this chart are listed in the Excel table in Figure 2. Column B contains the raw data and column C contains the cumulative sum. The cumulative sum at each data point is calculated by adding the difference between the current value and the mean to the previous sum [i.e., S_i = S_{i-1} + (X_i - X̄) for i = 1 to n, where S_i is the cumulative sum, X_i is the current value, X̄ is the mean, and S_0 = 0]. A CUSUM chart starting at zero will always end at zero (S_n = 0). If a CUSUM chart slopes down, it doesn't necessarily mean that the data are trending down. Rather, it indicates a period in time when most of the data are below the mean. A sudden change in direction of a CUSUM indicates a shift in the average. From the CUSUM chart in Figure 1, it appears that a change may have occurred at data point 20. At this point, the slope changes direction and increases, indicating that most of the data points are now greater than the average. The point at which the CUSUM chart is furthest from the baseline of zero is the estimated point of change. Interpreting the CUSUM chart would lead one to the conclusion that a shift in the mean occurred at data point 20. This interpretation relies on a subjective assessment as to whether there is a change in the slope.
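The CUSUM calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample data are hypothetical and are not the values from Figure 2. The index returned is the point where the CUSUM is furthest from zero; whether that point is read as the last point before the shift or the first point after it is a convention the analyst has to fix.

```python
from itertools import accumulate


def cusum(values):
    """Return [S_0, S_1, ..., S_n] with S_0 = 0 and S_i = S_{i-1} + (X_i - mean)."""
    mean = sum(values) / len(values)
    return list(accumulate((x - mean for x in values), initial=0.0))


def cusum_range(values):
    """Difference between the highest and lowest cumulative sums."""
    sums = cusum(values)
    return max(sums) - min(sums)


def furthest_point(values):
    """1-based index of the data point where the CUSUM is furthest from zero."""
    sums = cusum(values)
    return max(range(1, len(values)), key=lambda i: abs(sums[i]))


# Hypothetical data: five points at 10 followed by five points at 12.
data = [10.0] * 5 + [12.0] * 5
print(cusum(data))
print(cusum_range(data), furthest_point(data))
```

For this hypothetical series the CUSUM falls steadily while the data sit below the mean and then climbs back to zero, which is the change in slope the text describes.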
Figure 2. Excel table with data used to generate the CUSUM chart in Figure 1
Change-point analysis builds on a CUSUM chart by determining a confidence level for each change. Confidence levels are calculated using a technique known as bootstrapping, whereby many random iterations of the data set are generated. For each randomized data set, the corresponding cumulative sums are determined, along with the ranges for the sums. The percent of times that the cumulative sum range for the original data exceeds the cumulative sum range for the randomized bootstrap data is the confidence level. The idea behind this algorithm is that if a pronounced change has occurred, the range on the CUSUM chart for the original data will be large, and randomizing the data will not lead to data sets with larger ranges, or will do so only infrequently.
Figure 3. Bootstraps of original data from Figure 2. The average of each bootstrap data set is shown at the bottom. Bootstraps are data sets in which the original data has been randomly reordered. For this example, 100 bootstraps were conducted, the first 15 of which are shown here.
In the sample data set from Figure 2, 100 bootstraps were produced using the Excel formula =INDEX($B$3:$B$32,1+30*RAND(),1). The first 15 bootstraps are shown in Figure 3. For each bootstrap data set, the corresponding CUSUM data are generated along with the range (difference between the highest and lowest CUSUM values). The bootstrap CUSUM values are shown in Figure 4. All the formulas used for the analysis are in Table 1.
Table 1. Microsoft Excel formulas for conducting a change-point analysis
The final step in determining the confidence level is to calculate the percent of times that the range for the original CUSUM data exceeds the range for the bootstrap CUSUM data. The CUSUM data range for the bootstraps is in row 115 of Figure 4, whereas the CUSUM range for the original data is in cell F3 of Figure 2. In this example, the confidence level was 99%. It is appropriate to set a predetermined threshold confidence level beyond which a change is considered significant. Typically, 90% or 95% is selected.
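The confidence-level calculation can also be sketched outside Excel. The Python below is a minimal illustration, not the authors' spreadsheet: the function names, the default of 1,000 bootstraps, and the fixed seed are assumptions. By default each bootstrap is a random reordering of the original data; the Excel formula quoted in the text draws each value independently (sampling with replacement), and the `with_replacement` flag mimics that variant.

```python
import random
from itertools import accumulate


def cusum_range(values):
    """Range (max minus min) of the CUSUM series S_0 = 0, S_1, ..., S_n."""
    mean = sum(values) / len(values)
    sums = list(accumulate((x - mean for x in values), initial=0.0))
    return max(sums) - min(sums)


def confidence_level(values, n_bootstraps=1000, with_replacement=False, seed=0):
    """Percent of bootstraps whose CUSUM range is smaller than the original range."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    original_range = cusum_range(values)
    count = 0
    for _ in range(n_bootstraps):
        if with_replacement:
            sample = rng.choices(values, k=len(values))  # like INDEX(..., 1+n*RAND())
        else:
            sample = rng.sample(values, len(values))     # random reordering
        if cusum_range(sample) < original_range:
            count += 1
    return 100.0 * count / n_bootstraps


# Hypothetical data with an obvious upward shift halfway through.
shifted = [10, 11] * 5 + [14, 15] * 5
print(confidence_level(shifted))
```

The result would then be compared with the predetermined threshold (for example, 90% or 95%) to decide whether the change is considered significant.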
Figure 4. CUSUM values for the bootstrap data. The range for each data set is shown at the bottom in row 115.
The change at data point 20 that was indicated on the original CUSUM chart has been shown to have a 99% confidence level based on 100 bootstraps. A histogram of the CUSUM ranges for the 100 bootstraps is shown in Figure 5. As shown in the histogram, only one of the bootstrap ranges was greater than 9.8, which was the CUSUM range of the original data. Thus, the confidence level was 99%.
Figure 5. Histogram of bootstrap CUSUM ranges. The CUSUM range for the original data is indicated by the red line at 9.8.
The original CUSUM chart hints at other, more subtle changes. These potential changes can now be assigned a confidence level by dividing the data into two subsets: data points 1 to 19 and data points 20 to 30. Each data subset can then be subjected to its own change-point analysis to see if the threshold confidence level is exceeded.
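The subdivide-and-repeat idea lends itself to a short recursive sketch. The Python below is an assumption-laden illustration rather than the algorithm used by any particular software package: the names, the minimum subset size, the seed, and the convention of reporting the first point after the shift are all choices made for this example.

```python
import random
from itertools import accumulate


def cusum(values):
    mean = sum(values) / len(values)
    return list(accumulate((x - mean for x in values), initial=0.0))


def confidence_level(values, rng, n_bootstraps=1000):
    """Percent of random reorderings with a smaller CUSUM range than the original."""
    def spread(v):
        sums = cusum(v)
        return max(sums) - min(sums)

    original = spread(values)
    hits = sum(spread(rng.sample(values, len(values))) < original
               for _ in range(n_bootstraps))
    return 100.0 * hits / n_bootstraps


def find_changes(values, threshold=95.0, min_size=5, seed=0, _offset=0, _rng=None):
    """Return [(first point after the change, confidence), ...], 1-based, in order."""
    rng = random.Random(seed) if _rng is None else _rng
    if len(values) < 2 * min_size:          # too few points to split further
        return []
    conf = confidence_level(values, rng)
    if conf < threshold:                    # no significant change in this subset
        return []
    sums = cusum(values)
    # m = point where the CUSUM is furthest from zero (last point before the shift)
    m = max(range(1, len(values)), key=lambda i: abs(sums[i]))
    left = find_changes(values[:m], threshold, min_size, seed, _offset, rng)
    right = find_changes(values[m:], threshold, min_size, seed, _offset + m, rng)
    return left + [(_offset + m + 1, conf)] + right


# Hypothetical data: a single shift starting at point 11.
data = [10, 11] * 5 + [14, 15] * 5
print(find_changes(data))
```

Each subset is analyzed against the same threshold, so a subtle secondary change is only reported if it is significant within its own subset.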
Potential changes in variation also can be assessed using the change-point analysis technique. Because biologics manufacturers produce a relatively small number of batches each week, it is not always practical to perform a change-point analysis on standard deviation. As an alternative, a change-point analysis is conducted on the difference between consecutive data points. For the sample data from Figure 2, let X_1, X_2, ..., X_30 represent the 30 data points. From this, 15 consecutive differences, D_1, D_2, ..., D_15, are calculated as follows: D_i = |X_{2i} - X_{2i-1}| for i = 1 to 15. The change-point analysis is then performed on D_1 through D_15.
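As a small illustration of the differencing step, the Python sketch below forms the non-overlapping paired differences defined above. The function name and the sample values are hypothetical; the resulting list would be fed into the same CUSUM and bootstrap procedure used for the mean.

```python
def paired_differences(values):
    """D_i = |X_{2i} - X_{2i-1}| for i = 1 .. n // 2 (non-overlapping pairs)."""
    return [abs(values[2 * i + 1] - values[2 * i]) for i in range(len(values) // 2)]


# Hypothetical data: six points give three differences.
print(paired_differences([1, 3, 2, 2, 5, 1]))
```

With 30 data points this yields 15 differences; if the number of points is odd, this sketch simply ignores the last one.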
Microsoft Excel was used to perform the analysis described above. A commercially available software package known as Change-Point Analyzer (Taylor Enterprises, Inc.) greatly simplifies the analysis. The remaining examples in this paper used Change-Point Analyzer, version 2.2.
In the two case studies shown here, change-point analysis using Change-Point Analyzer was used to show that a process change correlated with an improvement in a measured output.
Manufacturers of active pharmaceutical ingredients typically monitor process recovery to ensure that the process is performing as expected and that yield targets are being met. The historical record of recovery also provides a baseline against which potential improvements can be measured. Change-point analysis is an effective tool to verify whether a process change has led to measurable improvements, such as an increase in recovery.
In the first example, a process change was initiated in an attempt to improve recovery of a target protein, and a change-point analysis was conducted on lots bracketing the time of the change (Figure 6). This analysis revealed that an increase in recovery occurred at lot 28 (indicated by the upward shift in the green zone and by the summary table below the plot). The actual process change was initiated at lot 24. The increase in percent recovery was therefore detected only a few lots after the actual process change, and the actual change falls within the 95% confidence interval indicated in the summary table for Figure 6. The change was from 48.6% to 53.7% and had a confidence level of 96%.
Figure 6. Change-point analysis of percent recovery for a plasma protein. The analysis was done using Change-Point Analyzer. The data show an upward shift in the percent recovery. The shift is centered at lot 28 and represents a shift from 48.6% to 53.7%. This change has a 96% confidence level. An additional output from the software is a confidence interval, which in this example is from lots 10 to 39. The red lines indicate control limits. The level is an indication of the importance of the change. A level 1 change is the first change detected and the one most visually apparent in the plot. A level 2 change is detected on a second pass through the data after the data are subdivided into two subsets.
In the second case study, the parameter that was being monitored was potency targeting. Different potencies of active ingredient are produced depending on the needs of the customer. To meet these needs, it is important that the actual potency be as close as possible to the potency level required by the customer (referred to as the target). The effectiveness of potency targeting is measured by how far the actual potency is from the target potency (percent from target). A recent improvement project sought to improve potency targeting. Process- and assay-related factors are both responsible for the variation in targeting accuracy. In this project, assay improvements were implemented and a change-point analysis was conducted on lots that bracketed the time of implementation of the improvements (Figure 7). Actual improvements were implemented at lot 17, and the results of the change-point analysis indicate that a shift in the mean from 9.9% to 5.8% occurred at lot 31. The confidence interval for this change is from lots 15 to 45, which encompasses when the actual change occurred.
Figure 7. Change-point analysis of potency targeting. The data show a decrease in the percent from target at lot 31. The potency targeting improved from 9.9% to 5.8%. This change has a confidence level of 97% and a confidence interval of lots 15–45.
The statistical significance of changes identified by change-point analysis also can be verified by conducting a two-sample t-test. This was done for the two examples described above, and the results confirmed the statistical significance of the change (Tables 2 and 3). For each example, two t-tests were conducted. One compared data before and after the change point identified by change-point analysis, and the other compared data before and after the actual lot when the process change occurred. The p-value for all the t-tests was <0.05, indicating that with at least 95% confidence, the means are not equal. The t-tests confirmed the change-point analysis results that a shift in the mean occurred.
Table 2. T-test results from the percent recovery data
Because a normal distribution is an assumption for a t-test, it should only be used to verify the statistical significance of changes when data are normally distributed. Also, the t-test by itself does not identify when changes occur. It can only be used to confirm a hypothesized point of change, and as such is not a substitute for change-point analysis when used for problem solving and trending.
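The before-and-after comparison can be sketched as follows. The text does not say which form of the two-sample t-test was used, so this example assumes Welch's unequal-variance version; the function names and data are hypothetical. Only the t statistic and degrees of freedom are computed here, and the p-value would still have to come from a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def split_at(values, first_point_after_change):
    """Split a series into the points before and from a 1-based change point."""
    k = first_point_after_change - 1
    return values[:k], values[k:]


def welch_t(before, after):
    """Welch's t statistic and approximate degrees of freedom (assumed variant)."""
    n1, n2 = len(before), len(after)
    a = variance(before) / n1   # squared standard error of the first mean
    b = variance(after) / n2    # squared standard error of the second mean
    t = (mean(before) - mean(after)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical series with a hypothesized change starting at point 6.
before, after = split_at([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 6)
print(welch_t(before, after))
```

Running the comparison once at the change point estimated by change-point analysis and once at the lot where the process change was actually made mirrors the two t-tests described for each case study.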
Table 3. T-test results from potency targeting data
If the timeframe of a process change corresponds to the timeframe of a shift in the output of a measured parameter, one can conclude that the process change may have caused the shift. Correspondence doesn't necessarily prove a cause-and-effect relationship between the process change and the shift in output, however. The analysis should be supplemented with process knowledge, other statistical analysis, or scaled-down experimentation to more definitively demonstrate cause and effect.
In the problem-solving mode, a retrospective change-point analysis is conducted on selected attributes to see if the timeframe of a change in one of the attributes corresponds to the occurrence of a manufacturing problem. As a hypothetical example, let's assume that there was an increase in out-of-range results for sodium concentration in a processing buffer. A change-point analysis should be run on the sodium concentration of buffer lots over time. This allows us to determine if the out-of-range batches are isolated instances or if there has been a shift in the sodium concentration of the buffer that is causing more batches to be out of range. The analysis should encompass enough lots before the increase in out-of-range results to provide a good representation of the variation in the data. If the change-point analysis shows that a change in the sodium concentration occurred in the same timeframe as the onset of the out-of-range results, the investigation can focus on the cause of the shift.
As part of the investigation, many potential causes may be identified, including changes in:
Change-point analysis can be conducted on any of the time-ordered data such as mixing speed for each lot, mixing time for each lot, or assay control values. If any of the results from a change-point analysis indicate a shift in the data that corresponds to the sodium concentration shift, one can narrow the scope of the investigation. Alternatively, root causes can be eliminated if a measured parameter did not change over the timeframe examined.
The hypothetical example described above shows how change-point analysis can be used to provide a retrospective analysis of a data set to see if shifts in the data correspond to the onset of a manufacturing issue. In this context, change-point analysis serves as a powerful problem-solving tool.
Control charts are commonly used to evaluate data on a lot-by-lot basis or sample-by-sample basis, and can indicate when a particular lot or sample is not part of the same population of data as the data that was used to generate the control limits (special cause variation). Furthermore, trend analysis can be done with control charts if the data is normally distributed. Likewise, change-point analysis also can be used on a real-time basis to look for trends or shifts in a process. Change-point analysis provides some advantages over control charts when looking for shifts in data.2 These advantages are indicated below.
1. Change-point analysis can detect more subtle shifts than control charts. Because biomanufacturing processes can have a lot of variation, subtle changes may not always be meaningful. However, this determination must be made separately from the determination of significant changes identified by change-point analysis.
2. Any type of data distribution can be analyzed. With control charts, a trend analysis requires a normally distributed data set.
3. Any type of data, including attributes, can be analyzed with the same change-point analysis tool. In contrast, a different type of control chart is needed for each data type.
4. Change-point analysis can lead to fewer false positives than control charts.
5. Multiple changes in the mean and variation can be detected by a single change-point analysis using Change-Point Analyzer.
Because change-point analysis does not assume a normal distribution, it works particularly well with non-normal data such as particle counts or bioburden data.
It is preferable to use change-point analysis in conjunction with control charts. One recommendation is to use a control chart to determine whether there is special cause variation for each data point as it is generated. A change-point analysis is then used on a less frequent basis to look for shifts in the data. The reason to use change-point analysis on a less frequent basis is that changes are not always detected immediately after a shift in the data. It may take a few lots or data points after a shift in the mean or variation before a change is recognized. The recommended frequency of running a change-point analysis depends on how fast data points are generated. In the biologics or recombinant protein industry, where only a few lots may be produced each week, a weekly or monthly change-point analysis is sufficient.
Change-point analysis is a powerful statistical tool for detecting shifts over time in a data set. The analysis can be run using Excel or Change-Point Analyzer, and can work with any data distribution and data type (continuous, categorical, or attribute). It can detect multiple changes in the mean or variation of a data set. It can be used in three types of analysis: 1) verification of improvements, 2) problem solving, and 3) trend analysis.
Change-point analysis offers several advantages over other methods of detecting data shifts. Cumulative sum (CUSUM) charts have historically been used for this purpose; however, that technique does not yield a confidence level. Using CUSUM charts in combination with bootstrapping, change-point analysis can produce a confidence level for every change in the mean or variation detected. The software application Change-Point Analyzer produces a confidence interval that provides 95% confidence that a change occurred within the bounds of the interval. Also, using trend analysis with control charts has several limitations that do not arise with change-point analysis. For example, change-point analysis can work with any data distribution and data type. With control charting, a different type of control chart is required for each data type and it is advised to only use trend rules on normally distributed data. Furthermore, change-point analysis can detect more subtle changes than control charts and produces fewer false positives.
The authors would like to acknowledge several people who contributed to this work. Cathy Sar, Danny Pitpit, and Allan Fajardo contributed to the potency targeting improvement project that formed the basis for one of the case studies presented here. Sundar Ramanan, PhD, contributed to the yield improvement case study. We are grateful to Yeong Wang, PhD, Michael Bellomo, and Wayne Taylor, PhD, for helpful advice in the preparation of the manuscript.
Patrick Gavit is a technical director, Yasser Baddour is a research scientist, and Rebecca Tholmer is a manager, quality, all at Baxter Healthcare Corp., Los Angeles, CA, 818.507.8237, patrick_gavit@baxter.com
1. Bandurek GR. Cumulative sum charts for problem solving. BioPharm Int. 2008 May;21(5):58–67.
2. Taylor WA. Change-Point Analysis: A Powerful New Tool For Detecting Changes. 2000. Available from: http://www.variation.com/cpa/tech/changepoint.html.