ASSAY DEVELOPMENT

Comparing statistical analysis methods for diagnostic devices

Why Bayesian versus frequentist statistical methods may determine the accuracy of diagnostics products

Roger F. Sewell and Amanda J. Fuller

Bayesian inference yields diagnostic classification that is at least as accurate as that of frequentist methods, especially with complex and noisy data sets such as those found when developing IVDs.

Lord Ernest Rutherford, the father of nuclear physics, once said, “If your experiment needs statistics, you ought to have done a better experiment.” The remark seems laughable today, with statistical analyses being such an irreplaceable tool in the scientific arsenal.

The statistical analysis methods that have primarily been taught in schools follow the frequentist approach. However, another approach, Bayesian statistics, is becoming more widespread. With today’s noisy data sets, Bayesian techniques allow scientists to extract the most information from an experimental data set, which can lead to the development of more-accurate diagnostic instruments.

FDA has recently recognized the value of Bayesian analysis compared with frequentist methods in conducting clinical trials for medical devices. Bayesian statistics provide a coherent way to learn from data as they accumulate and a formal mathematical technique for combining data with prior information. This can reduce the amount of data needed in clinical trials for devices. The same argument holds in applying Bayesian techniques to the development of processing algorithms for diagnostic instruments. Devices suited for Bayesian techniques include microarrays, flow cytometry, proteomics, and time-of-flight spectroscopy. Such devices produce data that can be handled effectively by Bayesian analysis.

Frequentists and Bayesians each advocate their statistical approach stridently, at times with near-religious fervor. This article compares Bayesian and frequentist methods and illustrates applications in which the lesser-used Bayesian methods have significant advantages, in particular in developing diagnostic products. Because Bayesian methods ask and answer the right questions, they can lead to the development of more reliable, effective, and low-cost diagnostic solutions.

Description of Bayesian and Frequentist Methods

Bayesian techniques have become more widely used in real-world applications only during the last 30 years due to advances in the computing power required in the initial algorithm development and validation. However, once created, the algorithms can be embedded on low-cost digital signal processing chips. Although applying Bayesian inference to real-world situations requires an experienced user, the benefits that this technique offers far outweigh the initial investment.

Bayesian and frequentist methods differ in how the correspondence is constructed between mathematical objects and real-world ideas. The frequentist approach regards probabilities as measures of frequency, while the Bayesian approach regards them as degrees of belief.

In Bayesian statistics, what is known before collecting the data is the prior information, and what is known after collecting the data is the posterior information. In mathematical terms, the Bayesian technique calculates P(θ|D), the posterior probability distribution of the unknown variable θ given the data D. This is what the user actually wants to know, but calculating it requires the prior probability distribution of θ. In comparison, the frequentist analysis calculates P(D ∈ C|H0), the probability that the data fall in some critical region C given some null hypothesis H0 about θ.
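
Written out, the link between the prior and the posterior is Bayes' theorem. The likelihood P(D|θ) and the normalizing term P(D) are standard names that the text above does not use explicitly; they are included here only to show where the prior enters the calculation:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
\qquad
P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta
```

The frequentist quantity P(D ∈ C|H0) involves no P(θ) at all, which is why it can be computed without a prior.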

The difference between these two is pivotal. While the latter probability can be calculated without knowing the prior distribution of θ, the result is not the answer that a user actually needs. It is similar to answering a different question than what is asked in an exam because of a lack of knowledge on how to answer the original question.

Nonstatisticians often want to ask such questions as, “Does this set of samples come from a normally distributed (also called Gaussian) variable?” In order to give meaning to such a question, a prior distribution on the distributions considered possible for the variable is essential. For example, if someone asks, “Is Martin 1.5 m tall?” the answer depends on whether the word exactly is part of the question. If it is, then the answer is no with 100% probability. But if approximately is part of the question, then the answer depends on the exact meaning of approximately.

The same applies to questions about distributions. Very few experimentally observable variables are exactly Gaussianly distributed, even if they are approximately so. But in this case, the whole question turns on what exactly is meant by “approximately,” and without that, the question is meaningless. Unfortunately, specifying what is meant by approximately in this context is much more difficult than when measuring somebody’s height.

Figure 1. The prior distribution (left), which represents views on the probability of an ordinary coin coming down heads, and the posterior distribution (right) given 7 out of 8 heads.

Another example of the difference between Bayesian and frequentist processing methods is demonstrated by the simple case of a coin toss. There is a reasonable prior belief that the probability of heads should be close to 0.5. Such a prior belief might be represented by a Beta distribution, such as the curve shown on the left in Figure 1. If the coin is tossed eight times and the results are one tail and seven heads, Bayes’ theorem says that the posterior distribution on the probability of heads looks like the curve on the right in Figure 1.

Meanwhile, the frequentist defines an unbiased estimator to be one whose expectation is always equal to the true value. The frequentist maximum likelihood estimate, which is unbiased in this technical sense, comes in at 7/8 = 0.875. But this is unreasonable as an estimate of the probability of a coin coming down heads, because it is biased, in the everyday sense, by ignoring the prior information about coins. The combined effect of these factors is that users are not encouraged to think clearly about the problems they face. They get answers to the wrong questions, and those answers are biased both by ignoring information other than the data and by concentrating on local areas of high probability density rather than the behavior of the whole probability mass.
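
The coin example can be reproduced in a few lines. The sketch below is illustrative only: the article does not give the parameters of the Beta prior in Figure 1, so Beta(20, 20) is an assumed choice that is simply peaked near 0.5. With a Beta prior and coin-toss data, the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts.

```python
from fractions import Fraction

def beta_posterior(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior plus coin-toss counts gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)

# ASSUMPTION: Beta(20, 20) stands in for the prior of Figure 1 (not given in the article).
prior_a, prior_b = 20, 20
heads, tails = 7, 1

post_a, post_b = beta_posterior(prior_a, prior_b, heads, tails)
posterior_mean = beta_mean(post_a, post_b)        # summary of the Bayesian posterior
max_likelihood = Fraction(heads, heads + tails)   # frequentist point estimate, ignores the prior

print(float(posterior_mean), float(max_likelihood))
```

Under this assumed prior the posterior mean moves only slightly above 0.5, whereas the maximum likelihood estimate jumps to 0.875. A weaker or stronger prior would shift the posterior mean accordingly.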

The following are two more complex examples that illustrate the types of signal processing problems that could be encountered in a diagnostic instrument. These examples discuss how the choice between Bayesian and frequentist methods could influence diagnostic instrument designers. Depending on the application, using Bayesian inference may make the results more accurate, indicate precisely what uncertainties there are, or make the project possible at all rather than impossible. (There are many different frequentist methods. This article only mentions a subset because of space constraints.)

Fluorescence Lifetime Measurements

Extracting quantities of fluors of different lifetimes present in a mixture provides an example of how Bayesian techniques can generate better results than traditional frequentist methods. Fluorescence measurements are used in a variety of IVDs, from lateral-flow tests to high-throughput screening. This example will consider the situation of two fluors of known lifetimes present in unknown amounts.

The traditional method for solving this problem is time-correlated single-photon counting (TCSPC). A sample is illuminated with pulses of laser light that are deliberately so dim that only one fluorescent photon per hundred excitation pulses is expected to come back. This is an essential feature of this approach. Even if the laser could be made brighter, this approach would then fail: since only the first photon coming back after each excitation is recorded, the arrival times available would be biased to be earlier than expected.
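
The bias toward early arrival times can be seen in a small simulation. This is a sketch under simplifying assumptions that are not taken from the article: a single fluor with a lifetime of 1.0 in arbitrary time units, a Poisson-distributed number of photons per pulse, and a detector that records only the first photon of each pulse.

```python
import random

def poisson_sample(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential gaps that fit within `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count

def mean_first_arrival(photons_per_pulse, lifetime, pulses, seed=0):
    """Average recorded arrival time when only the first photon per pulse is kept."""
    rng = random.Random(seed)
    recorded = []
    for _ in range(pulses):
        n = poisson_sample(rng, photons_per_pulse)
        if n > 0:
            # expovariate takes a rate, so the rate is 1 / lifetime
            arrivals = [rng.expovariate(1.0 / lifetime) for _ in range(n)]
            recorded.append(min(arrivals))
    return sum(recorded) / len(recorded)

# Hypothetical settings: dim laser (1 photon per 100 pulses) versus bright laser (1 per pulse).
dim = mean_first_arrival(0.01, lifetime=1.0, pulses=400_000)
bright = mean_first_arrival(1.0, lifetime=1.0, pulses=20_000)
print(dim, bright)
```

With the dim laser the average recorded time stays close to the true lifetime, while with the bright laser it falls noticeably below it, because pulses that return several photons contribute only their earliest one.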

After each excitation, the arrival time of only the first of any photons received is recorded. After many tens of thousands of excitation cycles, a histogram of the arrival times of the fluorescent photons seen is made, and least-squares curve fitting of a family of exponential curves is done. This method amounts to using maximum-likelihood estimation under the assumption that the errors are due to Gaussian noise of constant amplitude.

In contrast, the Bayesian method of solving this problem sets the laser to its full brightness (which we will assume is sufficient to get back on average one photon per excitation, but which will give even better results if a greater number of photons is received), records the arrival times not just of the first but of any and all photons received, and applies Bayesian inference to the resulting data.
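
The article does not describe the authors' algorithm, so the following is only a minimal sketch of what Bayesian inference on all photon arrival times could look like. It assumes that each fluor contributes a Poisson number of photons per excitation (with unknown means a1 and a2), that arrival times are exponentially distributed with the known lifetimes, and that the prior over (a1, a2) is flat on a grid. The lifetimes, brightness values, and grid are hypothetical.

```python
import math
import random

TAU1, TAU2 = 1.0, 4.0   # known lifetimes (hypothetical values, arbitrary time units)

def simulate(a1, a2, excitations, rng):
    """Pool the arrival times of every photon over the given number of excitations."""
    times = []
    for _ in range(excitations):
        for mean, tau in ((a1, TAU1), (a2, TAU2)):
            # Poisson photon count: one photon per unit-rate exponential gap inside `mean`.
            total = rng.expovariate(1.0)
            while total < mean:
                times.append(rng.expovariate(1.0 / tau))
                total += rng.expovariate(1.0)
    return times

def log_likelihood(a1, a2, times, excitations):
    """Poisson-process log-likelihood of all recorded arrival times."""
    log_rates = sum(math.log(a1 / TAU1 * math.exp(-t / TAU1) +
                             a2 / TAU2 * math.exp(-t / TAU2)) for t in times)
    return log_rates - excitations * (a1 + a2)

def grid_posterior(times, excitations, grid):
    """Posterior over (a1, a2) on a grid, ASSUMING a flat prior."""
    logp = {(a1, a2): log_likelihood(a1, a2, times, excitations)
            for a1 in grid for a2 in grid}
    peak = max(logp.values())                      # subtract the peak to avoid underflow
    weights = {k: math.exp(v - peak) for k, v in logp.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

rng = random.Random(1)
excitations = 300
times = simulate(0.6, 0.4, excitations, rng)       # hypothetical true brightnesses
grid = [0.05 * i for i in range(1, 31)]            # candidate values 0.05 .. 1.50
posterior = grid_posterior(times, excitations, grid)
mean_a1 = sum(a1 * p for (a1, _), p in posterior.items())
mean_a2 = sum(a2 * p for (_, a2), p in posterior.items())
print(mean_a1, mean_a2)
```

The result is a full distribution over the two brightnesses rather than a single fitted point, which is the kind of output plotted as the colored area in the figures that follow. A real instrument would also need to model background counts and detector response, which this sketch omits.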

In order to evaluate the two algorithms, the figures compare the Bayesian algorithm after n excitations and the traditional frequentist algorithm after 100n excitations, so that the average total number of photons emitted is the same for the two algorithms (although the collection time for the Bayesian algorithm is 100 times shorter). The Bayesian solution is shown by the background color plotting of the posterior, the correct result by an x, and the maximum likelihood point by a plus sign.

Figure 2. Results after 10 excitations for the Bayesian algorithm and 1000 excitations for the traditional method. The truth is shown by an X in a circle, the Bayesian answer is the colored area, while the traditional method’s answer is shown by a + in a circle.

After 10 excitations for the Bayesian algorithm and 1000 for the traditional method, the results are shown in Figure 2. The Bayesian outcome gives a distribution that has localized the amount of fluor 2 better than the amount of fluor 1, and includes the correct result within its distribution. The traditional method, however, produces an outcome biased toward the top left of the figure.

Figure 3. Results after 10,000 excitations for the Bayesian algorithm and 1 million excitations for the traditional method. The truth is shown by an X in a circle, the Bayesian answer is the colored area, while the traditional method’s answer is shown by a + in a circle well to the left of the true point.

Comparisons between the two techniques can be drawn at other levels, such as 10,000 excitations for the Bayesian method and 1 million excitations for the traditional frequentist method. Figure 3 shows that the Bayesian approach can be certain to within a very small range of the true values of the brightness of both fluors (the colored area is partially hidden behind the x sign). However, the traditional approach continues to give a result biased to the left and upward, with no indication that this is wrong. Even if some form of frequentist confidence intervals were put around this traditional outcome, they would still be biased and exclude the correct result.

So in this example, the Bayesian method finds the true answer (by defining a range in which the truth lies) whereas the frequentist method gives the wrong answer and gives no indication that this is what it is doing. This has serious implications for any diagnostic device.

Several reasons explain why the Bayesian method produces better results than the traditional method, despite using 100 times fewer excitations for the same average total number of photons. The Bayesian approach avoids the following problems:

  • Biasing the photon arrival times by rejecting all but the first in any excitation cycle.
  • Assuming Gaussianity inherent in least-squares curve fitting.
  • Looking for the maximum of anything; rather, it returns information on the whole posterior probability mass.

Bayesian inference has enabled use of 100 times fewer excitations and nonetheless achieved a result that is genuinely unbiased compared with the traditional TCSPC method. The Bayesian method has also provided a precise view of the uncertainty remaining in a solution for any number of excitations, even as few as 10. Relating this to a diagnostic instrument, the Bayesian posterior distribution will encompass the correct results, while processing data with the frequentist method will lead to inaccurate results.

Automated Multivariate Diagnostics

Electronic noses illustrate the signal processing needed for nonspecific multivariate assay tests. An electronic nose is an array of nonspecific sensors of volatile compounds. While each sensor in the array can respond to a wide range of different compounds by giving an electrical signal, each sensor’s pattern of response to various compounds is different. The task for the signal processing block is to identify patterns of responses that can be used for diagnostic purposes.

A similar signal-processing philosophy can be applied to multigene tests that a number of different diagnostics companies are currently developing as cancer prognostics (reference 1). This concept of an electronic nose will be used to compare the quality of results obtained with Bayesian and frequentist methods.

The developers of any statistical signal processing algorithm require an initial set of data, or training data, to determine the algorithm parameters. In this electronic nose example, we will suppose the nose distinguishes two groups: sick and well. In order to present and compare the results graphically, the number of sensors is limited to two (a value of 32 might be more typical).

Figure 4. The best possible classification certainty using information only available to the person who synthesized the data, and the 800 training data points.

The background color in Figure 4 represents the probability that the combined readings of the two sensors place a person in either of the two groups. This is what each of the statistical algorithms will try to recreate. Colors at the red end of the scale represent a high probability that a reading comes from an ill patient, and colors at the blue end from a well patient. However, the algorithm developers only have access to a finite number of training data points, which are represented by the black crosses (sick) and white circles (well).
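
A probability map of this kind can be computed directly when the distributions that generated the data are known. The sketch below assumes, purely for illustration, that each group's two sensor readings follow an isotropic Gaussian and that sick and well patients are equally common; the article does not state how its synthetic data were generated, and the names and values here are hypothetical.

```python
import math

# ASSUMED generating model: one isotropic 2-D Gaussian per group (illustrative values).
SICK_MEAN, WELL_MEAN, SIGMA = (1.0, 1.0), (-1.0, -1.0), 1.5

def gaussian_density(x, y, mean, sigma):
    """Density of an isotropic 2-D Gaussian at the sensor readings (x, y)."""
    dx, dy = x - mean[0], y - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def prob_sick(x, y, prior_sick=0.5):
    """Bayes' theorem: probability that readings (x, y) come from a sick patient."""
    sick = gaussian_density(x, y, SICK_MEAN, SIGMA) * prior_sick
    well = gaussian_density(x, y, WELL_MEAN, SIGMA) * (1 - prior_sick)
    return sick / (sick + well)

# Evaluate the map on a coarse grid of sensor readings, one row per y value.
classification_map = [[prob_sick(x, y) for x in (-2, -1, 0, 1, 2)] for y in (-2, -1, 0, 1, 2)]
```

Because the two assumed groups overlap, the map never reaches exactly 0 or 1. Even a classifier with full knowledge of the generating model would misclassify some points.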

Figure 5. The result of a Bayesian algorithm and the previously unseen points to be classified.

Now, classification problems come in various grades of difficulty. Some have two well-separated groups of points to be distinguished, which can be separated by a straight line (or a hyperplane in a many-dimensional setting). Others have two well-separated groups of points, but require a curved line (or separating hypersurface) to separate them. Others have overlapping sets of points, and these may require either a straight or a curved surface to do the best possible, but imperfect, job of separating them. From these data points, one can see that the two groups are overlapping, which is the type of classification problem that sorts the excellent from the mediocre in the field of statistical techniques.

Figure 6. The result of applying maximum-likelihood model fitting.

The results of the Bayesian algorithm and two different frequentist algorithms, maximum likelihood and linear discriminant analysis (LDA), are compared by using two different metrics for the comparison. The first comparison considers the background color maps in Figures 4–7, which can be regarded as classification maps. The color map created by the Bayesian algorithm (Figure 5) is a significantly better match for the true classification map (Figure 4) than either of the two frequentist methods (Figures 6 and 7).

Figure 7. The result of applying linear discriminant analysis (LDA).

For the second, more-quantitative comparison, the three algorithms are used to classify 800 previously unseen data points into sick and well groups, which are the points shown in Figures 5–7. The Bayesian algorithm correctly classified 708 of 800 points. The maximum likelihood method did significantly worse, classifying only 654 of 800 points correctly, while the LDA method classified 667 of 800 points correctly.
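
Expressed as rates, the counts above can be compared more easily. The short calculation below uses only the figures quoted in the preceding paragraph.

```python
TOTAL = 800
correct = {"Bayesian": 708, "maximum likelihood": 654, "LDA": 667}

# Accuracy and number of misclassified points for each method.
accuracy = {name: hits / TOTAL for name, hits in correct.items()}
errors = {name: TOTAL - hits for name, hits in correct.items()}

for name in correct:
    print(f"{name}: {accuracy[name]:.2%} correct, {errors[name]} misclassified")
```

The Bayesian algorithm misclassifies 92 points, compared with 146 for maximum likelihood and 133 for LDA, so in this two-sensor example the frequentist methods make roughly 45% to 60% more errors.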

The magnitude of the differences between the Bayesian solution and the frequentist techniques can only be fully appreciated in more dimensions (i.e., with more component sensors in the electronic nose). This is analogous to analyzing a larger number of genes when looking for a gene signature as a method for diagnosing cancer.

Figure 8. Performance of Bayesian and various frequentist algorithms in 32 dimensions, shown as the fraction of an unseen set of 200,000 points that was wrongly classified. MaxLike = maximum likelihood; LDA = linear discriminant analysis; NN = neural network; LR = logistic regression; PCA = principal components analysis.

Figure 8 shows the results of a similar set of runs done in 32 dimensions, comparing the classification performance of a variety of algorithms. The algorithms were still trained on only 800 data points, but the error rates shown are for 200,000 unseen data points. The Bayesian algorithm outperforms all the frequentist algorithms. While the maximum-likelihood algorithm appears in some ways very close to the Bayesian answer, it was actually started from a Bayesian solution to give it the best chance of success. (Maximum-likelihood algorithms require a starting point for the solution because they search for a maximum of a function of many variables.)

Figure 8 also demonstrates how Bayesian inference has a 100-fold-lower error rate than other statistical analysis methods. In terms of a diagnostic instrument, implementing a Bayesian algorithm would better enable an electronic nose to make a correct diagnosis, or a gene test to indicate a predisposition to cancer.

Conclusion

Frequentist methods have been the predominant approach to statistical analysis for the larger part of the last century, to the point where their adherents consider them to be the best way to analyze data and often defend them religiously. However, this article illustrates how Bayesian inference provides more accurate diagnostic classification, particularly with complex and noisy data sets such as those found when developing diagnostic products. Another benefit of Bayesian approaches is that they sometimes allow cost savings to be made with the sensor hardware. For example, savings of 100-fold in memory usage and 10-fold in central processing unit cycles needed per second, along with a large improvement in performance, have been obtained.

In addition to automated multivariate diagnostics and fluorescence applications, Bayesian techniques can be applied to other diagnostic product categories that produce large-scale data, including microarrays, flow cytometry, proteomics, and time-of-flight spectroscopy.

Another important feature of Bayesian techniques is that the user is informed about the maximum amount of information that can be extracted from the available data. This means that when conducting the analysis, the user knows how good the results are. But with frequentist methods, the user is left wondering how much better the results could be. This advantage of Bayesian inference has not been discussed in this article due to space constraints but it is a significant advantage for device developers.

However, there are many situations in which Bayesian techniques are not needed to provide adequate performance, or where an approximation to Bayesian inference will provide adequate performance. Consequently, when optimal performance is not necessary, Bayesian inference may or may not be the technique of choice. In addition, there is no doubt that because most people have been taught frequentist statistical methods at school, more effort is required to switch over and adopt the simpler Bayesian mind-set.

Nonetheless, Bayesian inference is the generally applicable technique for extracting the most information possible from difficult, messy data sets. This article has shown why further acceptance of Bayesian methodologies could lead to significant improvements in diagnostic devices.

Roger F. Sewell, DM, is a senior consultant at Cambridge Consultants Ltd. (Cambridge, UK). He can be reached at rfs@cambridgeconsultants.com.

Amanda J. Fuller, PhD, is group leader of the diagnostic products group at Cambridge Consultants Ltd. (Cambridge, UK). She can be reached at amanda.fuller@cambridgeconsultants.com.

 


Reference

1. S McKee, “FDA Approves First IVDMIA,” IVD Technology 13, no. 3 (2007): 15.

Copyright ©2008 IVD Technology