Bayesian Epidemiologic Screening Techniques

Graduate Group in Epidemiology | University of California, Davis
| Bayesian Epidemiologic Screening Techniques (BEST) Laboratory at University of California, Davis |

Glossary of Epidemiological Terms

 

Diagnostic Test: A test that directly measures a sign, substance, response, or tissue change that is either an absolute or reasonable surrogate predictor of a disease or disease agent. Diagnostic test results are typically continuous (such as an optical density reading from an ELISA), ordered (such as serum neutralization titers), or dichotomous (such as the presence or absence of a precipitate on an AGID). By convention, continuous or ordered results are frequently dichotomized for decision-making purposes. The accuracy of a diagnostic test is commonly measured by its sensitivity and specificity.

Cutoff: For an ordinal or continuous diagnostic test, a cutoff value, k, is a value that determines the test outcome, e.g. positive if the test value exceeds k and negative otherwise.

False Negative: An individual that is truly positive for a disease, but which a diagnostic test classifies as disease-free.

False Positive: An individual that is truly disease-free, but which a diagnostic test classifies as positive for a disease.

True Negative: A negative test result for an individual that is truly negative for a particular disease.

True Positive: A positive test result for an individual that is truly positive for a particular disease.

Accuracy (syn. validity): The ability of a diagnostic test to produce correct test results. Measures of diagnostic accuracy include sensitivity and specificity.

Sensitivity: The probability that a dichotomous test yields a positive result, given that the true status of the individual tested is positive for the disease. For example, if Y denotes a test result (0 = negative, 1 = positive) and Z denotes the true status of an individual (again, 0 = negative, 1 = positive), then:

Sensitivity = Pr (Y = 1 | Z = 1) = Se

Specificity: The probability that a dichotomous test returns a negative result, given that the true status of the individual tested is negative for the disease. For example, if Y denotes a test result (0 = negative, 1 = positive) and Z denotes the true status of an individual (again, 0 = negative, 1 = positive), then:

Specificity = Pr (Y = 0 | Z = 0) = Sp
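
As an illustration, sensitivity and specificity can be estimated from a cross-classification of test results against true status. The sketch below assumes the true status of every individual is known; the function names and counts are hypothetical and are not taken from any particular study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Estimate Se = Pr(Y = 1 | Z = 1) as TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Estimate Sp = Pr(Y = 0 | Z = 0) as TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 100 truly positive and 200 truly negative individuals.
se = sensitivity(true_pos=90, false_neg=10)
sp = specificity(true_neg=190, false_pos=10)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")
```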

Gold Standard: A diagnostic test that has perfect sensitivity and perfect specificity.

Receiver Operating Characteristic (ROC) Curve: For an ordinal or continuous diagnostic test, this curve gives the plot of all pairs (Se(k), 1 - Sp(k)) over all possible cutoff values, k.
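
A minimal sketch of how the points of an ROC curve might be computed, assuming the true status of each individual is known and that a result is positive when the test value exceeds the cutoff k. The function name and the test values are hypothetical.

```python
def roc_points(diseased, healthy):
    """Return (k, Se(k), 1 - Sp(k)) for each candidate cutoff k.

    A result is called positive when the test value exceeds k.
    """
    cutoffs = sorted(set(diseased) | set(healthy))
    # Add one cutoff below every observed value so that the point
    # (Se, 1 - Sp) = (1, 1) is part of the curve.
    cutoffs = [cutoffs[0] - 1.0] + cutoffs
    points = []
    for k in cutoffs:
        se = sum(v > k for v in diseased) / len(diseased)
        one_minus_sp = sum(v > k for v in healthy) / len(healthy)
        points.append((k, se, one_minus_sp))
    return points


# Hypothetical continuous test values (for example, optical densities).
diseased = [0.8, 1.1, 1.4, 1.9]
healthy = [0.2, 0.4, 0.7, 1.0]
points = roc_points(diseased, healthy)
for k, se, one_minus_sp in points:
    print(f"k = {k:5.2f}  Se = {se:.2f}  1 - Sp = {one_minus_sp:.2f}")
```

Raising the cutoff lowers both Se(k) and 1 - Sp(k), so the curve runs from (1, 1) at the lowest cutoff to (0, 0) at the highest.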

Conditional Independence: Two tests are conditionally independent when the sensitivity (or specificity) of the second test (T2) does not depend on whether results of the first test (T1) are positive or negative among infected (or non-infected) individuals. For example, if a test with sensitivity = 0.90 is used to test a population of 100 infected animals, we would expect 10 animals to yield false negative test results. If a second test with a sensitivity = 0.80 is used to test the 10 animals that initially tested negative and the 2 tests were conditionally independent, then 8 of 10 animals would be expected to test positive on the second test. Hence, the sensitivity of the second test is 0.80 regardless of results of the first test, i.e.

Pr (T2+ | T1+, infected) = Pr (T2+ | T1-, infected) = Pr (T2+ | infected) = 0.80

Similarly if T2 were performed first, conditional independence means that

Pr (T1+ | T2+, infected) = Pr (T1+ | T2-, infected) = Pr (T1+ | infected) = 0.90

We note that if either of the tests is perfectly sensitive (specific) then the test sensitivities (specificities) are conditionally independent, by definition. The terms “dependence” and “correlation” are used interchangeably by some authors, but the former term is preferable when binary tests are used.
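
The worked example above can be checked numerically. Under conditional independence, the joint probability of any pair of results among infected animals is the product of the marginal probabilities; the sketch below uses that product rule to compute expected counts. The function name is hypothetical.

```python
def expected_counts_infected(n, se1, se2):
    """Expected counts of (T1, T2) result pairs among n infected animals,
    assuming the two tests are conditionally independent."""
    return {
        ("+", "+"): n * se1 * se2,
        ("+", "-"): n * se1 * (1 - se2),
        ("-", "+"): n * (1 - se1) * se2,
        ("-", "-"): n * (1 - se1) * (1 - se2),
    }


counts = expected_counts_infected(n=100, se1=0.90, se2=0.80)

# Sensitivity of T2 within each stratum of T1 results.
se2_given_t1_pos = counts[("+", "+")] / (counts[("+", "+")] + counts[("+", "-")])
se2_given_t1_neg = counts[("-", "+")] / (counts[("-", "+")] + counts[("-", "-")])
print(round(se2_given_t1_pos, 2), round(se2_given_t1_neg, 2))
```

Both stratum-specific sensitivities equal the marginal sensitivity of T2, which is what conditional independence requires.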

Prevalence: The fraction of a sample of individuals that has some disease of interest at a particular point in time. Prevalence is frequently denoted by either p, or the Greek letter for p, namely pi. For example, if n individuals are sampled at a given time, and y individuals are classified as positive for the disease in question, then the prevalence of that disease at that point in time is estimated to be y/n.  If the entire population has been sampled, then the prevalence, p, is exactly y/n.

Apparent Prevalence (AP): The probability that a randomly selected unit of analysis has a positive test result. The apparent prevalence can be expressed using the prevalence (p), sensitivity (Se), and specificity (Sp) as:

AP = p*Se + (1 - p)(1 - Sp)
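
A short sketch of the formula, with hypothetical input values, shows how an imperfect test can make the apparent prevalence differ from the true prevalence.

```python
def apparent_prevalence(p: float, se: float, sp: float) -> float:
    """AP = p*Se + (1 - p)*(1 - Sp): true positives plus false positives."""
    return p * se + (1 - p) * (1 - sp)


# Hypothetical values: true prevalence 10%, Se = 0.90, Sp = 0.95.
ap = apparent_prevalence(p=0.10, se=0.90, sp=0.95)
print(f"AP = {ap:.3f}")
```

With these inputs the apparent prevalence (about 0.135) exceeds the true prevalence (0.10) because false positives among the many disease-free individuals outnumber the false negatives.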

Predictive Value Positive (PVP): The probability that an individual is truly positive for a disease, given that a dichotomous test returns a positive result. For example, if Y denotes a test result (0 = negative, 1 = positive) and Z denotes the true status of an individual (again, 0 = negative, 1 = positive), then:

PVP = Pr (Z = 1 | Y = 1)

Note that PVP is dependent both on the diagnostic test characteristics (sensitivity and specificity), and prevalence. Letting p = prevalence, Se = sensitivity, and Sp = specificity, the PVP is given by:

PVP = p*Se / [p*Se + (1 - p)(1 - Sp)]

Predictive Value Negative (PVN): The probability that an individual is truly disease negative, given that a dichotomous test returns a negative result. For example, if Y denotes a test result (0 = negative, 1 = positive) and Z denotes the true status of an individual (again, 0 = negative, 1 = positive), then:

PVN = Pr (Z = 0 | Y = 0)

Note that PVN is dependent both on the diagnostic test characteristics (sensitivity and specificity), and prevalence. Letting p = prevalence, Se = sensitivity, and Sp = specificity, the PVN is given by:

PVN = (1-p)Sp / [(1 - p) Sp + p (1 - Se)]
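
The dependence of both predictive values on prevalence can be seen in a short sketch. The test characteristics and prevalences below are hypothetical values chosen for illustration.

```python
def predictive_value_positive(p: float, se: float, sp: float) -> float:
    """PVP = p*Se / [p*Se + (1 - p)*(1 - Sp)]."""
    return p * se / (p * se + (1 - p) * (1 - sp))


def predictive_value_negative(p: float, se: float, sp: float) -> float:
    """PVN = (1 - p)*Sp / [(1 - p)*Sp + p*(1 - Se)]."""
    return (1 - p) * sp / ((1 - p) * sp + p * (1 - se))


se, sp = 0.90, 0.95  # hypothetical test characteristics
for p in (0.01, 0.10, 0.50):
    pvp = predictive_value_positive(p, se, sp)
    pvn = predictive_value_negative(p, se, sp)
    print(f"p = {p:.2f}  PVP = {pvp:.3f}  PVN = {pvn:.3f}")
```

For a fixed test, PVP rises and PVN falls as prevalence increases, so the same test result carries different meaning in low-prevalence and high-prevalence populations.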

Likelihood: The joint probability (or density) of the data that were actually observed, regarded as a function of all the unknown parameters, say θ, and written as L(θ). In statistics and epidemiology, the term "likelihood" has a very specific meaning and should not be interchanged with descriptors that refer to probability or frequency of events. The method of maximum likelihood finds those values of θ that maximize L(θ), often by means of iterative maxima-seeking algorithms.
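
As a simple illustration, consider y test-positive individuals out of n sampled, with a binomial likelihood for a single unknown prevalence. The sketch below evaluates the log-likelihood on a grid and picks the maximizer; the grid search stands in for the iterative algorithms mentioned above, and the data, the grid, and the name `theta` are assumptions made for the example.

```python
import math


def binomial_log_likelihood(theta: float, y: int, n: int) -> float:
    """Log of L(theta) = C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return (
        math.log(math.comb(n, y))
        + y * math.log(theta)
        + (n - y) * math.log(1 - theta)
    )


y, n = 12, 100  # hypothetical data: 12 positives among 100 sampled
grid = [i / 1000 for i in range(1, 1000)]  # candidate values in (0, 1)
theta_hat = max(grid, key=lambda t: binomial_log_likelihood(t, y, n))
print(f"maximum likelihood estimate = {theta_hat:.3f}")
```

For the binomial model the maximizer is known in closed form to be y/n, so the grid search here only serves to show what "maximizing the likelihood" means.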

Logistic Regression Model: A subclass of the generalized linear model in which a dichotomous outcome is modeled as a function of regression coefficients and covariates using a logit link. The logistic regression model, like the probit and complementary log-log regression models, is a natural choice when modeling probabilities, such as prevalence and incidence proportions. Moreover, simple functions of the coefficients, such as odds ratios, are estimated using logistic regression. These quantities have epidemiologic meaning under a variety of conditions, such as the rare disease assumption. More formally, the logistic regression model is written as follows:

logit(p) = x'β, y ~ binomial(n, p)

where p is a probability of interest, usually prevalence or an incidence proportion in epidemiologic studies; x is a vector of covariates whose first element is 1; β is the vector of regression coefficients; y is the number of disease-positive individuals with the covariate pattern x; and n is the total number of individuals with that covariate pattern. Note that since logit(p) = ln [p / (1 - p)], it follows that p = exp(x'β) / [1 + exp(x'β)]. Note also that the odds ratio for a one-unit increase in a particular covariate, say x_i, with the other covariates held fixed, is given by exp(β_i).
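
The relationship between the linear predictor, the probability, and the odds ratio can be sketched numerically. The coefficient values below are hypothetical and are not estimates from any data set; the code only evaluates the model and does not fit it.

```python
import math


def logit(p: float) -> float:
    """logit(p) = ln[p / (1 - p)]."""
    return math.log(p / (1 - p))


def inv_logit(eta: float) -> float:
    """p = exp(eta) / [1 + exp(eta)], written in an equivalent form."""
    return 1 / (1 + math.exp(-eta))


def linear_predictor(x, coef):
    """Inner product of the covariate vector and the coefficient vector."""
    return sum(x_i * c_i for x_i, c_i in zip(x, coef))


coef = [-2.0, 0.7]   # hypothetical intercept and exposure coefficient
x_unexposed = [1, 0]  # first covariate is the constant 1
x_exposed = [1, 1]

p0 = inv_logit(linear_predictor(x_unexposed, coef))
p1 = inv_logit(linear_predictor(x_exposed, coef))
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(f"p0 = {p0:.3f}, p1 = {p1:.3f}, OR = {odds_ratio:.3f}")
```

The odds ratio computed from the two probabilities agrees with the exponentiated exposure coefficient, up to floating-point rounding.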

Logit Link: A function that transforms a probability, with support on (0, 1), to a quantity with support equal to the entire real line. The link function is applied to the probability of success for some outcome modeled using logistic regression. For some probability, p, the logit link is defined as:

logit(p) = ln [p / (1 - p)]

Bayesian Statistics: The process by which prior uncertainty about a quantity or quantities is formally described and, through the application of Bayes' Theorem, updated after observing data. Statistics is primarily concerned with the modeling and analysis of data, either to assist in the appreciation of some underlying mechanism, or to facilitate effective decisions. In both cases, there is uncertainty and the statistician's tasks are both to reduce this uncertainty and to explain it as clearly as possible.  The Bayesian method stems from the appreciation that all uncertainty must be described by probability and that probability laws must be obeyed in order to produce coherent statistical inferences.  This view contends that probability is the only sensible language for dealing with the logic of uncertainty.

Prior: A probability distribution reflecting previous experimental data and/or scientific judgment that provides the basis for a Bayesian statistical model. When appropriately combined with the observed data, the prior is "updated" to provide the "posterior distribution", which is used to make inferences and draw conclusions.

Posterior Distribution: The posterior distribution is the probability distribution that reflects uncertainty about a parameter or parameters of interest, after combining scientific information with data, using Bayes' theorem.

Credible Interval (syn. "posterior probability interval" or "credibility interval"): The calculated interval that has a specified (subjective) probability of containing a parameter of interest (such as a regression coefficient, or hazard ratio, for example), given the observed data. For example, if one obtained a 95% credible interval for some parameter, say, sensitivity, of (0.85, 0.96) with a mode of 0.92, then we would conclude that the most likely value of sensitivity was 0.92 and that we were 95% certain that the true value of sensitivity was between 0.85 and 0.96.
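
A minimal sketch of how a prior, a posterior, and a credible interval fit together, under assumptions made only for this example: sensitivity is given a Beta(a, b) prior, and y of n truly infected individuals test positive. With a binomial likelihood the posterior is then Beta(a + y, b + n - y). The interval is an equal-tailed interval approximated from Monte Carlo draws; the function names, data, and seed are hypothetical.

```python
import random


def beta_binomial_posterior(a: float, b: float, y: int, n: int):
    """Update a Beta(a, b) prior with y positives out of n trials."""
    return a + y, b + n - y


def credible_interval(a_post, b_post, level=0.95, draws=20000, seed=1):
    """Equal-tailed credible interval approximated by sorted posterior draws."""
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(a_post, b_post) for _ in range(draws))
    lower_index = int((1 - level) / 2 * draws)
    upper_index = int((1 + level) / 2 * draws) - 1
    return sample[lower_index], sample[upper_index]


# Hypothetical data: uniform Beta(1, 1) prior, 46 of 50 infected test positive.
a_post, b_post = beta_binomial_posterior(a=1, b=1, y=46, n=50)
lower, upper = credible_interval(a_post, b_post)
print(f"posterior: Beta({a_post}, {b_post})")
print(f"95% credible interval for Se: ({lower:.3f}, {upper:.3f})")
```

An equal-tailed interval is only one possible choice; a highest posterior density interval would generally differ slightly for a skewed posterior such as this one.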

Gibbs Sampler: A Markov chain Monte Carlo (MCMC) method for approximating complex posterior distributions in Bayesian analyses, in which each parameter is sampled in turn from its full conditional distribution given the current values of the other parameters. For instance, for a posterior distribution of interest, say, p(α, β, γ | data), a Gibbs sampler would be run by selecting initial values of the parameters, say α(1), β(1), and γ(1), and then iteratively sampling:

(1) α(i+1) | β(i), γ(i), data from the full conditional for α,
(2) β(i+1) | α(i+1), γ(i), data from the full conditional for β, and
(3) γ(i+1) | α(i+1), β(i+1), data from the full conditional for γ, and then continuing with
(4) α(i+2) | β(i+1), γ(i+1), data from the full conditional for α,
etc.,

for i = 1, 2, ..., MC, where MC is the Monte Carlo sample size. Inferences about the parameters α, β, and γ are then based on numerical approximations of their respective posterior distributions, which are essentially obtained as histograms of these Monte Carlo samples.
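
The alternating scheme can be sketched for a simpler, two-parameter problem than the three-parameter one described above. The example assumes normally distributed data with unknown mean `mu` and precision `tau`, a normal prior on `mu`, and a gamma prior on `tau`, for which both full conditionals are standard distributions. The data, prior settings, burn-in length, and all names are assumptions made for illustration.

```python
import random
import statistics


def gibbs_normal(y, n_iter=5000, burn_in=500, seed=1,
                 m0=0.0, t0=1e-4, a0=0.01, b0=0.01):
    """Gibbs sampler for y_j ~ Normal(mu, 1/tau) with priors
    mu ~ Normal(m0, 1/t0) and tau ~ Gamma(shape=a0, rate=b0)."""
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    mu, tau = statistics.mean(y), 1.0  # initial values
    draws = []
    for i in range(n_iter):
        # (1) Sample mu from its full conditional given tau and the data.
        precision = t0 + n * tau
        mean = (t0 * m0 + tau * sum_y) / precision
        mu = rng.gauss(mean, precision ** -0.5)
        # (2) Sample tau from its full conditional given the new mu.
        shape = a0 + n / 2
        rate = b0 + 0.5 * sum((v - mu) ** 2 for v in y)
        tau = rng.gammavariate(shape, 1 / rate)  # second argument is a scale
        if i >= burn_in:  # discard early draws (an added convention)
            draws.append((mu, tau))
    return draws


y = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8]  # hypothetical measurements
draws = gibbs_normal(y)
post_mean_mu = statistics.mean(d[0] for d in draws)
print(f"posterior mean of mu is approximately {post_mean_mu:.2f}")
```

A histogram of the retained `mu` draws would approximate its marginal posterior distribution, in the sense described above.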

Confounder: A variable that is associated with an independent variable as well as the outcome of interest. For example, if using logistic regression to model the presence of a disease, Y, as a function of some covariate of interest, X, then a confounding variable, W, would be associated both with X and Y. The estimated regression coefficient relating X with Y, say b, would be biased if W were not included in the regression model along with X. Thus, if W were excluded, one would say that the estimated effect of X on Y was confounded by W.

Incidence Proportion (IP): The proportion of healthy individuals that develop some disease of interest during a defined period of time. For example, if at time = 0, a sample of n individuals is completely disease free, but by time = 1, y individuals have experienced a disease "incident," then the IP of that disease is equal to y/n during the time interval from 0 to 1. Note that specification of the time interval associated with the fraction y/n is an important component of the definition of the incidence proportion. Note also that in some epidemiology texts and journal articles, the term "cumulative incidence" is used in place of incidence proportion.

 

©2000-2007 Department of Medicine and Epidemiology, University of California, Davis