Saturday, May 25, 2019

Measurement uncertainty

From Wikipedia, the free encyclopedia

In metrology, measurement uncertainty is the expression of the statistical dispersion of the values attributed to a measured quantity. All measurements are subject to uncertainty and a measurement result is complete only when it is accompanied by a statement of the associated uncertainty, such as the standard deviation. By international agreement, this uncertainty has a probabilistic basis and reflects incomplete knowledge of the quantity value. It is a non-negative parameter.

The measurement uncertainty is often taken as the standard deviation of a state-of-knowledge probability distribution over the possible values that could be attributed to a measured quantity. Relative uncertainty is the measurement uncertainty relative to the magnitude of a particular single choice for the value for the measured quantity, when this choice is nonzero. This particular single choice is usually called the measured value, which may be optimal in some well-defined sense (e.g., a mean, median, or mode). Thus, the relative measurement uncertainty is the measurement uncertainty divided by the absolute value of the measured value, when the measured value is not zero.
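As a minimal sketch of the definition above, relative uncertainty can be computed as follows. The function name and the example reading are hypothetical and chosen only for illustration.

```python
def relative_uncertainty(measured_value: float, standard_uncertainty: float) -> float:
    """Return the uncertainty divided by the absolute measured value."""
    if standard_uncertainty < 0:
        raise ValueError("measurement uncertainty is a non-negative parameter")
    if measured_value == 0:
        raise ValueError("relative uncertainty is undefined for a zero measured value")
    return standard_uncertainty / abs(measured_value)


# Hypothetical reading: -12.50 mm with a standard uncertainty of 0.05 mm.
print(relative_uncertainty(-12.50, 0.05))
```

The absolute value keeps the result non-negative even when the measured value is negative.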

Background

The purpose of measurement is to provide information about a quantity of interest – a measurand. For example, the measurand might be the size of a cylindrical feature, the volume of a vessel, the potential difference between the terminals of a battery, or the mass concentration of lead in a flask of water. 

No measurement is exact. When a quantity is measured, the outcome depends on the measuring system, the measurement procedure, the skill of the operator, the environment, and other effects. Even if the quantity were to be measured several times, in the same way and in the same circumstances, a different measured value would in general be obtained each time, assuming the measuring system has sufficient resolution to distinguish between the values. 

The dispersion of the measured values would relate to how well the measurement is performed. Their average would provide an estimate of the true value of the quantity that generally would be more reliable than an individual measured value. The dispersion and the number of measured values would provide information relating to the average value as an estimate of the true value. However, this information would not generally be adequate. 

The measuring system may provide measured values that are not dispersed about the true value, but about some value offset from it. Take a domestic bathroom scale. Suppose it is not set to show zero when there is nobody on the scale, but to show some value offset from zero. Then, no matter how many times the person's mass were re-measured, the effect of this offset would be inherently present in the average of the values. 
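The bathroom scale example can be illustrated with a small simulation. This is only a sketch: the true mass, the offset, and the Gaussian noise model are assumptions made for the illustration, not values from any real instrument.

```python
import random
import statistics


def simulate_scale(true_mass, offset, noise_sd, n, seed=0):
    """Return n simulated readings: true mass + fixed offset + random noise."""
    rng = random.Random(seed)
    return [true_mass + offset + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Assumed values: 70 kg person, scale reads 1.5 kg high, 0.3 kg random scatter.
readings = simulate_scale(true_mass=70.0, offset=1.5, noise_sd=0.3, n=10_000)
mean = statistics.fmean(readings)
sd_of_mean = statistics.stdev(readings) / len(readings) ** 0.5

print(f"mean reading        = {mean:.3f} kg")
print(f"std. dev. of mean   = {sd_of_mean:.4f} kg")
print(f"mean minus true value = {mean - 70.0:.3f} kg")
```

Averaging many readings shrinks the standard deviation of the mean, but the mean stays close to the true mass plus the offset, not the true mass itself.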

Measurement uncertainty has important economic consequences for calibration and measurement activities. In calibration reports, the magnitude of the uncertainty is often taken as an indication of the quality of the laboratory, and smaller uncertainty values generally are of higher value and of higher cost. The American Society of Mechanical Engineers (ASME) has produced a suite of standards addressing various aspects of measurement uncertainty. For example, ASME standards are used to address the role of measurement uncertainty when accepting or rejecting products based on a measurement result and a product specification, provide a simplified approach (relative to the GUM) to the evaluation of dimensional measurement uncertainty, resolve disagreements over the magnitude of the measurement uncertainty statement, or provide guidance on the risks involved in any product acceptance/rejection decision.

The "Guide to the Expression of Uncertainty in Measurement", commonly known as the GUM, is the definitive document on this subject. The GUM has been adopted by all major National Measurement Institutes (NMIs) and by international laboratory accreditation standards such as ISO/IEC 17025, General requirements for the competence of testing and calibration laboratories, which is required for international laboratory accreditation. It is also employed in most modern national and international documentary standards on measurement methods and technology. See Joint Committee for Guides in Metrology.

Indirect measurement

The above discussion concerns the direct measurement of a quantity, which incidentally occurs rarely. For example, the bathroom scale may convert a measured extension of a spring into an estimate of the measurand, the mass of the person on the scale. The particular relationship between extension and mass is determined by the calibration of the scale. A measurement model converts a quantity value into the corresponding value of the measurand. 

There are many types of measurement in practice and therefore many models. A simple measurement model (for example for a scale, where the mass is proportional to the extension of the spring) might be sufficient for everyday domestic use. Alternatively, a more sophisticated model of a weighing, involving additional effects such as air buoyancy, is capable of delivering better results for industrial or scientific purposes. In general there are often several different quantities, for example temperature, humidity and displacement, that contribute to the definition of the measurand, and that need to be measured. 

Correction terms should be included in the measurement model when the conditions of measurement are not exactly as stipulated. These terms correspond to systematic errors. Given an estimate of a correction term, the relevant quantity should be corrected by this estimate. There will be an uncertainty associated with the estimate, even if the estimate is zero, as is often the case. Instances of systematic errors arise in height measurement, when the alignment of the measuring instrument is not perfectly vertical, and the ambient temperature is different from that prescribed. Neither the alignment of the instrument nor the ambient temperature is specified exactly, but information concerning these effects is available, for example the lack of alignment is at most 0.001° and the ambient temperature at the time of measurement differs from that stipulated by at most 2 °C. 

As well as raw data representing measured values, there is another form of data that is frequently needed in a measurement model. Some such data relate to quantities representing physical constants, each of which is known imperfectly. Examples are material constants such as modulus of elasticity and specific heat. There are often other relevant data given in reference books, calibration certificates, etc., regarded as estimates of further quantities. 

The items required by a measurement model to define a measurand are known as input quantities in a measurement model. The model is often referred to as a functional relationship. The output quantity in a measurement model is the measurand. 

Formally, the output quantity, denoted by $Y$, about which information is required, is often related to input quantities, denoted by $X_1, \ldots, X_N$, about which information is available, by a measurement model in the form of
$$Y = f(X_1, \ldots, X_N),$$
where $f$ is known as the measurement function. A general expression for a measurement model is
$$h(Y, X_1, \ldots, X_N) = 0.$$
It is taken that a procedure exists for calculating $Y$ given $X_1, \ldots, X_N$, and that $Y$ is uniquely defined by this equation.

Propagation of distributions

The true values of the input quantities $X_1, \ldots, X_N$ are unknown. In the GUM approach, $X_1, \ldots, X_N$ are characterized by probability distributions and treated mathematically as random variables. These distributions describe the respective probabilities of their true values lying in different intervals, and are assigned based on available knowledge concerning $X_1, \ldots, X_N$. Sometimes, some or all of $X_1, \ldots, X_N$ are interrelated and the relevant distributions, which are known as joint, apply to these quantities taken together.

Consider estimates $x_1, \ldots, x_N$, respectively, of the input quantities $X_1, \ldots, X_N$, obtained from certificates and reports, manufacturers' specifications, the analysis of measurement data, and so on. The probability distributions characterizing $X_1, \ldots, X_N$ are chosen such that the estimates $x_1, \ldots, x_N$, respectively, are the expectations of $X_1, \ldots, X_N$. Moreover, for the $i$th input quantity, consider a so-called standard uncertainty, given the symbol $u(x_i)$, defined as the standard deviation of the input quantity $X_i$. This standard uncertainty is said to be associated with the (corresponding) estimate $x_i$.

The use of available knowledge to establish a probability distribution to characterize each quantity of interest applies to the $X_i$ and also to $Y$. In the latter case, the characterizing probability distribution for $Y$ is determined by the measurement model together with the probability distributions for the $X_i$. The determination of the probability distribution for $Y$ from this information is known as the propagation of distributions.

The figure below depicts a measurement model $Y = X_1 + X_2$ in the case where $X_1$ and $X_2$ are each characterized by a (different) rectangular, or uniform, probability distribution. $Y$ has a symmetric trapezoidal probability distribution in this case.

An additive measurement function with two input quantities $X_1$ and $X_2$ characterized by rectangular probability distributions

Once the input quantities $X_1, \ldots, X_N$ have been characterized by appropriate probability distributions, and the measurement model has been developed, the probability distribution for the measurand $Y$ is fully specified in terms of this information. In particular, the expectation of $Y$ is used as the estimate of $Y$, and the standard deviation of $Y$ as the standard uncertainty associated with this estimate.

Often an interval containing $Y$ with a specified probability is required. Such an interval, a coverage interval, can be deduced from the probability distribution for $Y$. The specified probability is known as the coverage probability. For a given coverage probability, there is more than one coverage interval. The probabilistically symmetric coverage interval is an interval for which the probabilities (summing to one minus the coverage probability) of a value to the left and the right of the interval are equal. The shortest coverage interval is an interval for which the length is least over all coverage intervals having the same coverage probability.
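The two kinds of coverage interval can be sketched numerically for an additive model with two rectangular inputs. The limits of the rectangular distributions, the sample size, and the function names below are assumptions made for illustration; the intervals are estimated from sorted random draws rather than derived analytically.

```python
import random
import statistics


def draw_output(n, seed=1):
    """Draw n sorted samples of an output that is the sum of two rectangular inputs."""
    rng = random.Random(seed)
    # Assumed limits: first input uniform on [0, 2], second uniform on [0, 1].
    return sorted(rng.uniform(0.0, 2.0) + rng.uniform(0.0, 1.0) for _ in range(n))


def symmetric_interval(sorted_y, p):
    """Interval leaving probability (1 - p) / 2 outside on each side."""
    n = len(sorted_y)
    lo = round((1 - p) / 2 * n)
    hi = round((1 + p) / 2 * n) - 1
    return sorted_y[lo], sorted_y[hi]


def shortest_interval(sorted_y, p):
    """Narrowest window that contains a fraction p of the sorted samples."""
    n = len(sorted_y)
    k = round(p * n)  # number of samples the interval must contain
    start = min(range(n - k + 1), key=lambda i: sorted_y[i + k - 1] - sorted_y[i])
    return sorted_y[start], sorted_y[start + k - 1]


ys = draw_output(100_000)
estimate = statistics.fmean(ys)   # expectation, used as the estimate
std_unc = statistics.stdev(ys)    # standard deviation, used as the standard uncertainty
sym = symmetric_interval(ys, 0.95)
short = shortest_interval(ys, 0.95)

print(f"estimate = {estimate:.3f}, standard uncertainty = {std_unc:.3f}")
print(f"symmetric 95 % interval: [{sym[0]:.3f}, {sym[1]:.3f}]")
print(f"shortest  95 % interval: [{short[0]:.3f}, {short[1]:.3f}]")
```

Because the trapezoidal distribution is symmetric, the two intervals nearly coincide here; for a skewed distribution the shortest interval would generally be narrower.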

Prior knowledge about the true value of the output quantity $Y$ can also be considered. For the domestic bathroom scale, the fact that the person's mass is positive, and that it is the mass of a person, rather than that of a motor car, that is being measured, both constitute prior knowledge about the possible values of the measurand in this example. Such additional information can be used to provide a probability distribution for $Y$ that can give a smaller standard deviation for $Y$ and hence a smaller standard uncertainty associated with the estimate of $Y$.

Type A and Type B evaluation of uncertainty

Knowledge about an input quantity $X_i$ is inferred from repeated measured values ("Type A evaluation of uncertainty"), or scientific judgement or other information concerning the possible values of the quantity ("Type B evaluation of uncertainty").

In Type A evaluations of measurement uncertainty, the assumption is often made that the distribution best describing an input quantity $X$ given repeated measured values of it (obtained independently) is a Gaussian distribution. $X$ then has expectation equal to the average measured value and standard deviation equal to the standard deviation of the average. When the uncertainty is evaluated from a small number of measured values (regarded as instances of a quantity characterized by a Gaussian distribution), the corresponding distribution can be taken as a t-distribution. Other considerations apply when the measured values are not obtained independently.
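A Type A evaluation under the Gaussian assumption can be sketched as follows. The readings are invented for illustration, and the function name is hypothetical; the sketch does not cover the t-distribution refinement for small samples or non-independent values.

```python
import statistics


def type_a_evaluation(values):
    """Return (estimate, standard uncertainty) from independent repeated readings.

    The estimate is the average; the standard uncertainty is the
    standard deviation of the average, s / sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two readings are needed")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return mean, s / n ** 0.5


# Hypothetical repeated readings of the same quantity.
readings = [10.03, 9.98, 10.01, 10.05, 9.97, 10.02]
x, u_x = type_a_evaluation(readings)
print(f"estimate = {x:.4f}, standard uncertainty = {u_x:.4f}")
```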

For a Type B evaluation of uncertainty, often the only available information is that $X$ lies in a specified interval $[a, b]$. In such a case, knowledge of the quantity can be characterized by a rectangular probability distribution with limits $a$ and $b$. If different information were available, a probability distribution consistent with that information would be used.
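For the rectangular case, the expectation is the midpoint of the interval and the standard deviation is the interval width divided by the square root of 12. A minimal sketch, using the earlier temperature example (a deviation of at most 2 °C in either direction) as the assumed interval:

```python
import math


def rectangular_type_b(a, b):
    """Return (estimate, standard uncertainty) for a rectangular distribution on [a, b]."""
    if not a < b:
        raise ValueError("require a < b")
    estimate = (a + b) / 2
    standard_uncertainty = (b - a) / math.sqrt(12)
    return estimate, standard_uncertainty


# Temperature deviation known only to lie within [-2, 2] degrees Celsius.
est, u = rectangular_type_b(-2.0, 2.0)
print(f"estimate = {est:.3f} °C, standard uncertainty = {u:.3f} °C")
```

The estimate of the correction is zero, yet its standard uncertainty is not, which matches the earlier remark that an uncertainty is associated with a correction even when the estimate is zero.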

Sensitivity coefficients

Sensitivity coefficients $c_1, \ldots, c_N$ describe how the estimate $y$ of $Y$ would be influenced by small changes in the estimates $x_1, \ldots, x_N$ of the input quantities $X_1, \ldots, X_N$. For the measurement model $Y = f(X_1, \ldots, X_N)$, the sensitivity coefficient $c_i$ equals the partial derivative of first order of $f$ with respect to $X_i$ evaluated at $X_1 = x_1$, $X_2 = x_2$, etc. For a linear measurement model
$$Y = c_1 X_1 + \cdots + c_N X_N,$$
with $X_1, \ldots, X_N$ independent, a change in $x_i$ equal to $u(x_i)$ would give a change $c_i u(x_i)$ in $y$. This statement would generally be approximate for measurement models $Y = f(X_1, \ldots, X_N)$. The relative magnitudes of the terms $|c_i| u(x_i)$ are useful in assessing the respective contributions from the input quantities to the standard uncertainty $u(y)$ associated with $y$. The standard uncertainty $u(y)$ associated with the estimate $y$ of the output quantity $Y$ is not given by the sum of the $|c_i| u(x_i)$, but these terms combined in quadrature, namely by an expression that is generally approximate for measurement models $Y = f(X_1, \ldots, X_N)$:
$$u^2(y) = c_1^2 u^2(x_1) + \cdots + c_N^2 u^2(x_N),$$
which is known as the law of propagation of uncertainty.

When the input quantities $X_i$ contain dependencies, the above formula is augmented by terms containing covariances, which may increase or decrease $u(y)$.
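The law of propagation of uncertainty for independent inputs can be sketched with sensitivity coefficients estimated by central finite differences. The density model, the input values, and the step-size rule are assumptions for illustration; covariance terms are deliberately left out.

```python
import math


def sensitivity_coefficients(f, x, rel_step=1e-6):
    """Approximate the partial derivatives of f at x by central differences."""
    coeffs = []
    for i, xi in enumerate(x):
        h = rel_step * max(abs(xi), 1.0)
        up = list(x)
        dn = list(x)
        up[i] = xi + h
        dn[i] = xi - h
        coeffs.append((f(*up) - f(*dn)) / (2 * h))
    return coeffs


def combined_standard_uncertainty(f, x, u):
    """Combine the terms c_i * u(x_i) in quadrature (independent inputs only)."""
    c = sensitivity_coefficients(f, x)
    return math.sqrt(sum((ci * ui) ** 2 for ci, ui in zip(c, u)))


def density(mass, volume):
    """Hypothetical measurement function: density = mass / volume."""
    return mass / volume


x = [250.0, 100.0]  # assumed estimates: mass in g, volume in cm^3
u = [0.5, 0.2]      # assumed standard uncertainties of those estimates

c = sensitivity_coefficients(density, x)
u_y = combined_standard_uncertainty(density, x, u)
print(f"y = {density(*x):.4f}, c = {c}, u(y) = {u_y:.5f}")
```

Comparing the individual products of coefficient and input uncertainty shows which input dominates the combined result; here the two contributions happen to be equal.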

Uncertainty evaluation

The main stages of uncertainty evaluation are formulation and calculation, the latter consisting of propagation and summarizing. The formulation stage consists of
  1. defining the output quantity $Y$ (the measurand),
  2. identifying the input quantities on which $Y$ depends,
  3. developing a measurement model relating $Y$ to the input quantities, and
  4. on the basis of available knowledge, assigning probability distributions, such as Gaussian or rectangular, to the input quantities (or a joint probability distribution to those input quantities that are not independent).
The calculation stage consists of propagating the probability distributions for the input quantities through the measurement model to obtain the probability distribution for the output quantity $Y$, and summarizing by using this distribution to obtain
  1. the expectation of $Y$, taken as an estimate $y$ of $Y$,
  2. the standard deviation of $Y$, taken as the standard uncertainty $u(y)$ associated with $y$, and
  3. a coverage interval containing $Y$ with a specified coverage probability.
The propagation stage of uncertainty evaluation is known as the propagation of distributions, various approaches for which are available, including
  1. the GUM uncertainty framework, constituting the application of the law of propagation of uncertainty, and the characterization of the output quantity $Y$ by a Gaussian or a $t$-distribution,
  2. analytic methods, in which mathematical analysis is used to derive an algebraic form for the probability distribution for $Y$, and
  3. a Monte Carlo method, in which an approximation to the distribution function for $Y$ is established numerically by making random draws from the probability distributions for the input quantities, and evaluating the model at the resulting values.
For any particular uncertainty evaluation problem, approach 1), 2) or 3) (or some other approach) is used, 1) being generally approximate, 2) exact, and 3) providing a solution with a numerical accuracy that can be controlled.
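The Monte Carlo approach can be sketched in a few lines. The model (density as mass divided by volume), the Gaussian input distributions, and the number of draws are assumptions chosen for illustration; the sketch assumes independent inputs and makes no attempt at adaptive control of numerical accuracy.

```python
import random
import statistics


def monte_carlo(model, samplers, n, seed=0):
    """Propagate input distributions through a model by random sampling.

    `samplers` holds one function per input; each takes a random.Random
    instance and returns one draw of that input quantity.
    """
    rng = random.Random(seed)
    ys = [model(*(draw(rng) for draw in samplers)) for _ in range(n)]
    return statistics.fmean(ys), statistics.stdev(ys), ys


def density(mass, volume):
    """Hypothetical measurement function: density = mass / volume."""
    return mass / volume


# Assumed state of knowledge: independent Gaussian inputs.
samplers = [
    lambda rng: rng.gauss(250.0, 0.5),  # mass in g
    lambda rng: rng.gauss(100.0, 0.2),  # volume in cm^3
]

y_mc, u_mc, draws = monte_carlo(density, samplers, n=100_000)
print(f"estimate = {y_mc:.4f}, standard uncertainty = {u_mc:.5f}")
```

Increasing the number of draws reduces the sampling noise in the estimate and its standard uncertainty, which is the sense in which the numerical accuracy can be controlled. The sorted draws can also be used to read off a coverage interval.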

Models with any number of output quantities

When the measurement model is multivariate, that is, it has any number of output quantities, the above concepts can be extended. The output quantities are now described by a joint probability distribution, the coverage interval becomes a coverage region, the law of propagation of uncertainty has a natural generalization, and a calculation procedure that implements a multivariate Monte Carlo method is available.

Uncertainty as an interval

The most common view of measurement uncertainty uses random variables as mathematical models for uncertain quantities and simple probability distributions as sufficient for representing measurement uncertainties. In some situations, however, a mathematical interval might be a better model of uncertainty than a probability distribution. This may include situations involving periodic measurements, binned data values, censoring, detection limits, or plus-minus ranges of measurements where no particular probability distribution seems justified or where one cannot assume that the errors among individual measurements are completely independent.

A more robust representation of measurement uncertainty in such cases can be fashioned from intervals. An interval [a,b] is different from a rectangular or uniform probability distribution over the same range in that the latter suggests that the true value lies inside the right half of the range [(a + b)/2, b] with probability one half, and within any subinterval of [a,b] with probability equal to the width of the subinterval divided by b – a. The interval makes no such claims, except simply that the measurement lies somewhere within the interval. Distributions of such measurement intervals can be summarized as probability boxes and Dempster–Shafer structures over the real numbers, which incorporate both aleatoric and epistemic uncertainties.
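Basic interval arithmetic gives a feel for how interval-valued measurements combine without any distributional claim. The class below is a minimal sketch with a hypothetical name; it ignores floating-point outward rounding and does not implement probability boxes or Dempster–Shafer structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] asserting only that the value lies inside."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    @property
    def width(self):
        return self.hi - self.lo


# Hypothetical plus-minus readings: 1.5 ± 0.5 and 0.75 ± 0.25.
a = Interval(1.0, 2.0)
b = Interval(0.5, 1.0)
print(a + b, a - b, a * b)
```

Note that the width of a sum or difference is the sum of the widths, so interval bounds grow linearly, whereas standard uncertainties of independent quantities combine in quadrature.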

Reporting bias

From Wikipedia, the free encyclopedia

In epidemiology, reporting bias is defined as "selective revealing or suppression of information" by subjects (for example about past medical history, smoking, sexual experiences). In artificial intelligence research, the term reporting bias is used to refer to people's tendency to under-report all the information available.
 
In empirical research, the term may be used to refer to authors under-reporting unexpected or undesirable experimental results, attributing the results to sampling or measurement error, while being more trusting of expected or desirable results, though these may be subject to the same sources of error. In this context, reporting bias can eventually lead to a status quo where multiple investigators discover and discard the same results, and later experimenters justify their own reporting bias by observing that previous experimenters reported different results. Thus, each incident of reporting bias can make future incidents more likely.

Reporting biases in research

Research can only contribute to knowledge if it is communicated from investigators to the community. The generally accepted primary means of communication is “full” publication of the study methods and results in an article published in a scientific journal. Sometimes, investigators choose to present their findings at a scientific meeting as well, either through an oral or poster presentation. These presentations are included as part of the scientific record as brief “abstracts” which may or may not be recorded in publicly accessible documents typically found in libraries or the World Wide Web. 

Sometimes, investigators fail to publish the results of entire studies. The Declaration of Helsinki and other consensus documents have outlined the ethical obligation to make results from clinical research publicly available. 

Reporting bias occurs when the dissemination of research findings is influenced by the nature and direction of the results, for instance in systematic reviews. Positive results is a commonly used term to describe a study finding that one intervention is better than another. 

Various attempts have been made to overcome the effects of the reporting biases, including statistical adjustments to the results of published studies. None of these approaches has proved satisfactory, however, and there is increasing acceptance that reporting biases must be tackled by establishing registers of controlled trials and by promoting good publication practice. Until these problems have been addressed, estimates of the effects of treatments based on published evidence may be biased.

Case study

Litigation brought in 2004 by consumers and health insurers against Pfizer over fraudulent sales practices in the marketing of the drug gabapentin revealed a comprehensive publication strategy that employed elements of reporting bias. Spin was used to emphasize findings that favored gabapentin and to explain away findings unfavorable to the drug. In this case, favorable secondary outcomes became the focus over the original primary outcome, which was unfavorable. Other changes found in outcome reporting include the introduction of a new primary outcome, failure to distinguish between primary and secondary outcomes, and failure to report one or more protocol-defined primary outcomes.

The decision to publish certain findings in certain journals was another strategy. Trials with statistically significant findings were generally published in academic journals with higher circulation more often than trials with nonsignificant findings. The timing of publication was also influenced, in that the company tried to optimize the interval between the release of two studies. Trials with nonsignificant findings were published in a staggered fashion, so as not to have two consecutive trials published without salient findings. Ghost authorship was also an issue: professional medical writers who drafted the published reports were not properly acknowledged.

Fallout from this case was still being settled by Pfizer in 2014, 10 years after the initial litigation.

Types of reporting bias

Publication bias

The publication or nonpublication of research findings, depending on the nature and direction of the results. Although medical writers have acknowledged the problem of reporting biases for over a century, it was not until the second half of the 20th century that researchers began to investigate the sources and size of the problem of reporting biases.

Over the past two decades, evidence has accumulated that failure to publish research studies, including clinical trials testing intervention effectiveness, is pervasive. Almost all failure to publish is due to failure of the investigator to submit; only a small proportion of studies are not published because of rejection by journals.

The most direct evidence of publication bias in the medical field comes from follow-up studies of research projects identified at the time of funding or ethics approval. These studies have shown that “positive findings” is the principal factor associated with subsequent publication: researchers say that the reason they don’t write up and submit reports of their research for publication is usually because they are “not interested” in the results (editorial rejection by journals is a rare cause of failure to publish). 

Even those investigators who have initially published their results as conference abstracts are less likely to publish their findings in full unless the results are “significant”. This is a problem because data presented in abstracts are frequently preliminary or interim results and thus may not be reliable representations of what was found once all data were collected and analyzed. In addition, abstracts are often not accessible to the public through journals, MEDLINE, or easily accessed databases. Many are published in conference programs, conference proceedings, or on CD-ROM, and are made available only to meeting registrants. 

The main factor associated with failure to publish is negative or null findings. Controlled trials that are eventually reported in full are published more rapidly if their results are positive. Publication bias leads to overestimates of treatment effect in meta-analyses, which in turn can lead doctors and decision makers to believe a treatment is more useful than it is. 
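The overestimation mechanism can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the true effect, the common standard error, the rule that significant positive results are always published, and the publication probability for the rest. It is not a model of any real body of literature.

```python
import random
import statistics


def simulate_publication(n_studies, true_effect, se, p_publish_other, seed=0):
    """Simulate study estimates and a publication filter favouring positive results."""
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        significant_positive = estimate / se > 1.96
        # Assumed rule: significant positive results are always published,
        # all other results only with probability p_publish_other.
        if significant_positive or rng.random() < p_publish_other:
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_publication(
    n_studies=5_000, true_effect=0.1, se=0.1, p_publish_other=0.2
)
# With equal standard errors, the plain mean equals the inverse-variance pooled estimate.
print(f"true effect               = 0.100")
print(f"mean of all studies       = {statistics.fmean(all_estimates):.3f}")
print(f"mean of published studies = {statistics.fmean(published):.3f}")
```

Pooling only the published estimates yields a larger apparent effect than pooling all studies, even though every individual study is unbiased.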

It is now well-established that publication bias is associated with the source of funding for the study.

Time lag bias

The rapid or delayed publication of research findings, depending on the nature and direction of the results. In a systematic review of the literature, Hopewell and her colleagues found that overall, trials with “positive results” (statistically significant in favor of the experimental arm) were published about a year sooner than trials with “null or negative results” (not statistically significant or statistically significant in favor of the control arm).

Multiple (duplicate) publication bias

The multiple or singular publication of research findings, depending on the nature and direction of the results. Investigators may also publish the same findings multiple times using a variety of patterns of “duplicate” publication. Many duplicates are published in journal supplements, which can be difficult to access. Positive results appear to be published more often in duplicate, which can lead to overestimates of a treatment effect.

Location bias

The publication of research findings in journals with different ease of access or levels of indexing in standard databases, depending on the nature and direction of results. There is also evidence that, compared to negative or null results, statistically significant results are on average published in journals with greater impact factors, and that publication in the mainstream (non grey) literature is associated with an overall greater treatment effect compared to the grey literature.

Citation bias

The citation or non-citation of research findings, depending on the nature and direction of the results. Authors tend to cite positive results over negative or null results, and this has been established over a broad cross section of topics. Differential citation may lead to a perception in the community that an intervention is effective when it is not, and it may lead to over-representation of positive findings in systematic reviews if those left uncited are difficult to locate. 

Selective pooling of results in a meta-analysis is a form of citation bias that is particularly insidious in its potential to influence knowledge. To minimize bias, pooling of results from similar but separate studies requires an exhaustive search for all relevant studies. That is, a meta-analysis (or pooling of data from multiple studies) must always have emerged from a systematic review (not a selective review of the literature), even though a systematic review does not always have an associated meta-analysis.

Language bias

The publication of research findings in a particular language, depending on the nature and direction of the results. There is a longstanding question about whether there is a language bias such that investigators choose to publish their negative findings in non-English language journals and reserve their positive findings for English language journals. Some research has shown that language restrictions in systematic reviews can change the results of the review; in other cases, authors have not found that such a bias exists.

Knowledge reporting bias

The frequency with which people write about actions, outcomes, or properties is not a reflection of real-world frequencies or the degree to which a property is characteristic of a class of individuals. People write about only some parts of the world around them; much of the information is left unsaid.

Outcome reporting bias

The selective reporting of some outcomes but not others, depending on the nature and direction of the results. A study may be published in full, but pre-specified outcomes omitted or misrepresented. Efficacy outcomes that are statistically significant have a higher chance of being fully published compared to those that are not statistically significant. 

Selective reporting of suspected or confirmed adverse treatment effects is an area of particular concern because of the potential for patient harm. In a study of adverse drug events submitted to Scandinavian drug licensing authorities, reports for published studies were less likely than those for unpublished studies to record adverse events (for example, 56% vs 77%, respectively, for Finnish trials involving psychotropic drugs). Recent attention in the lay and scientific media on failure to accurately report adverse events for drugs (e.g., selective serotonin reuptake inhibitors, rosiglitazone, rofecoxib) has resulted in additional publications, too numerous to review, indicating substantial selective outcome reporting (mainly suppression) of known or suspected adverse events.

Meta-analysis

From Wikipedia, the free encyclopedia

Graphical summary of a meta-analysis of over 1,000 cases of diffuse intrinsic pontine glioma and other pediatric gliomas, in which information about the mutations involved as well as generic outcomes were distilled from the underlying primary literature.
 
A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analysis can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived. Existing methods for meta-analysis yield a weighted average from the results of the individual studies, and what differs is the manner in which these weights are allocated and also the manner in which the uncertainty is computed around the point estimate thus generated. In addition to providing an estimate of the unknown common truth, meta-analysis has the capacity to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies.
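One common way of allocating the weights is inverse-variance weighting under a fixed-effect assumption, in which each study is weighted by the reciprocal of its squared standard error. The sketch below uses that scheme as an assumed example; the study estimates and standard errors are invented, and other weighting schemes (for example random-effects models) would give different results.

```python
import math


def fixed_effect_pool(estimates, standard_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical effect estimates and standard errors from three studies.
estimates = [0.30, 0.10, 0.25]
standard_errors = [0.10, 0.20, 0.05]

pooled, pooled_se = fixed_effect_pool(estimates, standard_errors)
print(f"pooled estimate = {pooled:.4f}, standard error = {pooled_se:.4f}")
```

The pooled standard error is smaller than that of any single study, which reflects the gain in precision from aggregating information.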

A key benefit of this approach is the aggregation of information leading to a higher statistical power and more robust point estimate than is possible from the measure derived from any individual study. However, in performing a meta-analysis, an investigator must make choices which can affect the results, including deciding how to search for studies, selecting studies based on a set of objective criteria, dealing with incomplete data, analyzing the data, and accounting for or choosing not to account for publication bias.

Meta-analyses are often, but not always, important components of a systematic review procedure. For instance, a meta-analysis may be conducted on several clinical trials of a medical treatment, in an effort to obtain a better understanding of how well the treatment works. Here it is convenient to follow the terminology used by the Cochrane Collaboration, and use "meta-analysis" to refer to statistical methods of combining evidence, leaving other aspects of 'research synthesis' or 'evidence synthesis', such as combining information from qualitative studies, for the more general context of systematic reviews. A meta-analysis is a secondary source.

History

The historical roots of meta-analysis can be traced back to 17th century studies of astronomy. A paper published in 1904 by the statistician Karl Pearson in the British Medical Journal, which collated data from several studies of typhoid inoculation, is seen as the first time a meta-analytic approach was used to aggregate the outcomes of multiple clinical studies. The first meta-analysis of all conceptually identical experiments concerning a particular research issue, and conducted by independent researchers, has been identified as the 1940 book-length publication Extrasensory Perception After Sixty Years, authored by Duke University psychologists J. G. Pratt, J. B. Rhine, and associates. This encompassed a review of 145 reports on ESP experiments published from 1882 to 1939, and included an estimate of the influence of unpublished papers on the overall effect (the file-drawer problem). Although meta-analysis is widely used in epidemiology and evidence-based medicine today, a meta-analysis of a medical treatment was not published until 1955. In the 1970s, more sophisticated analytical techniques were introduced in educational research, starting with the work of Gene V. Glass, Frank L. Schmidt and John E. Hunter.

The term "meta-analysis" was coined in 1976 by the statistician Gene V. Glass, who stated "my major interest currently is in what we have come to call ...the meta-analysis of research. The term is a bit grand, but it is precise and apt ... Meta-analysis refers to the analysis of analyses". Although this led to him being widely recognized as the modern founder of the method, the methodology behind what he termed "meta-analysis" predates his work by several decades. The statistical theory surrounding meta-analysis was greatly advanced by the work of Nambury S. Raju, Larry V. Hedges, Harris Cooper, Ingram Olkin, John E. Hunter, Jacob Cohen, Thomas C. Chalmers, Robert Rosenthal, Frank L. Schmidt, and Douglas G. Bonett.

Advantages

Conceptually, a meta-analysis uses a statistical approach to combine the results from multiple studies in an effort to increase power (over individual studies), improve estimates of the size of the effect and/or to resolve uncertainty when reports disagree. A meta-analysis is a statistical overview of the results from one or more systematic reviews. Basically, it produces a weighted average of the included study results and this approach has several advantages:
  • Results can be generalized to a larger population
  • The precision and accuracy of estimates can be improved as more data is used. This, in turn, may increase the statistical power to detect an effect
  • Inconsistency of results across studies can be quantified and analyzed. For instance, inconsistency may arise from sampling error, or study results (partially) influenced by differences between study protocols
  • Hypothesis testing can be applied on summary estimates
  • Moderators can be included to explain variation between studies
  • The presence of publication bias can be investigated

Steps in a meta-analysis

A meta-analysis is usually preceded by a systematic review, as this allows identification and critical appraisal of all the relevant evidence (thereby limiting the risk of bias in summary estimates). The general steps are then as follows:
  1. Formulation of the research question, e.g. using the PICO model (Population, Intervention, Comparison, Outcome).
  2. Search of literature
  3. Selection of studies ('incorporation criteria')
    1. Based on quality criteria, e.g. the requirement of randomization and blinding in a clinical trial
    2. Selection of specific studies on a well-specified subject, e.g. the treatment of breast cancer.
    3. Decide whether unpublished studies are included to avoid publication bias (file drawer problem)
  4. Decide which dependent variables or summary measures are allowed. For instance, when considering a meta-analysis of published (aggregate) data:
    • Differences (discrete data)
    • Means (continuous data)
    • Hedges' g is a popular summary measure for continuous data that is standardized in order to eliminate scale differences, but it incorporates an index of variation between groups:
      1. $\delta = \frac{\mu_t - \mu_c}{\sigma}$, in which $\mu_t$ is the treatment mean, $\mu_c$ is the control mean, and $\sigma^2$ the pooled variance.
  5. Selection of a meta-analysis model, e.g. fixed effect or random effects meta-analysis.
  6. Examine sources of between-study heterogeneity, e.g. using subgroup analysis or meta-regression.
Formal guidance for the conduct and reporting of meta-analyses is provided by the Cochrane Handbook.
For reporting guidelines, see the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.[14]
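
As a rough illustration of the standardized summary measure mentioned in step 4, the sketch below divides the difference between the treatment and control means by the pooled standard deviation. The function names and the example numbers are hypothetical, and the small-sample correction factor is an assumption added here (it is the approximation commonly applied to obtain Hedges' g), not something stated in the text above.

```python
import math

def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """(treatment mean - control mean) / pooled standard deviation."""
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def small_sample_correction(n_t, n_c):
    """Approximate bias-correction factor (assumption, not from the text)."""
    return 1 - 3 / (4 * (n_t + n_c) - 9)

# Hypothetical study: two groups of 20, equal spread, two-point mean difference.
d = standardized_mean_difference(12.0, 10.0, 4.0, 4.0, 20, 20)
g = d * small_sample_correction(20, 20)
print(f"uncorrected d = {d:.3f}, corrected g = {g:.3f}")
```

Because the result is expressed in units of the pooled standard deviation, studies that measured the same outcome on different scales can be combined.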

Methods and assumptions

Approaches

In general, two types of evidence can be distinguished when performing a meta-analysis: individual participant data (IPD), and aggregate data (AD). The aggregate data can be direct or indirect. 

AD is more commonly available (e.g. from the literature) and typically represents summary estimates such as odds ratios or relative risks. This can be directly synthesized across conceptually similar studies using several approaches (see below). On the other hand, indirect aggregate data measures the effect of two treatments that were each compared against a similar control group in a meta-analysis. For example, if treatment A and treatment B were directly compared vs placebo in separate meta-analyses, we can use these two pooled results to get an estimate of the effects of A vs B in an indirect comparison as effect A vs Placebo minus effect B vs Placebo. 
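
The indirect comparison described above (effect of A vs placebo minus effect of B vs placebo) can be sketched as follows. This is a minimal example under stated assumptions: the effects are on an additive scale such as a log odds ratio or mean difference, the two pooled estimates are independent so their variances add, and the function name and input numbers are hypothetical.

```python
import math

def indirect_comparison(effect_a, se_a, effect_b, se_b, z=1.96):
    """Indirect A-vs-B estimate from two pooled comparisons against placebo."""
    diff = effect_a - effect_b
    # Variances add because the two pooled estimates are assumed independent.
    se = math.sqrt(se_a**2 + se_b**2)
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical pooled log odds ratios versus placebo.
diff, se, ci = indirect_comparison(-0.50, 0.12, -0.20, 0.16)
print(f"A vs B: {diff:.2f} (SE {se:.2f}), 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The indirect estimate is always less precise than either of the direct estimates it is built from, since its variance is the sum of theirs.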

IPD evidence represents raw data as collected by the study centers. This distinction has raised the need for different meta-analytic methods when evidence synthesis is desired, and has led to the development of one-stage and two-stage methods. In one-stage methods the IPD from all studies are modeled simultaneously whilst accounting for the clustering of participants within studies. Two-stage methods first compute summary statistics for AD from each study and then calculate overall statistics as a weighted average of the study statistics. By reducing IPD to AD, two-stage methods can also be applied when IPD is available; this makes them an appealing choice when performing a meta-analysis. Although it is conventionally believed that one-stage and two-stage methods yield similar results, recent studies have shown that they may occasionally lead to different conclusions.

Statistical models for aggregate data

Direct evidence: Models incorporating study effects only

Fixed effects model
The fixed effect model provides a weighted average of a series of study estimates. The inverse of the estimates' variance is commonly used as study weight, so that larger studies tend to contribute more than smaller studies to the weighted average. Consequently, when studies within a meta-analysis are dominated by a very large study, the findings from smaller studies are practically ignored. Most importantly, the fixed effects model assumes that all included studies investigate the same population, use the same variable and outcome definitions, etc. This assumption is typically unrealistic as research is often prone to several sources of heterogeneity; e.g. treatment effects may differ according to locale, dosage levels, study conditions, ...
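
The inverse-variance weighted average described above can be sketched in a few lines. The function name and the three study estimates are hypothetical; the first study has the smallest variance, standing in for a large study.

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance weighted average, its standard error, and relative weights."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    return pooled, se, [w / total for w in weights]

# Hypothetical effect sizes and their sampling variances.
effects = [0.10, 0.30, 0.50]
variances = [0.01, 0.04, 0.04]
pooled, se, rel_weights = fixed_effect(effects, variances)
print(f"pooled = {pooled:.3f}, SE = {se:.3f}")
print("relative weights:", [round(w, 3) for w in rel_weights])
```

In this example the most precise study receives two thirds of the total weight, which illustrates how a single large study can dominate the pooled estimate.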
Random effects model
A common model used to synthesize heterogeneous research is the random effects model of meta-analysis. This is simply the weighted average of the effect sizes of a group of studies. The weight that is applied in this process of weighted averaging with a random effects meta-analysis is achieved in two steps:
  1. Step 1: Inverse variance weighting
  2. Step 2: Un-weighting of this inverse variance weighting by applying a random effects variance component (REVC) that is simply derived from the extent of variability of the effect sizes of the underlying studies.
This means that the greater this variability in effect sizes (otherwise known as heterogeneity), the greater the un-weighting and this can reach a point when the random effects meta-analysis result becomes simply the un-weighted average effect size across the studies. At the other extreme, when all effect sizes are similar (or variability does not exceed sampling error), no REVC is applied and the random effects meta-analysis defaults to simply a fixed effect meta-analysis (only inverse variance weighting). 

The extent of this reversal is solely dependent on two factors:
  1. Heterogeneity of precision
  2. Heterogeneity of effect size
Since neither of these factors automatically indicates a faulty larger study or more reliable smaller studies, the re-distribution of weights under this model will not bear a relationship to what these studies actually might offer. Indeed, it has been demonstrated that redistribution of weights is simply in one direction from larger to smaller studies as heterogeneity increases until eventually all studies have equal weight and no more redistribution is possible. Another issue with the random effects model is that the most commonly used confidence intervals generally do not retain their coverage probability above the specified nominal level and thus substantially underestimate the statistical error and are potentially overconfident in their conclusions. Several fixes have been suggested but the debate continues. A further concern is that the average treatment effect can sometimes be even less conservative compared to the fixed effect model and therefore misleading in practice. One interpretational fix that has been suggested is to create a prediction interval around the random effects estimate to portray the range of possible effects in practice. However, an assumption behind the calculation of such a prediction interval is that trials are considered more or less homogeneous entities and that included patient populations and comparator treatments should be considered exchangeable, and this is usually unattainable in practice.

The most widely used method to estimate between studies variance (REVC) is the DerSimonian-Laird (DL) approach. Several advanced iterative (and computationally expensive) techniques for computing the between studies variance exist (such as maximum likelihood, profile likelihood and restricted maximum likelihood methods) and random effects models using these methods can be run in Stata with the metaan command. The metaan command must be distinguished from the classic metan (single "a") command in Stata that uses the DL estimator. These advanced methods have also been implemented in a free and easy to use Microsoft Excel add-on, MetaEasy. However, a comparison between these advanced methods and the DL method of computing the between studies variance demonstrated that there is little to gain and DL is quite adequate in most scenarios.

However, most meta-analyses include between 2 and 4 studies and such a sample is more often than not inadequate to accurately estimate heterogeneity. Thus it appears that in small meta-analyses, an incorrect zero between study variance estimate is obtained, leading to a false homogeneity assumption. Overall, it appears that heterogeneity is being consistently underestimated in meta-analyses and sensitivity analyses in which high heterogeneity levels are assumed could be informative. These random effects models and software packages mentioned above relate to study-aggregate meta-analyses and researchers wishing to conduct individual patient data (IPD) meta-analyses need to consider mixed-effects modelling approaches.
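
The two-step weighting of the random effects model, with the between-study variance estimated by the DerSimonian-Laird method of moments, can be sketched as below. This is a minimal illustration rather than a reference implementation: the function name and data are hypothetical, and the estimate is truncated at zero, which is the usual convention.

```python
import math

def dersimonian_laird(effects, variances):
    """Random effects pooled estimate using the DL between-study variance."""
    k = len(effects)
    w = [1.0 / v for v in variances]              # step 1: inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)            # between-study variance (REVC)
    w_re = [1.0 / (v + tau2) for v in variances]  # step 2: "un-weighting"
    sw_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sw_re
    return pooled, math.sqrt(1.0 / sw_re), tau2

# Hypothetical heterogeneous studies.
variances = [0.01, 0.04, 0.04]
pooled, se, tau2 = dersimonian_laird([0.10, 0.30, 0.50], variances)
print(f"tau^2 = {tau2:.3f}, pooled = {pooled:.3f}, SE = {se:.3f}")

# With identical effect sizes no variance component is added.
_, _, tau2_homogeneous = dersimonian_laird([0.20, 0.20, 0.20], variances)
print(f"tau^2 for identical effects = {tau2_homogeneous:.3f}")
```

With these inputs the fixed effect estimate would be 0.20, whereas the random effects estimate moves to 0.25, closer to the unweighted mean of 0.30, because adding the same variance component to every study flattens the weights. With only three studies the variance component is itself poorly estimated, which is the concern raised above for small meta-analyses.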
IVhet model
Doi & Barendregt, working in collaboration with Khan, Thalib and Williams (from the University of Queensland, University of Southern Queensland and Kuwait University), have created an inverse variance quasi likelihood based alternative (IVhet) to the random effects (RE) model for which details are available online. This was incorporated into MetaXL version 2.0, a free Microsoft Excel add-in for meta-analysis produced by Epigear International Pty Ltd, and made available on 5 April 2014. The authors state that a clear advantage of this model is that it resolves the two main problems of the random effects model. The first advantage of the IVhet model is that coverage remains at the nominal (usually 95%) level for the confidence interval, unlike the random effects model, which drops in coverage with increasing heterogeneity. The second advantage is that the IVhet model maintains the inverse variance weights of individual studies, unlike the RE model, which gives small studies more weight (and therefore larger studies less) with increasing heterogeneity. When heterogeneity becomes large, the individual study weights under the RE model become equal and thus the RE model returns an arithmetic mean rather than a weighted average. This side-effect of the RE model does not occur with the IVhet model, which thus differs from the RE model estimate in two respects: pooled estimates will favor larger trials (as opposed to penalizing larger trials in the RE model) and will have a confidence interval that remains within the nominal coverage under uncertainty (heterogeneity). Doi & Barendregt suggest that while the RE model provides an alternative method of pooling the study data, their simulation results demonstrate that using a more specified probability model with untenable assumptions, as with the RE model, does not necessarily provide better results. 
The latter study also reports that the IVhet model resolves the problems related to underestimation of the statistical error, poor coverage of the confidence interval and increased MSE seen with the random effects model, and the authors conclude that researchers should henceforth abandon use of the random effects model in meta-analysis. While their data is compelling, the ramifications (in terms of the magnitude of spuriously positive results within the Cochrane database) are huge and thus accepting this conclusion requires careful independent confirmation. The availability of free software (MetaXL) that runs the IVhet model (and all other models for comparison) facilitates this for the research community.

Direct evidence: Models incorporating additional information

Quality effects model
Doi and Thalib originally introduced the quality effects model. They introduced a new approach to adjustment for inter-study variability by incorporating the contribution of variance due to a relevant component (quality) in addition to the contribution of variance due to random error that is used in any fixed effects meta-analysis model to generate weights for each study. The strength of the quality effects meta-analysis is that it allows available methodological evidence to be used over subjective random effects, and thereby helps to close the damaging gap which has opened up between methodology and statistics in clinical research. To do this a synthetic bias variance is computed based on quality information to adjust inverse variance weights and the quality adjusted weight of the ith study is introduced. These adjusted weights are then used in meta-analysis. In other words, if study i is of good quality and other studies are of poor quality, a proportion of their quality adjusted weights is mathematically redistributed to study i, giving it more weight towards the overall effect size. As studies become increasingly similar in terms of quality, re-distribution becomes progressively less and ceases when all studies are of equal quality (in the case of equal quality, the quality effects model defaults to the IVhet model – see previous section). A recent evaluation of the quality effects model (with some updates) demonstrates that despite the subjectivity of quality assessment, the performance (MSE and true variance under simulation) is superior to that achievable with the random effects model. This model thus replaces the untenable interpretations that abound in the literature, and software is available to explore this method further.

Indirect evidence: Network meta-analysis methods

A network meta-analysis looks at indirect comparisons. In the image, A has been analyzed in relation to C and C has been analyzed in relation to B. However, the relation between A and B is only known indirectly, and a network meta-analysis looks at such indirect evidence of differences between methods and interventions using statistical methods.
 
Indirect comparison meta-analysis methods (also called network meta-analyses, in particular when multiple treatments are assessed simultaneously) generally use two main methodologies. The first is the Bucher method, which is a single or repeated comparison of a closed loop of three treatments such that one of them is common to the two studies and forms the node where the loop begins and ends. Therefore, multiple two-by-two comparisons (3-treatment loops) are needed to compare multiple treatments. This methodology requires that trials with more than two arms have only two arms selected, as independent pair-wise comparisons are required. The alternative methodology uses complex statistical modelling to include the multiple arm trials and comparisons simultaneously between all competing treatments. These have been executed using Bayesian methods, mixed linear models and meta-regression approaches.
Bayesian framework
Specifying a Bayesian network meta-analysis model involves writing a directed acyclic graph (DAG) model for general-purpose Markov chain Monte Carlo (MCMC) software such as WinBUGS. In addition, prior distributions have to be specified for a number of the parameters, and the data have to be supplied in a specific format. Together, the DAG, priors, and data form a Bayesian hierarchical model. To complicate matters further, because of the nature of MCMC estimation, overdispersed starting values have to be chosen for a number of independent chains so that convergence can be assessed. Currently, there is no software that automatically generates such models, although there are some tools to aid in the process. The complexity of the Bayesian approach has limited usage of this methodology. Methodology for automation of this method has been suggested but requires that arm-level outcome data are available, and this is usually unavailable. Great claims are sometimes made for the inherent ability of the Bayesian framework to handle network meta-analysis and its greater flexibility. However, this choice of implementation of framework for inference, Bayesian or frequentist, may be less important than other choices regarding the modeling of effects (see discussion on models above).
Frequentist multivariate framework
On the other hand, the frequentist multivariate methods involve approximations and assumptions that are not stated explicitly or verified when the methods are applied (see discussion on meta-analysis models above). For example, the mvmeta package for Stata enables network meta-analysis in a frequentist framework. However, if there is no common comparator in the network, then this has to be handled by augmenting the dataset with fictional arms with high variance, which is not very objective and requires a decision as to what constitutes a sufficiently high variance. The other issue is use of the random effects model in both this frequentist framework and the Bayesian framework. Senn advises analysts to be cautious about interpreting the 'random effects' analysis since only one random effect is allowed for but one could envisage many. Senn goes on to say that it is rather naïve, even in the case where only two treatments are being compared, to assume that random-effects analysis accounts for all uncertainty about the way effects can vary from trial to trial. Newer models of meta-analysis such as those discussed above would certainly help alleviate this situation and have been implemented in the next framework.
Generalized pairwise modelling framework
An approach that has been tried since the late 1990s is the implementation of the multiple three-treatment closed-loop analysis. This has not been popular because the process rapidly becomes overwhelming as network complexity increases. Development in this area was then abandoned in favor of the Bayesian and multivariate frequentist methods which emerged as alternatives. Very recently, automation of the three-treatment closed loop method has been developed for complex networks by some researchers as a way to make this methodology available to the mainstream research community. This proposal does restrict each trial to two interventions, but also introduces a workaround for multiple arm trials: a different fixed control node can be selected in different runs. It also utilizes robust meta-analysis methods so that many of the problems highlighted above are avoided. Further research around this framework is required to determine if this is indeed superior to the Bayesian or multivariate frequentist frameworks. Researchers willing to try this out have access to this framework through free software.
Tailored meta-analysis
Another form of additional information comes from the intended setting. If the target setting for applying the meta-analysis results is known then it may be possible to use data from the setting to tailor the results thus producing a ‘tailored meta-analysis’. This has been used in test accuracy meta-analyses, where empirical knowledge of the test positive rate and the prevalence have been used to derive a region in Receiver Operating Characteristic (ROC) space known as an ‘applicable region’. Studies are then selected for the target setting based on comparison with this region and aggregated to produce a summary estimate which is tailored to the target setting.

Validation of meta-analysis results

The meta-analysis estimate represents a weighted average across studies and when there is heterogeneity this may result in the summary estimate not being representative of individual studies. Qualitative appraisal of the primary studies using established tools can uncover potential biases, but does not quantify the aggregate effect of these biases on the summary estimate. Although the meta-analysis result could be compared with an independent prospective primary study, such external validation is often impractical. This has led to the development of methods that exploit a form of leave-one-out cross validation, sometimes referred to as internal-external cross validation (IOCV). Here each of the k included studies in turn is omitted and compared with the summary estimate derived from aggregating the remaining k − 1 studies. A general validation statistic, Vn, based on IOCV, has been developed to measure the statistical validity of meta-analysis results. For test accuracy and prediction, particularly when there are multivariate effects, other approaches which seek to estimate the prediction error have also been proposed.
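
The leave-one-out procedure described above can be sketched as a simple loop. This is only an illustration of the omit-and-compare idea, not the Vn statistic itself: the pooling uses plain inverse-variance weights, the standardized difference is an assumed comparison measure, and all names and numbers are hypothetical.

```python
import math

def inverse_variance_pool(effects, variances):
    """Inverse-variance weighted mean and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, effects)) / total, 1.0 / total

def leave_one_out(effects, variances):
    """Omit each study in turn and compare it with the pooled remainder."""
    results = []
    for i in range(len(effects)):
        rest_y = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        pooled, pooled_var = inverse_variance_pool(rest_y, rest_v)
        # Standardized difference; the omitted study is independent of the rest.
        z = (effects[i] - pooled) / math.sqrt(variances[i] + pooled_var)
        results.append((i, pooled, z))
    return results

# Hypothetical effect sizes and sampling variances.
results = leave_one_out([0.10, 0.30, 0.50], [0.01, 0.04, 0.04])
for i, pooled, z in results:
    print(f"omit study {i}: pooled rest = {pooled:.3f}, z = {z:+.2f}")
```

Large standardized differences flag studies that the summary of the remaining studies predicts poorly.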

Challenges

A meta-analysis of several small studies does not always predict the results of a single large study. Some have argued that a weakness of the method is that sources of bias are not controlled by the method: a good meta-analysis cannot correct for poor design or bias in the original studies. This would mean that only methodologically sound studies should be included in a meta-analysis, a practice called 'best evidence synthesis'. Other meta-analysts would include weaker studies, and add a study-level predictor variable that reflects the methodological quality of the studies to examine the effect of study quality on the effect size. However, others have argued that a better approach is to preserve information about the variance in the study sample, casting as wide a net as possible, and that methodological selection criteria introduce unwanted subjectivity, defeating the purpose of the approach.

Publication bias: the file drawer problem

A funnel plot expected without the file drawer problem. The largest studies converge at the tip while smaller studies show more or less symmetrical scatter at the base
 
A funnel plot expected with the file drawer problem. The largest studies still cluster around the tip, but the bias against publishing negative studies has caused the smaller studies as a whole to have an unjustifiably favorable result to the hypothesis
 
Another potential pitfall is the reliance on the available body of published studies, which may create exaggerated outcomes due to publication bias, as studies which show negative results or insignificant results are less likely to be published. For example, pharmaceutical companies have been known to hide negative studies and researchers may have overlooked unpublished studies such as dissertation studies or conference abstracts that did not reach publication. This is not easily solved, as one cannot know how many studies have gone unreported.

This file drawer problem (characterized by negative or non-significant results being tucked away in a cabinet), can result in a biased distribution of effect sizes thus creating a serious base rate fallacy, in which the significance of the published studies is overestimated, as other studies were either not submitted for publication or were rejected. This should be seriously considered when interpreting the outcomes of a meta-analysis.

The distribution of effect sizes can be visualized with a funnel plot which (in its most common version) is a scatter plot of standard error versus the effect size. It makes use of the fact that the smaller studies (thus larger standard errors) have more scatter of the magnitude of effect (being less precise) while the larger studies have less scatter and form the tip of the funnel. If many negative studies were not published, the remaining positive studies give rise to a funnel plot in which the base is skewed to one side (asymmetry of the funnel plot). In contrast, when there is no publication bias, the effect of the smaller studies has no reason to be skewed to one side and so a symmetric funnel plot results. This also means that if no publication bias is present, there would be no relationship between standard error and effect size. A negative or positive relation between standard error and effect size would imply that smaller studies that found effects in one direction only were more likely to be published and/or to be submitted for publication.

Apart from the visual funnel plot, statistical methods for detecting publication bias have also been proposed. These are controversial because they typically have low power for detection of bias, but also may make false positives under some circumstances. For instance small study effects (biased smaller studies), wherein methodological differences between smaller and larger studies exist, may cause asymmetry in effect sizes that resembles publication bias. However, small study effects may be just as problematic for the interpretation of meta-analyses, and the imperative is on meta-analytic authors to investigate potential sources of bias. 

A Tandem Method for analyzing publication bias has been suggested for cutting down false positive error problems. This Tandem method consists of three stages. Firstly, one calculates Orwin's fail-safe N, to check how many studies should be added in order to reduce the test statistic to a trivial size. If this number of studies is larger than the number of studies used in the meta-analysis, it is a sign that there is no publication bias, as in that case, one needs a lot of studies to reduce the effect size. Secondly, one can do an Egger's regression test, which tests whether the funnel plot is symmetrical. As mentioned before: a symmetrical funnel plot is a sign that there is no publication bias, as the effect size and sample size are not dependent. Thirdly, one can do the trim-and-fill method, which imputes data if the funnel plot is asymmetrical. 
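
The first two stages above can be sketched as follows. These are minimal illustrations under stated assumptions: Orwin's fail-safe N is computed on the mean effect size with a user-chosen "trivial" criterion and unpublished studies assumed to average zero, and Egger's regression is shown as an ordinary least-squares fit of the standardized effect on precision, returning only the coefficients. A formal test would compare the intercept with its standard error using a t distribution with n − 2 degrees of freedom. Function names and data are hypothetical.

```python
def orwin_fail_safe_n(k, mean_effect, trivial_effect, missing_effect=0.0):
    """Number of additional studies needed to pull the mean effect down to trivial_effect."""
    return k * (mean_effect - trivial_effect) / (trivial_effect - missing_effect)

def egger_regression(effects, std_errors):
    """OLS of (effect / SE) on (1 / SE); an intercept far from 0 suggests asymmetry."""
    x = [1.0 / s for s in std_errors]
    y = [e / s for e, s in zip(effects, std_errors)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Ten hypothetical studies with mean effect 0.5 and a trivial threshold of 0.1.
n_needed = orwin_fail_safe_n(10, 0.5, 0.1)
print(f"fail-safe N = {n_needed:.0f}")

# Symmetric case: the effect does not depend on study precision.
sym_intercept, sym_slope = egger_regression([0.3, 0.3, 0.3], [0.1, 0.2, 0.4])
# Asymmetric case: less precise studies report larger effects.
asym_intercept, _ = egger_regression([0.1, 0.3, 0.6], [0.1, 0.2, 0.4])
print(f"intercepts: symmetric {sym_intercept:.3f}, asymmetric {asym_intercept:.3f}")
```

In the symmetric case the slope recovers the common effect and the intercept is zero; in the asymmetric case the intercept moves away from zero because the effect size grows with the standard error.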

The problem of publication bias is not trivial as it is suggested that 25% of meta-analyses in the psychological sciences may have suffered from publication bias. However, low power of existing tests and problems with the visual appearance of the funnel plot remain an issue, and estimates of publication bias may remain lower than what truly exists. 

Most discussions of publication bias focus on journal practices favoring publication of statistically significant findings. However, questionable research practices, such as reworking statistical models until significance is achieved, may also favor statistically significant findings in support of researchers' hypotheses.

Problems related to studies not reporting non-statistically significant effects

Studies often do not report the effects when they do not reach statistical significance. For example, they may simply say that the groups did not show statistically significant differences, without reporting any other information (e.g. a statistic or p-value). Exclusion of these studies would lead to a situation similar to publication bias, but their inclusion (assuming null effects) would also bias the meta-analysis. MetaNSUE, a new method created by Joaquim Radua, has been shown to allow researchers to include these studies without bias.

Problems related to the statistical approach

Other weaknesses are that it has not been determined if the statistically most accurate method for combining results is the fixed, IVhet, random or quality effect models, though the criticism against the random effects model is mounting because of the perception that the new random effects (used in meta-analysis) are essentially formal devices to facilitate smoothing or shrinkage and prediction may be impossible or ill-advised. The main problem with the random effects approach is that it uses the classic statistical thought of generating a "compromise estimator" that makes the weights close to the naturally weighted estimator if heterogeneity across studies is large but close to the inverse variance weighted estimator if the between study heterogeneity is small. However, what has been ignored is the distinction between the model we choose to analyze a given dataset, and the mechanism by which the data came into being. A random effect can be present in either of these roles, but the two roles are quite distinct. There is no reason to think the analysis model and data-generation mechanism (model) are similar in form, but many sub-fields of statistics have developed the habit of assuming, for theory and simulations, that the data-generation mechanism (model) is identical to the analysis model we choose (or would like others to choose). As a hypothesized mechanism for producing the data, the random effect model for meta-analysis is silly and it is more appropriate to think of this model as a superficial description and something we choose as an analytical tool – but this choice for meta-analysis may not work because the study effects are a fixed feature of the respective meta-analysis and the probability distribution is only a descriptive tool.

Problems arising from agenda-driven bias

The most severe fault in meta-analysis often occurs when the person or persons doing the meta-analysis have an economic, social, or political agenda such as the passage or defeat of legislation. People with these types of agendas may be more likely to abuse meta-analysis due to personal bias. For example, researchers favorable to the author's agenda are likely to have their studies cherry-picked while those not favorable will be ignored or labeled as "not credible". In addition, the favored authors may themselves be biased or paid to produce results that support their overall political, social, or economic goals in ways such as selecting small favorable data sets and not incorporating larger unfavorable data sets. The influence of such biases on the results of a meta-analysis is possible because the methodology of meta-analysis is highly malleable.

A 2011 study done to disclose possible conflicts of interests in underlying research studies used for medical meta-analyses reviewed 29 meta-analyses and found that conflicts of interests in the studies underlying the meta-analyses were rarely disclosed. The 29 meta-analyses included 11 from general medicine journals, 15 from specialty medicine journals, and three from the Cochrane Database of Systematic Reviews. The 29 meta-analyses reviewed a total of 509 randomized controlled trials (RCTs). Of these, 318 RCTs reported funding sources, with 219 (69%) receiving funding from industry. Of the 509 RCTs, 132 reported author conflict of interest disclosures, with 91 studies (69%) disclosing one or more authors having financial ties to industry. The information was, however, seldom reflected in the meta-analyses. Only two (7%) reported RCT funding sources and none reported RCT author-industry ties. The authors concluded "without acknowledgment of COI due to industry funding or author industry financial ties from RCTs included in meta-analyses, readers' understanding and appraisal of the evidence from the meta-analysis may be compromised."

For example, in 1998, a US federal judge found that the United States Environmental Protection Agency had abused the meta-analysis process to produce a study claiming cancer risks to non-smokers from environmental tobacco smoke (ETS) with the intent to influence policy makers to pass smoke-free–workplace laws. The judge found that:
EPA's study selection is disturbing. First, there is evidence in the record supporting the accusation that EPA "cherry picked" its data. Without criteria for pooling studies into a meta-analysis, the court cannot determine whether the exclusion of studies likely to disprove EPA's a priori hypothesis was coincidence or intentional. Second, EPA's excluding nearly half of the available studies directly conflicts with EPA's purported purpose for analyzing the epidemiological studies and conflicts with EPA's Risk Assessment Guidelines. See ETS Risk Assessment at 4-29 ("These data should also be examined in the interest of weighing all the available evidence, as recommended by EPA's carcinogen risk assessment guidelines (U.S. EPA, 1986a) (emphasis added)). Third, EPA's selective use of data conflicts with the Radon Research Act. The Act states EPA's program shall "gather data and information on all aspects of indoor air quality" (Radon Research Act § 403(a)(1)) (emphasis added).
As a result of the abuse, the court vacated Chapters 1–6 of, and the Appendices to, EPA's "Respiratory Health Effects of Passive Smoking: Lung Cancer and other Disorders".

Applications in modern science

Modern statistical meta-analysis does more than just combine the effect sizes of a set of studies using a weighted average. It can test if the outcomes of studies show more variation than the variation that is expected because of the sampling of different numbers of research participants. Additionally, study characteristics such as measurement instrument used, population sampled, or aspects of the studies' design can be coded and used to reduce variance of the estimator (see statistical models above). Thus some methodological weaknesses in studies can be corrected statistically. Other uses of meta-analytic methods include the development and validation of clinical prediction models, where meta-analysis may be used to combine individual participant data from different research centers and to assess the model's generalisability, or even to aggregate existing prediction models.
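One widely used way to check whether study outcomes vary more than sampling error alone would explain is Cochran's Q statistic, often reported together with the I² index. The text above does not name a specific test, so treat the following as a minimal sketch of one common choice. The function name and the example numbers are hypothetical, and each study is assumed to supply an effect estimate and its variance.

```python
def heterogeneity(effects, variances):
    """Return (Q, df, I2) for a set of study effect estimates.

    Q is the weighted sum of squared deviations from the
    inverse-variance pooled estimate; under homogeneity it is
    approximately chi-square distributed with df = k - 1.
    I2 is the share of total variation attributed to heterogeneity.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q is smaller than its expected value.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Hypothetical example: two equally precise studies that disagree.
q, df, i2 = heterogeneity([0.0, 2.0], [0.25, 0.25])
print(q, df, i2)
```

A Q value well above its degrees of freedom suggests more between-study variation than chance would produce. Turning Q into a p-value requires a chi-square distribution function, which is left out here to keep the sketch dependency-free.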

Meta-analysis can be done with single-subject designs as well as group research designs. This is important because much research has been done with single-subject research designs. Considerable dispute exists over the most appropriate meta-analytic technique for single-subject research.

Meta-analysis leads to a shift of emphasis from single studies to multiple studies. It emphasizes the practical importance of the effect size instead of the statistical significance of individual studies. This shift in thinking has been termed "meta-analytic thinking". The results of a meta-analysis are often shown in a forest plot.

Results from studies are combined using different approaches. One approach frequently used in meta-analysis in health care research is termed the 'inverse variance method'. The average effect size across all studies is computed as a weighted mean, whereby the weights are equal to the inverse variance of each study's effect estimator. Larger studies and studies with less random variation are given greater weight than smaller studies. Other common approaches include the Mantel–Haenszel method and the Peto method.
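The weighted mean described above can be sketched in a few lines. This is an illustrative fixed-effect version only. The function name and the example numbers are hypothetical, and each study is assumed to report an effect estimate together with its variance.

```python
def inverse_variance_pool(effects, variances):
    """Pool study effects with weights equal to 1 / variance.

    Returns the pooled effect and the variance of the pooled effect
    (1 / sum of weights) under a fixed-effect assumption.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("effects and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_variance = 1.0 / total_weight
    return pooled, pooled_variance


# Hypothetical example: the second study is four times as precise,
# so it pulls the pooled estimate toward its own value.
pooled, pooled_variance = inverse_variance_pool([0.0, 1.0], [1.0, 0.25])
print(pooled, pooled_variance)
```

Because a study's variance typically shrinks as its sample size grows, larger studies automatically receive larger weights, which matches the behaviour the paragraph describes.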

Seed-based d mapping (formerly signed differential mapping, SDM) is a statistical technique for meta-analyzing studies of differences in brain activity or structure that used neuroimaging techniques such as fMRI, VBM or PET.

Different high-throughput techniques such as microarrays have been used to understand gene expression. MicroRNA expression profiles have been used to identify differentially expressed microRNAs in a particular cell or tissue type or disease condition, or to check the effect of a treatment. A meta-analysis of such expression profiles was performed to derive novel conclusions and to validate the known findings.

Operator (computer programming)

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Operator_(computer_programmin...