
Friday, November 8, 2024

Robust statistics

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Robust_statistics

Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

Introduction

Robust statistics seek to provide methods that emulate popular statistical methods, but are not unduly affected by outliers or other small departures from model assumptions. In statistics, classical estimation methods rely heavily on assumptions that are often not met in practice. In particular, it is often assumed that the data errors are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. Unfortunately, when there are outliers in the data, classical estimators often have very poor performance, when judged using the breakdown point and the influence function described below.

The practical effect of problems seen in the influence function can be studied empirically by examining the sampling distribution of proposed estimators under a mixture model, where one mixes in a small amount (1–5% is often sufficient) of contamination. For instance, one may use a mixture of 95% a normal distribution, and 5% a normal distribution with the same mean but significantly higher standard deviation (representing outliers).

Robust parametric statistics can proceed in two ways:

  • by designing estimators so that a pre-selected behaviour of the influence function is achieved
  • by replacing estimators that are optimal under the assumption of a normal distribution with estimators that are optimal for, or at least derived for, other distributions; for example, using the t-distribution with low degrees of freedom (high kurtosis) or with a mixture of two or more distributions.

Robust estimates have been studied for problems such as estimating location, scale, and regression parameters.

Definition

There are various definitions of a "robust statistic". Strictly speaking, a robust statistic is resistant to errors in the results, produced by deviations from assumptions (e.g., of normality). This means that if the assumptions are only approximately met, the robust estimator will still have a reasonable efficiency, and reasonably small bias, as well as being asymptotically unbiased, meaning having a bias tending towards 0 as the sample size tends towards infinity.

Usually, the most important case is distributional robustness - robustness to breaking of the assumptions about the underlying distribution of the data. Classical statistical procedures are typically sensitive to "longtailedness" (e.g., when the distribution of the data has longer tails than the assumed normal distribution). This implies that they will be strongly affected by the presence of outliers in the data, and the estimates they produce may be heavily distorted if there are extreme outliers in the data, compared to what they would be if the outliers were not included in the data.

By contrast, more robust estimators that are not so sensitive to distributional distortions such as longtailedness are also resistant to the presence of outliers. Thus, in the context of robust statistics, distributionally robust and outlier-resistant are effectively synonymous. For one perspective on research in robust statistics up to 2000, see Portnoy & He (2000).

Some experts prefer the term resistant statistics for distributional robustness, and reserve 'robustness' for non-distributional robustness, e.g., robustness to violation of assumptions about the probability model or estimator, but this is a minority usage. Plain 'robustness' to mean 'distributional robustness' is common.

When considering how robust an estimator is to the presence of outliers, it is useful to test what happens when an extreme outlier is added to the dataset, and to test what happens when an extreme outlier replaces one of the existing data points, and then to consider the effect of multiple additions or replacements.

Examples

The mean is not a robust measure of central tendency. If the dataset is, e.g., the values {2,3,5,6,9}, then if we add another datapoint with value -1000 or +1000 to the data, the resulting mean will be very different from the mean of the original data. Similarly, if we replace one of the values with a datapoint of value -1000 or +1000 then the resulting mean will be very different from the mean of the original data.

The median is a robust measure of central tendency. Taking the same dataset {2,3,5,6,9}, if we add another datapoint with value -1000 or +1000 then the median will change slightly, but it will still be similar to the median of the original data. If we replace one of the values with a data point of value -1000 or +1000 then the resulting median will still be similar to the median of the original data.

Described in terms of breakdown points, the median has a breakdown point of 50%, meaning that half the points must be outliers before the median can be moved outside the range of the non-outliers, while the mean has a breakdown point of 0, as a single large observation can throw it off.
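
These claims are easy to check with a minimal sketch (Python standard library only) using the dataset {2,3,5,6,9} from the text:

```python
from statistics import mean, median

# Original dataset from the text.
data = [2, 3, 5, 6, 9]
print(mean(data), median(data))      # 5 5

# Adding one extreme point drags the mean far from the bulk of the data...
contaminated = data + [1000]
print(mean(contaminated))            # 1025/6 ≈ 170.8
# ...but barely moves the median (breakdown point 50% vs 0%).
print(median(contaminated))          # 5.5
```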

The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not.

Trimmed estimators and Winsorised estimators are general methods to make statistics more robust. L-estimators are a general class of simple statistics, often robust, while M-estimators are a general class of robust statistics, and are now the preferred solution, though they can be quite involved to calculate.

Speed-of-light data

Gelman et al. in Bayesian Data Analysis (2004) consider a data set relating to speed-of-light measurements made by Simon Newcomb. The data sets for that book can be found via the Classic data sets page, and the book's website contains more information on the data.

Although the bulk of the data looks to be more or less normally distributed, there are two obvious outliers. These outliers have a large effect on the mean, dragging it towards them, and away from the center of the bulk of the data. Thus, if the mean is intended as a measure of the location of the center of the data, it is, in a sense, biased when outliers are present.

Also, the distribution of the mean is known to be asymptotically normal due to the central limit theorem. However, outliers can make the distribution of the mean non-normal, even for fairly large data sets. Besides this non-normality, the mean is also inefficient in the presence of outliers and less variable measures of location are available.

Estimation of location

The plot below shows a density plot of the speed-of-light data, together with a rug plot (panel (a)). Also shown is a normal Q–Q plot (panel (b)). The outliers are visible in these plots.

Panels (c) and (d) of the plot show the bootstrap distribution of the mean (c) and the 10% trimmed mean (d). The trimmed mean is a simple, robust estimator of location that deletes a certain percentage of observations (10% here) from each end of the data, then computes the mean in the usual way. The analysis was performed in R and 10,000 bootstrap samples were used for each of the raw and trimmed means.

The distribution of the mean is clearly much wider than that of the 10% trimmed mean (the plots are on the same scale). Also whereas the distribution of the trimmed mean appears to be close to normal, the distribution of the raw mean is quite skewed to the left. So, in this sample of 66 observations, only 2 outliers cause the central limit theorem to be inapplicable.
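
The bootstrap comparison can be sketched with the standard library alone. The data below are synthetic stand-ins (roughly normal values plus two invented outliers), not Newcomb's actual measurements; the 10% trimming follows the description above:

```python
import random
from statistics import mean, pstdev

def trimmed_mean(xs, prop=0.1):
    """Drop prop of the observations from each end, then average the rest."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return mean(xs[k:len(xs) - k])

random.seed(0)
# Synthetic stand-in for the speed-of-light data: 64 roughly normal
# points plus two gross low outliers (illustrative, not the real data).
data = [random.gauss(27, 5) for _ in range(64)] + [-44, -2]

boot_means, boot_tmeans = [], []
for _ in range(2000):
    resample = random.choices(data, k=len(data))
    boot_means.append(mean(resample))
    boot_tmeans.append(trimmed_mean(resample))

# The bootstrap spread of the raw mean exceeds that of the trimmed mean,
# mirroring panels (c) and (d) described above.
print(pstdev(boot_means), pstdev(boot_tmeans))
```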

Robust statistical methods, of which the trimmed mean is a simple example, seek to outperform classical statistical methods in the presence of outliers, or, more generally, when underlying parametric assumptions are not quite correct.

Whilst the trimmed mean performs well relative to the mean in this example, better robust estimates are available. In fact, the mean, median and trimmed mean are all special cases of M-estimators. Details appear in the sections below.

Estimation of scale

The outliers in the speed-of-light data have more than just an adverse effect on the mean; the usual estimate of scale is the standard deviation, and this quantity is even more badly affected by outliers because the squares of the deviations from the mean go into the calculation, so the outliers' effects are exacerbated.

The plots below show the bootstrap distributions of the standard deviation, the median absolute deviation (MAD) and the Rousseeuw–Croux (Qn) estimator of scale. The plots are based on 10,000 bootstrap samples for each estimator, with some Gaussian noise added to the resampled data (smoothed bootstrap). Panel (a) shows the distribution of the standard deviation, (b) of the MAD and (c) of Qn.

The distribution of standard deviation is erratic and wide, a result of the outliers. The MAD is better behaved, and Qn is a little bit more efficient than MAD. This simple example demonstrates that when outliers are present, the standard deviation cannot be recommended as an estimate of scale.
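
A minimal MAD implementation illustrates the contrast; the 1.4826 consistency factor (which makes the MAD comparable to the standard deviation for normal data) is a standard convention, and the example values are invented for illustration:

```python
from statistics import median, stdev

def mad(xs, scale=1.4826):
    """Median absolute deviation; the 1.4826 factor makes it a
    consistent estimator of the standard deviation for normal data."""
    m = median(xs)
    return scale * median(abs(x - m) for x in xs)

data = [27, 26, 28, 25, 29, 27, 26, -44]   # one gross outlier
print(stdev(data))   # ≈ 25.1: dominated by the single outlier
print(mad(data))     # ≈ 1.48: barely affected
```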

Manual screening for outliers

Traditionally, statisticians would manually screen data for outliers, and remove them, usually checking the source of the data to see whether the outliers were erroneously recorded. Indeed, in the speed-of-light example above, it is easy to see and remove the two outliers prior to proceeding with any further analysis. However, in modern times, data sets often consist of large numbers of variables being measured on large numbers of experimental units. Therefore, manual screening for outliers is often impractical.

Outliers can often interact in such a way that they mask each other. As a simple example, consider a small univariate data set containing one modest and one large outlier. The estimated standard deviation will be grossly inflated by the large outlier. The result is that the modest outlier looks relatively normal. As soon as the large outlier is removed, the estimated standard deviation shrinks, and the modest outlier now looks unusual.
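
The masking effect is easy to reproduce with z-scores; the data below are invented for illustration:

```python
from statistics import mean, stdev

def z_scores(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Clean data plus one modest outlier (15) and one large outlier (100).
data = [9, 10, 10, 11, 10, 9, 11, 10, 15, 100]

# With the large outlier present, the inflated stdev masks the modest one:
print(z_scores(data)[-2])        # ≈ -0.16: looks unremarkable

# Remove the large outlier and the modest one stands out:
reduced = data[:-1]
print(z_scores(reduced)[-1])     # ≈ 2.5: now flagged as unusual
```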

This problem of masking gets worse as the complexity of the data increases. For example, in regression problems, diagnostic plots are used to identify outliers. However, it is common that once a few outliers have been removed, others become visible. The problem is even worse in higher dimensions.

Robust methods provide automatic ways of detecting, downweighting (or removing), and flagging outliers, largely removing the need for manual screening. Care must be taken; initial data showing the ozone hole first appearing over Antarctica were rejected as outliers by non-human screening.

Variety of applications

Although this article deals with general principles for univariate statistical methods, robust methods also exist for regression problems, generalized linear models, and parameter estimation of various distributions.

Measures of robustness

The basic tools used to describe and measure robustness are the breakdown point, the influence function and the sensitivity curve.

Breakdown point

Intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g. arbitrarily large observations) an estimator can handle before giving an incorrect (e.g., arbitrarily large) result. Usually, the asymptotic (infinite sample) limit is quoted as the breakdown point, although the finite-sample breakdown point may be more useful. For example, given n independent random variables X_1, …, X_n and the corresponding realizations x_1, …, x_n, we can use the sample mean X̄ = (X_1 + ⋯ + X_n)/n to estimate the mean. Such an estimator has a breakdown point of 0 (or finite-sample breakdown point of 1/n) because we can make X̄ arbitrarily large just by changing any one of x_1, …, x_n.

The higher the breakdown point of an estimator, the more robust it is. Intuitively, we can understand that a breakdown point cannot exceed 50% because if more than half of the observations are contaminated, it is not possible to distinguish between the underlying distribution and the contaminating distribution (Rousseeuw & Leroy 1987). Therefore, the maximum breakdown point is 0.5 and there are estimators which achieve such a breakdown point. For example, the median has a breakdown point of 0.5. The X% trimmed mean has a breakdown point of X%, for the chosen level of X. Huber (1981) and Maronna et al. (2019) contain more details. The level and power breakdown points of tests are investigated in He, Simpson & Portnoy (1990).
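
The median's 50% breakdown point can be demonstrated directly; the dataset and the contamination value here are arbitrary choices for illustration:

```python
from statistics import median

n = 100
clean = list(range(n + 1))[:n]     # the values 0..99; median is 49.5

def contaminate(k):
    """Replace the k largest observations with an absurd value."""
    return clean[:n - k] + [10**9] * k

# The median stays within the range of the clean data until the
# contamination fraction reaches 50%:
print(median(contaminate(49)))     # still within the clean range
print(median(contaminate(50)))     # breakdown: dragged toward 10**9
```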

Statistics with high breakdown points are sometimes called resistant statistics.

Example: speed-of-light data

In the speed-of-light example, removing the two lowest observations causes the mean to change from 26.2 to 27.75, a change of 1.55. The estimate of scale produced by the Qn method is 6.3. We can divide this by the square root of the sample size to get a robust standard error, and we find this quantity to be 0.78. Thus, the change in the mean resulting from removing two outliers is approximately twice the robust standard error.

The 10% trimmed mean for the speed-of-light data is 27.43. Removing the two lowest observations and recomputing gives 27.67. The trimmed mean is less affected by the outliers and has a higher breakdown point.

If we replace the lowest observation, −44, by −1000, the mean becomes 11.73, whereas the 10% trimmed mean is still 27.43. In many areas of applied statistics, it is common for data to be log-transformed to make them near symmetrical. Very small values become large negative values when log-transformed, and zeroes become negatively infinite. Therefore, this example is of practical interest.

Empirical influence function

The empirical influence function is a measure of the dependence of the estimator on the value of any one of the points in the sample. It is a model-free measure in the sense that it simply relies on calculating the estimator again with a different sample. Tukey's biweight function, discussed in the section on M-estimators below, is an example of what a "good" (in a sense defined later on) empirical influence function should look like.

In mathematical terms, an influence function is defined as a vector in the space of the estimator, which is in turn defined for a sample which is a subset of the population:

  1. (Ω, 𝒜, P) is a probability space,
  2. (𝒳, Σ) is a measurable space (state space),
  3. Θ is a parameter space of dimension p,
  4. (Γ, S) is a measurable space.

For example,

  1. (Ω, 𝒜, P) is any probability space,
  2. (𝒳, Σ) = (ℝ, ℬ),
  3. Θ = ℝ × ℝ⁺ (e.g., a location and a scale parameter).

The empirical influence function is defined as follows.

Let X_1, …, X_n : Ω → 𝒳 be i.i.d. and let (x_1, …, x_n) be a sample from these variables. Let T_n : 𝒳ⁿ → Γ be an estimator and let i ∈ {1, …, n}. The empirical influence function EIF_i at observation i is defined by:

  EIF_i : x ∈ 𝒳 ↦ n · ( T_n(x_1, …, x_{i−1}, x, x_{i+1}, …, x_n) − T_n(x_1, …, x_{i−1}, x_i, x_{i+1}, …, x_n) )

What this means is that we are replacing the i-th value in the sample by an arbitrary value x and looking at the output of the estimator. Alternatively, the EIF is defined as the effect, scaled by n+1 instead of n, on the estimator of adding the point x to the sample.
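
The definition translates directly into code: replace the i-th value, recompute the estimator, and rescale by n. The datasets below are illustrative:

```python
from statistics import mean, median

def eif(estimator, sample, i, x):
    """Empirical influence at observation i: replace the i-th value
    with x and rescale the change in the estimator by n."""
    perturbed = list(sample)
    perturbed[i] = x
    return len(sample) * (estimator(perturbed) - estimator(sample))

sample = [2, 3, 5, 6, 9]
# For the mean, the influence grows without bound in x...
print(eif(mean, sample, 0, 1000))      # 998.0
# ...while for the median it is bounded, however extreme x becomes.
print(eif(median, sample, 0, 1000))    # 5
```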

Influence function and sensitivity curve

Influence function when Tukey's biweight function (see section M-estimators below) is used as a loss function. Points with large deviation have no influence (y=0).

Instead of relying solely on the data, we could use the distribution of the random variables. The approach is quite different from that of the previous paragraph. What we are now trying to do is to see what happens to an estimator when we change the distribution of the data slightly: it assumes a distribution, and measures sensitivity to change in this distribution. By contrast, the empirical influence assumes a sample set, and measures sensitivity to change in the samples.

Let A be a convex subset of the set of all finite signed measures on Σ. We want to estimate the parameter θ ∈ Θ of a distribution F in A. Let the functional T : A → Γ be the asymptotic value of some estimator sequence (T_n). We will suppose that this functional is Fisher consistent, i.e. T(F_θ) = θ for all θ ∈ Θ. This means that at the model F_θ, the estimator sequence asymptotically measures the correct quantity.

Let G be some distribution in A. What happens when the data doesn't follow the model F exactly but another, slightly different distribution, "going towards" G?

We're looking at:

  dT_{G−F}(F) = lim_{t→0⁺} [ T(tG + (1−t)F) − T(F) ] / t ,

which is the one-sided Gateaux derivative of T at F, in the direction of G − F.

Let x ∈ 𝒳. Δ_x is the probability measure which gives mass 1 to {x}. We choose G = Δ_x. The influence function is then defined by:

  IF(x; T; F) = lim_{t→0⁺} [ T(tΔ_x + (1−t)F) − T(F) ] / t .

It describes the effect of an infinitesimal contamination at the point on the estimate we are seeking, standardized by the mass of the contamination (the asymptotic bias caused by contamination in the observations). For a robust estimator, we want a bounded influence function, that is, one which does not go to infinity as x becomes arbitrarily large.
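
For the mean this limit can be evaluated directly, since the mean of the mixture of a point mass at x (weight t) with a distribution of mean μ (weight 1−t) is t·x + (1−t)·μ. The sketch below (illustrative values, finite t as a numerical stand-in for the limit) shows the resulting influence x − μ, which is unbounded in x:

```python
from statistics import mean

def if_mean(x, population, t=1e-6):
    """Approximate the influence function of the mean at x by mixing
    a small point mass at x into the distribution (here a finite sample)."""
    mu = mean(population)
    contaminated = (1 - t) * mu + t * x   # mean of the mixture
    return (contaminated - mu) / t

pop = [2, 3, 5, 6, 9]        # stand-in for F (mean 5)
print(if_mean(100, pop))     # ≈ 95  = x − mean: grows without bound
print(if_mean(-100, pop))    # ≈ −105
```

Because this influence function is unbounded, the mean is not robust; a bounded influence function, as for the median or for redescending M-estimators, is what robustness requires.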

The empirical influence function uses the empirical distribution function F̂ instead of the distribution function F, making use of the plug-in principle.

Desirable properties

Properties of an influence function that bestow it with desirable performance are:

  1. Finite rejection point ρ*,
  2. Small gross-error sensitivity γ*,
  3. Small local-shift sensitivity λ*.

Rejection point

The rejection point is the radius beyond which the influence function vanishes:

  ρ* = inf { r > 0 : IF(x; T; F) = 0 for all |x| > r } .

An estimator with a finite rejection point completely discards sufficiently extreme observations.

Gross-error sensitivity

The gross-error sensitivity is the worst-case influence a single contaminating point can have:

  γ* = sup_x |IF(x; T; F)| .

A bounded (and small) γ* is desirable; for the mean, γ* is infinite.

Local-shift sensitivity

The local-shift sensitivity is

  λ* = sup_{x ≠ y} |IF(y; T; F) − IF(x; T; F)| / |y − x| .

This value, which looks a lot like a Lipschitz constant, represents the effect of shifting an observation slightly from x to a neighbouring point y, i.e., adding an observation at y and removing one at x.

M-estimators

(The mathematical context of this paragraph is given in the section on empirical influence functions.)

Historically, several approaches to robust estimation were proposed, including R-estimators and L-estimators. However, M-estimators now appear to dominate the field as a result of their generality, their potential for high breakdown points and comparatively high efficiency. See Huber (1981).

M-estimators are not inherently robust. However, they can be designed to achieve favourable properties, including robustness. M-estimators are a generalization of maximum likelihood estimators (MLEs), which are determined by maximizing the likelihood ∏ f(x_i) or, equivalently, minimizing ∑ −log f(x_i). In 1964, Huber proposed to generalize this to the minimization of ∑ ρ(x_i), where ρ is some function. MLEs are therefore a special case of M-estimators (hence the name: "Maximum likelihood type" estimators).

Minimizing ∑ ρ(x_i) can often be done by differentiating ρ and solving ∑ ψ(x_i) = 0, where ψ(x) = dρ(x)/dx (if ρ has a derivative).

Several choices of ρ and ψ have been proposed. Four common ρ functions, with their corresponding ψ functions, are the squared error, the absolute error, the Huber loss and Tukey's biweight.

For squared errors, ρ(x) increases at an accelerating rate, whilst for absolute errors, it increases at a constant rate. When Winsorizing is used, a mixture of these two effects is introduced: for small values of x, ρ increases at the squared rate, but once the chosen threshold is reached (1.5 in this example), the rate of increase becomes constant. This Winsorised estimator is also known as the Huber loss function.

Tukey's biweight (also known as bisquare) function behaves in a similar way to the squared error function at first, but for larger errors, the function tapers off.
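
A sketch of two of these ψ functions. The tuning constants (k = 1.5 for Huber, matching the threshold mentioned above, and c = 4.685 for the biweight, a conventional 95%-efficiency value) are common defaults rather than values taken from the source:

```python
def psi_huber(x, k=1.5):
    """Huber psi: linear near 0 (like the mean), constant beyond k."""
    return max(-k, min(k, x))

def psi_biweight(x, c=4.685):
    """Tukey biweight psi: redescends to 0, so gross outliers
    receive zero influence."""
    if abs(x) >= c:
        return 0.0
    return x * (1 - (x / c) ** 2) ** 2

# Influence of an extreme residual under each psi:
print(psi_huber(100))      # 1.5  (bounded)
print(psi_biweight(100))   # 0.0  (rejected entirely)
```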

Properties of M-estimators

M-estimators do not necessarily relate to a probability density function. Therefore, off-the-shelf approaches to inference that arise from likelihood theory can not, in general, be used.

It can be shown that M-estimators are asymptotically normally distributed so that as long as their standard errors can be computed, an approximate approach to inference is available.

Since M-estimators are normal only asymptotically, for small sample sizes it might be appropriate to use an alternative approach to inference, such as the bootstrap. However, M-estimates are not necessarily unique (i.e., there might be more than one solution that satisfies the equations). Also, it is possible that any particular bootstrap sample can contain more outliers than the estimator's breakdown point. Therefore, some care is needed when designing bootstrap schemes.

Of course, as we saw with the speed-of-light example, the mean is only normally distributed asymptotically and when outliers are present the approximation can be very poor even for quite large samples. However, classical statistical tests, including those based on the mean, are typically bounded above by the nominal size of the test. The same is not true of M-estimators and the type I error rate can be substantially above the nominal level.

These considerations do not "invalidate" M-estimation in any way. They merely make clear that some care is needed in their use, as is true of any other method of estimation.
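
One standard way to compute an M-estimate of location is iteratively reweighted averaging. The sketch below (Huber ψ, with the scale fixed at the normal-consistent MAD for simplicity) is an illustrative implementation under those assumptions, not code from the source:

```python
from statistics import median

def huber_location(xs, k=1.5, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means.
    Scale is fixed at the (normal-consistent) MAD for simplicity."""
    m = median(xs)
    s = 1.4826 * median(abs(x - m) for x in xs) or 1.0
    mu = m
    for _ in range(max_iter):
        # Weight w(r) = psi(r)/r: 1 inside [-k, k], k/|r| outside.
        w = [min(1.0, k / (abs((x - mu) / s) or 1e-12)) for x in xs]
        new_mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

data = [27, 26, 28, 25, 29, 27, 26, -44]   # one gross outlier
print(huber_location(data))   # ≈ 26.5, unlike the raw mean (18.0)
```

The outlier is not removed; it is merely downweighted until it no longer dominates the fit, which is the sense in which M-estimation automates the screening discussed earlier.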

Influence function of an M-estimator

It can be shown that the influence function of an M-estimator T is proportional to its ψ function,

  IF(x; T, F) = M⁻¹ ψ(x),

with the matrix M given by:

  M = −∫ (∂ψ(x, θ)/∂θ) dF(x).

This means we can derive the properties of such an estimator (such as its rejection point, gross-error sensitivity or local-shift sensitivity) when we know its ψ function.

Choice of ψ and ρ

In many practical situations, the choice of the ψ function is not critical to gaining a good robust estimate, and many choices will give similar results that offer great improvements, in terms of efficiency and bias, over classical estimates in the presence of outliers.

Theoretically, redescending ψ functions are to be preferred, and Tukey's biweight (also known as bisquare) function is a popular choice; a common recommendation is the biweight function with efficiency at the normal set to 85%.

Robust parametric approaches

M-estimators do not necessarily relate to a density function and so are not fully parametric. Fully parametric approaches to robust modeling and inference, both Bayesian and likelihood approaches, usually deal with heavy-tailed distributions such as Student's t-distribution.

For the t-distribution with ν degrees of freedom, it can be shown that

  ψ(x) = x / (x² + ν).

For ν = 1, the t-distribution is equivalent to the Cauchy distribution. The degrees of freedom ν is sometimes known as the kurtosis parameter. It is the parameter that controls how heavy the tails are. In principle, ν can be estimated from the data in the same way as any other parameter. In practice, it is common for there to be multiple local maxima when ν is allowed to vary. As such, it is common to fix ν at a value around 4 or 6.
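
Taking ψ(x) = x/(x² + ν) for the t-distribution (proportional to the location score of the t density), a quick sketch shows the redescending behaviour: the influence of a residual shrinks toward zero as the residual grows, for every ν:

```python
def psi_t(x, nu):
    """psi function for the t-distribution with nu degrees of freedom
    (proportional to the location score of the t density)."""
    return x / (x ** 2 + nu)

# For each nu, the influence redescends toward 0 as |x| grows,
# so extreme observations have vanishing influence.
for nu in (1, 2, 4, 10):
    print(nu, psi_t(10.0, nu), psi_t(1000.0, nu))
```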

Example: speed-of-light data

For the speed-of-light data, the likelihood can be maximized with the kurtosis parameter ν either estimated freely along with location and scale, or fixed at a value such as ν = 4. In both cases, the resulting maximum likelihood estimate of location is close to the robust estimates above, and far less affected by the two outliers than the raw mean.

A pivotal quantity is a function of data, whose underlying population distribution is a member of a parametric family, that is not dependent on the values of the parameters. An ancillary statistic is such a function that is also a statistic, meaning that it is computed in terms of the data alone. Such functions are robust to parameters in the sense that they are independent of the values of the parameters, but not robust to the model in the sense that they assume an underlying model (parametric family), and in fact, such functions are often very sensitive to violations of the model assumptions. Thus test statistics, frequently constructed in terms of these so as not to be sensitive to assumptions about parameters, are still very sensitive to model assumptions.

Replacing outliers and missing values

Replacing missing data is called imputation. If there are relatively few missing points, there are some models which can be used to estimate values to complete the series, such as replacing missing values with the mean or median of the data. Simple linear regression can also be used to estimate missing values. In addition, outliers can sometimes be accommodated in the data through the use of trimmed means, other scale estimators apart from standard deviation (e.g., MAD) and Winsorization. In calculations of a trimmed mean, a fixed percentage of data is dropped from each end of the ordered data, thus eliminating the outliers. The mean is then calculated using the remaining data. Winsorizing involves accommodating an outlier by replacing it with the next highest or next smallest value as appropriate.
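
A minimal Winsorizing sketch: the k most extreme values on each side are replaced by the nearest retained value, rather than dropped as in trimming. The data are invented for illustration:

```python
from statistics import mean

def winsorize(xs, k=1):
    """Replace the k smallest values by the (k+1)-th smallest and the
    k largest values by the (k+1)-th largest."""
    ys = sorted(xs)
    lo, hi = ys[k], ys[-k - 1]
    return [min(max(x, lo), hi) for x in xs]

data = [2, 3, 5, 6, 9, 1000]
print(mean(data))              # ≈ 170.8: ruined by the outlier
print(winsorize(data))         # [3, 3, 5, 6, 9, 9]
print(mean(winsorize(data)))   # ≈ 5.8: the outlier is accommodated
```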

However, using these types of models to predict missing values or outliers in a long time series is difficult and often unreliable, particularly if the number of values to be in-filled is relatively high in comparison with total record length. The accuracy of the estimate depends on how good and representative the model is and how long the period of missing values extends. When dynamic evolution is assumed in a series, the missing data point problem becomes an exercise in multivariate analysis (rather than the univariate approach of most traditional methods of estimating missing values and outliers). In such cases, a multivariate model will be more representative than a univariate one for predicting missing values. The Kohonen self organising map (KSOM) offers a simple and robust multivariate model for data analysis, thus providing good possibilities to estimate missing values, taking into account their relationship or correlation with other pertinent variables in the data record.

Standard Kalman filters are not robust to outliers. To this end Ting, Theodorou & Schaal (2007) have recently shown that a modification of Masreliez's theorem can deal with outliers.

One common approach to handle outliers in data analysis is to perform outlier detection first, followed by an efficient estimation method (e.g., the least squares). While this approach is often useful, one must keep in mind two challenges. First, an outlier detection method that relies on a non-robust initial fit can suffer from the effect of masking, that is, a group of outliers can mask each other and escape detection. Second, if a high breakdown initial fit is used for outlier detection, the follow-up analysis might inherit some of the inefficiencies of the initial estimator.

Use in machine learning

Although influence functions have a long history in statistics, they were not widely used in machine learning due to several challenges. One of the primary obstacles is that traditional influence functions rely on expensive second-order derivative computations and assume model differentiability and convexity. These assumptions are limiting, especially in modern machine learning, where models are often non-differentiable, non-convex, and operate in high-dimensional spaces.

Koh & Liang (2017) addressed these challenges by introducing methods to efficiently approximate influence functions using second-order optimization techniques, such as those developed by Pearlmutter (1994), Martens (2010), and Agarwal, Bullins & Hazan (2017). Their approach remains effective even when the assumptions of differentiability and convexity degrade, enabling influence functions to be used in the context of non-convex deep learning models. They demonstrated that influence functions are a powerful and versatile tool that can be applied to a variety of tasks in machine learning, including:

  • Understanding Model Behavior: Influence functions help identify which training points are most “responsible” for a given prediction, offering insights into how models generalize from training data.
  • Debugging Models: Influence functions can assist in identifying domain mismatches—when the training data distribution does not match the test data distribution—which can cause models with high training accuracy to perform poorly on test data, as shown by Ben-David et al. (2010). By revealing which training examples contribute most to errors, developers can address these mismatches.
  • Dataset Error Detection: Noisy or corrupted labels are common in real-world data, especially when crowdsourced or adversarially attacked. Influence functions allow human experts to prioritize reviewing only the most impactful examples in the training set, facilitating efficient error detection and correction.
  • Adversarial Attacks: Models that rely heavily on a small number of influential training points are vulnerable to adversarial perturbations. These perturbed inputs can significantly alter predictions and pose security risks in machine learning systems where attackers have access to the training data (See adversarial machine learning).

Koh and Liang’s contributions have opened the door for influence functions to be used in various applications across machine learning, from interpretability to security, marking a significant advance in their applicability.

Permutation test

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Permutation_test

A permutation test (also called a re-randomization test or shuffle test) is an exact statistical hypothesis test that makes use of proof by contradiction. A permutation test involves two or more samples. The null hypothesis is that all samples come from the same distribution. Under the null hypothesis, the distribution of the test statistic is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data. Permutation tests are, therefore, a form of resampling.

Permutation tests can be understood as surrogate data testing where the surrogate data under the null hypothesis are obtained through permutations of the original data.

In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.

Permutation tests should not be confused with randomized tests.

Method

Animation of a permutation test being computed on sets of 4 and 5 random values. The 4 values in red are drawn from one distribution, and the 5 values in blue from another; we'd like to test whether the mean values of the two distributions are different. The alternative hypothesis is that the mean of the first distribution is higher than the mean of the second; the null hypothesis is that both groups of samples are drawn from the same distribution. There are 126 distinct ways to put 4 values into one group and 5 into another (9-choose-4 or 9-choose-5). Of these, one corresponds to the original labeling, and the other 125 are "permutations" that generate the histogram of mean differences shown. The p-value of the hypothesis is estimated as the proportion of permutations that give a difference as large or larger than the difference of means of the original samples. In this example, the null hypothesis cannot be rejected at the p = 5% level.

To illustrate the basic idea of a permutation test, suppose we collect two samples, one from group A and one from group B, with sample means x̄_A and x̄_B, and that we want to know whether the two samples come from the same distribution. Let n_A and n_B be the sample sizes collected from each group. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject, at some significance level, the null hypothesis H₀ that the data drawn from A is from the same distribution as the data drawn from B.

The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, T_obs.

Next, the observations of groups A and B are pooled, and the difference in sample means is calculated and recorded for every possible way of dividing the pooled values into two groups of size n_A and n_B (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences (for this sample) under the null hypothesis that group labels are exchangeable (i.e., are randomly assigned).

The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than T_obs. The two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than |T_obs|. Many implementations of permutation tests require that the observed data itself be counted as one of the permutations so that the permutation p-value will never be zero.

Alternatively, if the only purpose of the test is to reject or not reject the null hypothesis, one could sort the recorded differences, and then observe if T_obs is contained within the middle (1 − α) × 100% of them, for some significance level α. If it is not, we reject the hypothesis of identical probability curves at the α significance level.
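The procedure above can be sketched in a few lines of Python. This is an illustrative implementation, not taken from the article; the data values are made up to mirror the 4-versus-5 example in the caption.

```python
from itertools import combinations

def exact_permutation_test(group_a, group_b):
    """One-sided exact permutation test for mean(A) > mean(B).

    Enumerates every way of splitting the pooled values into groups
    of the original sizes and returns the proportion of splits whose
    mean difference is at least as large as the observed difference.
    The original labeling is one of the enumerated splits, so the
    p-value can never be zero.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled = list(group_a) + list(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    as_extreme = 0
    total = 0
    for idx in combinations(range(n_a + n_b), n_a):
        chosen = set(idx)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in range(n_a + n_b) if i not in chosen]
        diff = sum(perm_a) / n_a - sum(perm_b) / n_b
        total += 1
        if diff >= observed:
            as_extreme += 1
    return as_extreme / total  # total == C(n_a + n_b, n_a)

# Hypothetical data: 4 "red" and 5 "blue" values, giving C(9, 4) = 126 splits.
p = exact_permutation_test([19, 22, 25, 26], [18, 21, 23, 20, 17])
```

Because every split is enumerated, the returned p-value is exact for this sample, at the cost of combinatorial growth in the number of splits.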

To exploit variance reduction with paired samples the paired permutation test needs to be applied, see paired difference test.

Relation to parametric tests

Permutation tests are a subset of non-parametric statistics. Assuming that our experimental data come from data measured from two treatment groups, the method simply generates the distribution of mean differences under the assumption that the two groups are not distinct in terms of the measured variable. From this, one then uses the observed statistic (T_obs above) to see to what extent this statistic is special, i.e., the likelihood of observing the magnitude of such a value (or larger) if the treatment labels had simply been randomized after treatment.

In contrast to permutation tests, the distributions underlying many popular "classical" statistical tests, such as the t-test, F-test, z-test, and χ² test, are obtained from theoretical probability distributions. Fisher's exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables. When sample sizes are very large, Pearson's chi-squared test will give accurate results. For small samples, the chi-squared reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher's exact test becomes more appropriate.

Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation t-test, a permutation test of association, a permutation version of Aly's test for comparing variances and so on.

The major drawbacks to permutation tests are that they

  • Can be computationally intensive and may require "custom" code for difficult-to-calculate statistics. This must be rewritten for every case.
  • Are primarily used to provide a p-value. The inversion of the test to get confidence regions/intervals requires even more computation.

Advantages

Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.

Permutation tests can be used for analyzing unbalanced designs and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001). They can also be used to analyze qualitative data that has been quantitized (i.e., turned into numbers). Permutation tests may be ideal for analyzing quantitized data that do not satisfy statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA), see PERMANOVA.

Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes.

Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based "exact" confidence intervals.

Limitations

An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance under the normality assumption. In this respect, the classic permutation t-test shares the same weakness as the classical Student's t-test (the Behrens–Fisher problem). This can be addressed in the same way the classic t-test has been extended to handle unequal variances: by employing the Welch statistic with the Satterthwaite adjustment to the degrees of freedom. A third alternative in this situation is to use a bootstrap-based test. Statistician Phillip Good explains the difference between permutation tests and bootstrap tests the following way: "Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions." Bootstrap tests are not exact. In some cases, a permutation test based on a properly studentized statistic can be asymptotically exact even when the exchangeability assumption is violated. Bootstrap-based tests can test null hypotheses about parameter values rather than whole distributions and are, therefore, suited for performing equivalence testing.

Monte Carlo testing

An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates. The realization that this could be applied to any permutation test on any dataset was an important breakthrough in the area of applied statistics. The earliest known references to this approach are Eden and Yates (1933) and Dwass (1957). This type of permutation test is known under various names: approximate permutation test, Monte Carlo permutation test, or random permutation test.
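As a sketch (with made-up data and a fixed seed for reproducibility), a Monte Carlo permutation test replaces full enumeration with random shuffles of the pooled sample:

```python
import random

def monte_carlo_permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided approximate permutation test via Monte Carlo sampling.

    Randomly reshuffles the pooled values n_permutations times and
    counts how often the absolute mean difference is at least as
    extreme as the observed one.  The observed labeling is counted as
    one permutation, so the estimated p-value is never zero.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    at_least_as_extreme = 1  # the observed labeling itself
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / (n_permutations + 1)
```

The sampling error of the resulting estimate shrinks as n_permutations grows; the confidence interval described below quantifies it.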

After N random permutations, it is possible to obtain a confidence interval for the p-value based on the binomial distribution; see binomial proportion confidence interval. For example, if after N random permutations the p-value is estimated to be p̂, then a 99% confidence interval for the true p-value p (the one that would result from trying all possible permutations) is approximately p̂ ± 2.58 × √(p̂(1 − p̂)/N).
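Concretely, a sketch of that binomial (normal-approximation) interval; the 2.576 multiplier for ~99% coverage is a standard value, and the example numbers are illustrative:

```python
import math

def p_value_confidence_interval(p_hat, n_permutations, z=2.576):
    """Normal-approximation (Wald) confidence interval for the true
    permutation p-value, given an estimate p_hat obtained from
    n_permutations random permutations; z = 2.576 corresponds to
    roughly 99% coverage."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_permutations)
    return (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

# If 10,000 random permutations give p_hat = 0.05, the 99% interval
# still straddles the usual 0.05 rejection threshold.
low, high = p_value_confidence_interval(0.05, 10_000)
```

With these numbers the interval is roughly (0.044, 0.056), which is exactly the situation the next paragraph discusses: the estimate alone cannot settle whether p is below 0.05.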

On the other hand, the purpose of estimating the p-value is most often to decide whether p ≤ α, where α is the threshold at which the null hypothesis will be rejected (typically α = 0.05). In the example above, if the estimate p̂ is close to 0.05, the confidence interval straddles the threshold and only tells us that there is roughly a 50% chance that the true p-value is smaller than 0.05, i.e. it is completely unclear whether the null hypothesis should be rejected at the level α = 0.05.

If it is only important to know whether p ≤ α for a given α, it is logical to continue simulating until the statement can be established to be true or false with a very low probability of error. Given a bound ε on the admissible probability of error (the probability of concluding that p ≤ α when in fact p > α, or vice versa), the question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee that the conclusion (which is either p ≤ α or p > α) is correct with probability at least 1 − ε. (ε will typically be chosen to be extremely small, e.g. 1/1000.) Stopping rules to achieve this have been developed which can be incorporated with minimal additional computational cost. In fact, depending on the true underlying p-value, it will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger than 100) before a decision can be reached with virtual certainty.
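The stopping-rule idea can be sketched as follows. This is an illustrative sequential scheme in the spirit of Besag and Clifford's sequential Monte Carlo p-values, not the specific rules the text refers to:

```python
def sequential_p_value(sample_statistic, observed, h=10, n_max=10_000):
    """Sequential Monte Carlo p-value: draw permutation statistics one
    at a time and stop as soon as h of them are at least as extreme as
    the observed value.  Large p-values are resolved after only a
    handful of draws, while small p-values use the full budget.
    """
    exceedances = 0
    for n in range(1, n_max + 1):
        if sample_statistic() >= observed:
            exceedances += 1
            if exceedances == h:
                return h / n              # stopped early: p-value is large
    return (exceedances + 1) / (n_max + 1)  # budget exhausted: p-value is small
```

Here sample_statistic is any zero-argument callable that returns one freshly permuted test statistic (for example, the mean difference after one random reshuffle of the pooled data). A p-value near 1 stops after roughly h draws, while a very small p-value never triggers early stopping, matching the observation above that easy decisions often need remarkably few simulations.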

Absent-mindedness

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Absent-mindedness

In the field of psychology, absent-mindedness is a mental state wherein a person is forgetfully inattentive. It is the opposite mental state of mindfulness.

Absentmindedness is often caused by things such as boredom, sleepiness, rumination, distraction, or preoccupation with one's own internal monologue. When experiencing absent-mindedness, people exhibit signs of memory lapses and weak recollection of recent events.

Absent-mindedness can be a result of a variety of other conditions often diagnosed by clinicians, such as attention deficit hyperactivity disorder and depression. Beyond an array of consequences affecting daily life, absent-mindedness can lead to more severe, long-term problems.

Conceptualization

Absent-mindedness seemingly consists of lapses of concentration or "zoning out". This can result in lapses of short- or long-term memory, depending on when the person in question was in a state of absent-mindedness. Absent-mindedness also relates directly to lapses in attention. Schacter and Dodson of the Harvard psychology department say that, in the context of memory, "absent-mindedness entails inattentive or shallow processing that contributes to weak memories of ongoing events or forgetting to do things in the future".

Causes

Though absent-mindedness is a frequent occurrence, there has been little progress made on what the direct causes of absent-mindedness are. However, it tends to co-occur with ill health, preoccupation, and distraction.

The condition has three potential causes:

  1. a low level of attention ("blanking" or "zoning out");
  2. intense attention to a single object of focus (hyperfocus) that makes a person oblivious to events around them; or
  3. unwarranted distraction of attention from the object of focus by irrelevant thoughts or environmental events.

Absent-mindedness is also noticed as a common characteristic of personalities with schizoid personality disorder.

Consequences

Lapses of attention are clearly a part of everyone's life. Some are merely inconvenient, such as missing a familiar turn-off on the highway, while some are extremely serious, such as failures of attention that cause accidents, injury, or loss of life. Sometimes, lapses of attention can have a significant impact on personal behaviour, influencing an individual's pursuit of goals. Beyond the obvious costs of accidents arising from lapses in attention, there are costs in lost time, efficiency, personal productivity, and quality of life. Such costs can also occur in the lapse and recapture of awareness and attention to everyday tasks. Individuals for whom intervals between lapses are very short are typically viewed as impaired. Given the prevalence of attentional failures in everyday life, and the ubiquitous and sometimes disastrous consequences of such failures, it is rather surprising that relatively little work has been done to directly measure individual differences in everyday errors arising from propensities for failures of attention. Absent-mindedness can also lead to bad grades at school, boredom, and depression.

The absent-minded professor is a stock character often depicted in fictional works, usually as a talented academic whose focus on academic matters leads them to ignore or forget their surroundings. This stereotypical view can be traced back as far as the philosopher Thales, who it is said, "walked at night with his eyes focused on the heavens and, as a result, fell down a well". One classic example of this is in the Disney film The Absent-Minded Professor, made in 1961 and based on the short story "A Situation of Gravity" by Samuel W. Taylor. Two examples of this character portrayed in more modern media include Doctor Emmett Brown from Back to the Future and Professor Farnsworth of Futurama.

In literature, "The Absent-Minded Beggar" is a poem by Rudyard Kipling, written in 1899 and directed at the absent-mindedness of the population of Great Britain in ignoring the plight of their troops in the Boer War. The poem illustrated the fact that soldiers who could not return to their previous jobs needed support, and the need to raise money to support the fighting troops. The poem was also set to music by Arthur Sullivan, and a campaign was raised to support the British troops, especially on their departure and return, and the sick and wounded. Franz Kafka also wrote "Absent-minded Window-gazing", one of his short-story titles from Betrachtung.


Measurement and treatment

Absent-mindedness can be avoided or mitigated in several ways. Although this cannot be accomplished through medical procedures, it can be accomplished through psychological treatments. Some examples include altering work schedules to make them shorter, having frequent rest periods, and utilizing a drowsy-operator warning device.

Absent-mindedness and its related topics are often measured in scales developed in studies to survey boredom and attention levels. For instance, the Attention-Related Cognitive Errors Scale (ARCES) reflects errors in performance that result from attention lapses. Another scale, called the Mindful Attention Awareness Scale (MAAS) measures the ability to maintain a reasonable level of attention in everyday life. The Boredom Proneness Scale (BPS) measures the level of boredom in relation to the attention level of the subject.

Absent-mindedness can lead to automatic behaviors or automatisms. Additionally, absent-minded actions can involve behavioral mistakes. A phenomenon called Attention-Lapse Induced Alienation occurs when a person makes a mistake while absent-minded. The person then attributes the mistake to their hand rather than their self, because they were not paying attention.

Another related topic to absent-mindedness is daydreaming. It may be beneficial to differentiate between these two topics. Daydreaming can be viewed as a coping or defense mechanism. As opposed to inattentiveness, daydreaming is a way for emotions to be explored and even expressed through fantasy. It may even bring attention to previously experienced problems or circumstances. It is also a way to bring about creativity.

Thursday, November 7, 2024

Mind-wandering

From Wikipedia, the free encyclopedia

Mind-wandering is loosely defined as thoughts that are not produced from the current task. Mind-wandering consists of thoughts that are task-unrelated and stimulus-independent. This can be in the form of three different subtypes: positive constructive daydreaming, guilty fear of failure, and poor attentional control.

In general, a folk explanation of mind-wandering could be described as the experience of thoughts not remaining on a single topic for a long period of time, particularly when people are engaged in an attention-demanding task.

One context in which mind-wandering often occurs is driving. This is because driving under optimal conditions becomes an almost automatic activity that can require minimal use of the task positive network, the brain network that is active when one is engaged in an attention-demanding activity. In situations where vigilance is low, people do not remember what happened in the surrounding environment because they are preoccupied with their thoughts. This is known as the decoupling hypothesis.

Studies using event-related potentials (ERPs) have quantified the extent that mind-wandering reduces the cortical processing of the external environment. When thoughts are unrelated to the task at hand, the brain processes both task-relevant and unrelated sensory information in a less detailed manner.

Mind-wandering appears to be a stable trait of people and a transient state. Studies have linked performance problems in the laboratory and in daily life. Mind-wandering has been associated with possible car accidents. Mind-wandering is also intimately linked to states of affect. Studies indicate that task-unrelated thoughts are common in people with low or depressed mood. Mind-wandering also occurs when a person is intoxicated via the consumption of alcohol.

Studies have demonstrated a prospective bias to spontaneous thought because individuals tend to engage in more future than past related thoughts during mind-wandering. The default mode network is thought to be involved in mind-wandering and internally directed thought, although recent work has challenged this assumption.

History

The history of mind-wandering research dates back to 18th century England. British philosophers struggled to determine whether mind-wandering occurred in the mind or if an outside source caused it. In 1921, Varendonck published The Psychology of Day-Dreams, in which he traced his "'trains of thoughts' to identify their origins, most often irrelevant external influences".

Wallas (1926) considered mind-wandering as an important aspect of his second stage of creative thought – incubation. It was not until the 1960s that the first documented studies were conducted on mind-wandering. John Antrobus and Jerome L. Singer developed a questionnaire and discussed the experience of mind-wandering.

This questionnaire, known as the Imaginal Processes Inventory (IPI), provides a trait measure of mind-wandering and it assesses the experience on three dimensions: how vivid the person's thoughts are, how many of those thoughts are guilt- or fear-based, and how deep into the thought a person goes. As technology continues to develop, psychologists are starting to use functional magnetic resonance imaging to observe mind-wandering in the brain and reduce psychologists' reliance on verbal reports.

Research methods

Jonathan Smallwood and colleagues popularized the study of mind-wandering using thought sampling and questionnaires. Mind-wandering is studied using experience sampling either online or retrospectively. One common paradigm within which to study mind-wandering is the SART (sustained attention to response task).

In a SART task there are two categories of words, one of which contains the target words. In each block of the task a word appears for about 300 ms, followed by a pause, and then another word. When a target word appears, the participant hits a designated key. About 60% of the time after a target word, a thought probe appears to gauge whether thoughts were on task. If participants were not engaged in the task, they were experiencing task-unrelated thoughts (TUTs), signifying mind-wandering.

Another method of judging TUTs is the experience sampling method (ESM). Participants carry around a personal digital assistant (PDA) that signals several times a day. At the signal, a questionnaire is provided. The questionnaire's questions vary but can include: (a) whether or not their minds had wandered at the time of the signal, (b) what state of control they had over their thoughts, and (c) the content of their thoughts.

Questions about context are also asked to measure the level of attention necessary for the task. One process used was to give participants something to focus on and then at different times ask them what they were thinking about. Those who were not thinking about what was given to them were considered "wandering". Another process was to have participants keep a diary of their mind-wandering. Participants are asked to write a brief description of their mind-wandering and the time in which it happened. These methodologies are improvements on past methods that were inconclusive.

Neuroscience

Mind-wandering is important in understanding how the brain produces what William James called the train of thought and the stream of consciousness. This aspect of mind-wandering research is focused on understanding how the brain generates the spontaneous and relatively unconstrained thoughts that are experienced when the mind wanders.

One candidate neural mechanism for generating this aspect of experience is a network of regions in the medial frontal and medial parietal cortex known as the default network. This network of regions is highly active even when participants are resting with their eyes closed, suggesting a role in generating spontaneous internal thoughts. One relatively controversial result is that periods of mind-wandering are associated with increased activation in both the default and executive systems, a result that implies that mind-wandering may often be goal-oriented.

The default mode network is commonly assumed to be involved in mind-wandering. The network is active when a person is not focused on the outside world and the brain is at wakeful rest, because experiences such as mind-wandering and daydreaming are common in this state.

It is also active when the individual is thinking about others, thinking about themselves, remembering the past, and planning for the future. However, recent studies show that signals in the default mode network provide information regarding patterns of detailed experience in active task states. These data suggest that the relationship between the default mode network and mind-wandering remains a matter of conjecture.

In addition to neural models, computational models of consciousness based on Bernard Baars' Global Workspace theory suggest that mind-wandering, or "spontaneous thought" may involve competition between internally and externally generated activities attempting to gain access to a limited capacity central network.

Individual differences

There are individual differences in some aspects of mind-wandering between older and younger adults. Older adults reported less mind-wandering than younger adults, though their task performance suggested comparable amounts of off-task thought. There were also differences in how participants responded to an error.

After an error, older adults took longer to return focus back to the task when compared with younger adults. It is possible that older adults reflect more about an error due to conscientiousness. Research has shown that older adults tend to be more conscientious than young adults. Personality can also affect mind-wandering.

People that are more conscientious are less prone to mind-wandering. Being more conscientious allows people to stay focused on the task better which causes fewer instances of mind-wandering. Differences in mind-wandering between young and older adults may be limited because of this personality difference.

Mental disorders such as ADHD (attention deficit hyperactivity disorder) are linked to mind-wandering. Seli et al. (2015) found that spontaneous mind-wandering, the uncontrolled or unwarranted shifting of attention, is a characteristic of those who have ADHD. However, they note that deliberate mind-wandering, or the purposeful shifting of one's attention to different stimuli, is not a consistent characteristic of having ADHD.

Franklin et al. (2016) arrived at similar conclusions; they had college students take multiple psychological evaluations that gauge ADHD symptom strength. Then, they had the students read a portion of a general science textbook. At various times and at random intervals throughout their reading, participants were prompted to answer a question that asked if their attention was either on task, slightly on task, slightly off task, or off task prior to the interruption.

In addition, they were asked if they were aware, unaware, or neither aware nor unaware of their thoughts as they read. Lastly, they were tasked to press the space bar if they ever caught themselves mind-wandering. For a week after these assessments, the students answered follow-up questions that also gauged mind-wandering and awareness.

This study's results revealed that students with higher ADHD symptomology showed less task-oriented control than those with lower ADHD symptomology. Additionally, those with lower ADHD symptomology were more likely to engage in useful or deliberate mind-wandering and were more aware of their inattention. One of the strengths of this study is that it was performed in both lab and daily-life situations, giving it broad application.

Mind-wandering in and of itself is not necessarily indicative of attention deficiencies. Studies show that humans typically spend 25-50% of their time thinking about thoughts irrelevant to their current situations.

In many disorders it is the regulation of the overall amount of mind-wandering that is disturbed, leading to increased distractibility when performing tasks. Additionally, the contents of mind-wandering is changed; thoughts can be more negative and past-oriented, particularly unstable or self-centered.

Working memory

Recent research has studied the relationship between mind-wandering and working memory capacity. Working memory capacity reflects a person's ability to control and direct their own thoughts. This relationship requires more research to understand how the two influence one another. It is possible that mind-wandering causes lower performance on working memory capacity tasks or that lower working memory capacity causes more instances of mind-wandering.

Only the second of these has actually been proven. Reports of task-unrelated thoughts are less frequent when performing tasks that do not demand continuous use of working memory than tasks which do. Moreover, individual difference studies demonstrate that when tasks are non-demanding, high levels of working memory capacity are associated with more frequent reports of task-unrelated thinking especially when it is focused on the future. By contrast, when performing tasks that demand continuous attention, high levels of working memory capacity are associated with fewer reports of task-unrelated thoughts.

Together these data are consistent with the claim that working memory capacity helps sustain a train of thought whether it is generated in response to a perceptual event or is self-generated by the individual. Therefore, under certain circumstances, the experience of mind-wandering is supported by working memory resources. Working memory capacity variation in individuals has been proven to be a good predictor of the natural tendency for mind-wandering to occur during cognitively demanding tasks and various activities in daily life.

Mind-wandering sometimes occurs as a result of saccades, which are the movements of one's eyes to different visual stimuli. In an antisaccade task, for example, subjects with higher working memory capacity scores resisted looking at the flashing visual cue better than participants with lower working memory capacity. Higher working memory capacity is associated with fewer saccades toward environmental cues.

Mind-wandering has been shown to be related to goal orientation; people with higher working memory capacity keep their goals more accessible than those who have lower working memory capacity, thus allowing these goals to better guide their behavior and keep them on task.

Another study compared differences in speed of processing information between people of different ages. The task they used was a go/no go task where participants responded if a white arrow moved in a specific direction but did not respond if the arrow moved in the other direction or was a different color. In this task, children and young adults showed similar speed of processing but older adults were significantly slower.

Speed of processing information affects how much information can be processed in working memory. People with faster processing speed can encode information into memory better than people with slower processing speed. This can lead to memory for more items because more things can be encoded.

Retention

Mind-wandering affects retention. Working memory capacity is directly related to reading comprehension levels, and participants with lower working memory capacity perform worse on comprehension-based tests.

When investigating how mind-wandering affects retention of information, experiments are conducted where participants are asked a variety of questions about factual information, or deducible information while reading a detective novel. Participants are also asked about the state of their mind before the questions are asked.

Throughout the reading itself, the author provides important cues to identify the villain, known as inference critical episodes (ICEs). The questions are asked randomly and before critical episodes are reached. It was found that episodes of mind-wandering, especially early on in the text led to decreased identification of the villain and worse results on both factual and deducible questions.

Therefore, when mind-wandering occurs during reading, the text is not processed well enough to remember key information about the story. Furthermore, both the timing and the frequency of mind-wandering helps determine how much information is retained from the narrative.

Reading comprehension

Reading comprehension must also be investigated in terms of text difficulty. To assess this, researchers provide an easy and a hard version of a reading task. During this task, participants are interrupted and asked whether their thoughts at the time of interruption had been related or unrelated to the task. Researchers find that mind-wandering has a negative effect on text comprehension in more difficult readings.

This supports the executive-resource hypothesis which describes that both task related and task-unrelated thoughts (TUT) compete for executive function resources. Therefore, when the primary task is difficult, little resources are available for mind-wandering, whereas when the task is simple, the possibility for mind-wandering is abundant because it takes little executive control to focus on simple tasks.

However, mind-wandering tends to occur more frequently in harder readings as opposed to easier readings. Therefore, it is possible that similar to retention, mind-wandering increases when readers have difficulty constructing a model of the story.

Happiness

As part of his doctoral research at Harvard University, Matthew Killingsworth used an iPhone app that captured a user's feelings in real time. The tool alerts the user at random times and asks: "How are you feeling right now?" and "What are you doing right now?" Killingsworth and Gilbert's analysis suggested that mind-wandering was much more typical in daily activities than in laboratory settings.

They also describe that people were less happy when their minds were wandering than when they were otherwise occupied. This effect was somewhat counteracted by people's tendency to mind-wander to happy topics, but unhappy mind-wandering was more likely to be rated as more unpleasant than other activities.

The authors note that unhappy moods can also cause mind-wandering, but the time-lags between mind-wandering and mood suggests that mind-wandering itself can also lead to negative moods. Furthermore, research suggests that regardless of working memory capacity, subjects participating in mind-wandering experiments report more mind-wandering when bored, stressed, or unhappy.

Executive functions

Executive functions (EFs) are cognitive processes that make a person pay attention or concentrate on a task. Three executive functions that relate to memory are inhibiting, updating and shifting. Inhibiting controls a person's attention and thoughts when distractions are abundant. Updating reviews old information and replaces it with new information in the working memory. Shifting controls the ability to go between multiple tasks. All three EFs have a relationship to mind-wandering.

Executive functions have roles in attention problems, attention control, thought control, and working memory capacity. Attention problems relate to behavioral problems such as inattention, impulsivity, and hyperactivity. These behaviors make staying on task difficult, leading to more mind-wandering. Higher inhibiting and updating abilities correlate with lower levels of attention problems in adolescence.

The inhibiting executive function controls attention and thought. The failure of cognitive inhibition is a direct cause of mind-wandering. Mind-wandering is also connected to working memory capacity (WMC). People with higher WMC mind-wander less on high-concentration tasks regardless of their boredom levels. People with low WMC are better at staying on task for low-concentration tasks, but once the task increases in difficulty they have a harder time keeping their thoughts focused on it.

Updating takes place in working memory; therefore, those with low WMC have a weaker updating executive function. A poorly performing updating function can thus be an indicator of frequent mind-wandering. Working memory relies on executive functions, with mind-wandering as an indicator of their failure. Task-unrelated thoughts (TUTs) are empirical behavioral manifestations of mind-wandering in a person: the longer a task is performed, the more TUTs are reported. Mind-wandering is an indication of an executive-control failure that is characterized by TUTs.

Metacognition serves to correct the wandering mind, suppressing spontaneous thoughts and bringing attention back to more "worthwhile" tasks.

Fidgeting

Paul Seli and colleagues have shown that spontaneous mind-wandering is associated with increased fidgeting; by contrast, interest, attention, and visual engagement lead to non-instrumental movement inhibition. One possible application of this phenomenon is that detecting non-instrumental movements may serve as an indicator of attention or boredom in computer-aided learning.

Traditionally, teachers and students have viewed fidgeting as a sign of diminished attention, as summarized by the statement, “Concentration of consciousness, and concentration of movements, diffusion of ideas and diffusion of movements go together.” However, James Farley and colleagues have proposed that fidgeting is not only an indicator of spontaneous mind-wandering but also a subconscious attempt to increase arousal in order to improve attention and thus reduce mind-wandering.

Introduction to entropy

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Introduct...