A Medley of Potpourri

Tuesday, October 22, 2019

Spurious relationship

From Wikipedia, the free encyclopedia

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "common response variable", "confounding factor", or "lurking variable").

Examples

A well-known case of a spurious relationship can be found in the time-series literature, where a spurious regression is a regression that provides misleading statistical evidence of a linear relationship between independent non-stationary variables. In fact, the non-stationarity may be due to the presence of a unit root in both variables. In particular, any two nominal economic variables are likely to be correlated with each other, even when neither has a causal effect on the other, because each equals a real variable times the price level, and the common presence of the price level in the two data series imparts correlation to them.

An example of a spurious relationship can be seen by examining a city's ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Another commonly noted example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; they were correlated with each other only because they were correlated with the weather nine months before the observations. However Höfer et al. (2004) showed the correlation to be stronger than just weather variations as he could show in post reunification Germany that, while the number of clinical deliveries was not linked with the rise in stork population, out of hospital deliveries correlated with the stork population.

In rare cases, a spurious relationship can occur between two completely unrelated variables without any confounding variable, as was the case between the success of the Washington Redskins professional football team in a specific game before each presidential election and the success of the incumbent President's political party in said election. For 16 consecutive elections between 1940 and 2000, the Redskins Rule correctly matched whether the incumbent President's political party would retain or lose the Presidency. The rule eventually failed shortly after Elias Sports Bureau discovered the correlation in 2000; in 2004, 2012 and 2016, the results of the Redskins game and the election did not match.

Hypothesis testing

Often one tests a null hypothesis of no correlation between two variables, and chooses in advance to reject the hypothesis if the correlation computed from a data sample would have occurred in less than (say) 5% of data samples if the null hypothesis were true. While a true null hypothesis will be accepted 95% of the time, the other 5% of the times having a true null of no correlation a zero correlation will be wrongly rejected, causing acceptance of a correlation which is spurious (an event known as Type I error). Here the spurious correlation in the sample resulted from random selection of a sample that did not reflect the true properties of the underlying population.

Detecting spurious relationships

The term "spurious relationship" is commonly used in statistics and in particular in experimental research techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X and W → Y). Mediating variables, (X → W → Y), if undetected, estimate a total effect rather than direct effect without adjustment for the mediating variable M. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.

Experiments

In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. But to help in ruling out the presence of a confounding variable, another culture is subjected to conditions that are as nearly identical as possible to those facing the first-mentioned culture, but the second culture is not subjected to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of efficacy of the drug can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.

Non-experimental statistical analyses

Disciplines whose data are mostly non-experimental, such as economics, usually employ observational data to establish causal relationships. The body of statistical techniques used in economics is called econometrics. The main statistical method in econometrics is multivariable regression analysis. Typically a linear relationship such as

y=a_{0}+a_{1}x_{1}+a_{2}x_{2}+\cdots +a_{k}x_{k}+e

is hypothesized, in which

y

is the dependent variable (hypothesized to be the caused variable),

x_{j}

for j = 1, ..., k is the j^th independent variable (hypothesized to be a causative variable), and

e

is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the

x_{j}

s is caused by y, then estimates of the coefficients

a_{j}

are obtained. If the null hypothesis that

a_{j}=0

is rejected, then the alternative hypothesis that

a_{j}\neq 0

and equivalently that

x_{j}

causes y cannot be rejected. On the other hand, if the null hypothesis that

a_{j}=0

cannot be rejected, then equivalently the hypothesis of no causal effect of

x_{j}

on y cannot be rejected. Here the notion of causality is one of contributory causality: If the true value

a_{j}\neq 0

, then a change in

x_{j}

will result in a change in y unless some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in

x_{j}

is not sufficient to change y. Likewise, a change in

x_{j}

is not necessary to change y, because a change in y could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).

Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid mistaken inference of causality due to the presence of a third, underlying, variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multivariate regression helps to avoid wrongly inferring that an indirect effect of, say x₁ (e.g., x₁ → x₂ → y) is a direct effect (x₁ → y).

Just as an experimenter must be careful to employ an experimental design that controls for every confounding factor, so also must the user of multiple regression be careful to control for all confounding factors by including them among the regressors. If a confounding factor is omitted from the regression, its effect is captured in the error term by default, and if the resulting error term is correlated with one (or more) of the included regressors, then the estimated regression may be biased or inconsistent (see omitted variable bias).

In addition to regression analysis, the data can be examined to determine if Granger causality exists. The presence of Granger causality indicates both that x precedes y, and that x contains unique information about y.

Other relationships

There are several other relationships defined in statistical analysis as follows.

All models are wrong

From Wikipedia, the free encyclopedia

"All models are wrong" is a common aphorism in statistics; it is often expanded as "All models are wrong, but some are useful". It is usually considered to be applicable to not only statistical models, but to scientific models generally. The aphorism is generally attributed to the statistician George Box, although the underlying concept predates Box's writings.

Quotations of George Box

The first record of Box saying "all models are wrong" is in a 1976 paper published in the Journal of the American Statistical Association. The 1976 paper contains the aphorism twice. The two sections of the paper that contain the aphorism are copied below.

2.3 Parsimony
Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
2.4 Worrying Selectively
Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.

Box repeated the aphorism in a paper that was published in the proceedings of a 1978 statistics workshop. The paper contains a section entitled "All models are wrong but some are useful". The section is copied below.

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.
For such a model there is no need to ask the question "Is the model true?". If "truth" is to be the "whole truth" the answer must be "No". The only question of interest is "Is the model illuminating and useful?".

Box repeated the aphorism twice more in his 1987 book, Empirical Model-Building and Response Surfaces (which was co-authored with Norman Draper). The first repetition is on p. 74: "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." The second repetition is on p. 424, which is excerpted below.

... all models are approximations. Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind ...

A second edition of the book was published in 2007, under the title Response Surfaces, Mixtures, and Ridge Analyses. The second edition also repeats the aphorism twice, in contexts identical with those of the first edition (on p. 63 and p. 414).

Box repeated the aphorism two more times in his 1997 book, Statistical Control: By Monitoring and Feedback Adjustment (which was co-authored with Alberto Luceño). The first repetition is on p. 6, which is excerpted below.

It has been said that "all models are wrong but some models are useful." In other words, any model is at best a useful fiction—there never was, or ever will be, an exactly normal distribution or an exact linear relationship. Nevertheless, enormous progress has been made by entertaining such fictions and using them as approximations.

The second repetition is on p. 9: "So since all models are wrong, it is very important to know what to worry about; or, to put it in another way, what models are likely to produce procedures that work in practice (where exact assumptions are never true)".

A second edition of the book was published in 2009, under the title Statistical Control By Monitoring and Adjustment (co-authored with Alberto Luceño and Maria del Carmen Paniagua-Quiñones). The second edition also repeats the aphorism two times. The first repetition is on p. 61, which is excerpted below.

All models are approximations. Assumptions, whether implied or clearly stated, are never exactly true. All models are wrong, but some models are useful. So the question you need to ask is not "Is the model true?" (it never is) but "Is the model good enough for this particular application?"

The second repetition is on p. 63; its context is essentially the same as that of the second repetition in the first edition.

Box's widely cited book Statistics for Experimenters (co-authored with William Hunter) does not include the aphorism in its first edition (published in 1978). The second edition (published in 2005; co-authored with William Hunter and J. Stuart Hunter) includes the aphorism three times: on p. 208, p. 384, and p. 440. On p. 440, the relevant sentence is this: "The most that can be expected from any model is that it can supply a useful approximation to reality: All models are wrong; some models are useful".

In addition to stating the aphorism verbatim, Box sometimes stated the essence of the aphorism with different words. One example is from 1978, while Box was President of the American Statistical Association. At the annual meeting of the Association, Box delivered his Presidential Address, wherein he stated this: "Models, of course, are never true, but fortunately it is only necessary that they be useful".

Discussions

There have been varied discussions about the aphorism. A selection from those discussions is presented below.

In 1983, the statisticians Peter McCullagh and John Nelder published their much-cited book on generalized linear models. The book includes a brief discussion of the aphorism (though without citing Box). A second edition of the book, published in 1989, contains a very similar discussion of the aphorism. The discussion from the first edition is as follows.

Modelling in science remains, partly at least, an art. Some principles do exist, however, to guide the modeller. The first is that all models are wrong; some, though, are better than others and we can search for the better ones. At the same time we must recognize that eternal truth is not within our grasp.

In 1995, the statistician Sir David Cox commented as follows.

... it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones, do not seem essentially different from other kinds of model.

In 1996, an Applied Statistician's Creed was proposed. The Creed includes, in its core part, the aphorism.

In 2002, K.P. Burnham and D.R. Anderson published their much-cited book on statistical model selection. The book states the following.

A model is a simplification or approximation of reality and hence will not reflect all of reality. ... Box noted that "all models are wrong, but some are useful." While a model can never be "truth," a model might be ranked from very useful, to useful, to somewhat useful to, finally, essentially useless.

The statistician J. Michael Steele has commented on the aphorism as follows.

... there are wonderful models — like city maps....
If I say that a map is wrong, it means that a building is misnamed, or the direction of a one-way street is mislabeled. I never expected my map to recreate all of physical reality, and I only feel ripped off if my map does not correctly answer the questions that it claims to answer.
My maps of Philadelphia are useful. Moreover, except for a few that are out-of-date, they are not wrong.
So, you say, "Yes, a map can be thought of as a model, but surely it would be more precise to say that a map is a 'visually enhanced database.' Such databases can be correct. These are not the kinds of models that Box had in mind."
I agree. ...

In 2008, the statistician Andrew Gelman responded to that, saying in particular the following.

I take his general point, which is that a street map could be exactly correct, to the resolution of the map.
... The saying, "all models are wrong," is helpful because it is not completely obvious....
This is a simple point, and I can see how Steele can be irritated by people making a big point about it. But, the trouble is, many people don't realize that all models are wrong.

In 2013, the philosopher of science Peter Truran published an essay related to the aphorism. The essay notes, in particular, the following.

... seemingly incompatible models may be used to make predictions about the same phenomenon. ... For each model we may believe that its predictive power is an indication of its being at least approximately true. But if both models are successful in making predictions, and yet mutually inconsistent, how can they both be true? Let us consider a simple illustration. Two observers are looking at a physical object. One may report seeing a circular disc, and the other may report seeing a rectangle. Both will be correct, but one will be looking at the object (a cylindrical can) from above and the other will be observing from the side. The two models represent different aspects of the same reality.

Truran's essay further notes that Newton's theory of gravitation has been supplanted by Einstein's theory of relativity and yet Newton's theory remains generally "empirically adequate". Indeed, Newton's theory generally has excellent predictive power. Yet Newton's theory is not an approximation of Einstein's theory. For illustration, consider an apple falling down from a tree. Under Newton's theory, the apple falls because Earth exerts a force on the apple—what is called "the force of gravity". Under Einstein's theory, Earth does not exert any force on the apple. Hence, Newton's theory might be regarded as being, in some sense, completely wrong but extremely useful. (The usefulness of Newton's theory comes partly from being vastly simpler, both mathematically and computationally, than Einstein's theory.)

In 2014, the statistician David Hand made the following statement.

In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality—a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful".

In 2016, P.J. Bickel and K.A. Doksum published the second volume of their book on mathematical statistics. The volume includes the quote from Box's Presidential Address, given above. It states that the quote is the best formulation of the "guiding principle of modern statistics".

Additionally, in 2011, a workshop on model selection was held in The Netherlands. The name of the workshop was "All models are wrong...".

Historical antecedents

Although the aphorism seems to have originated with George Box, the underlying concept goes back decades, perhaps centuries. Some exemplifications of that are given below.

In 1960, Georg Rasch said the following.

… no models are [true]—not even the Newtonian laws. When you construct a model you leave out all the details which you, with the knowledge at your disposal, consider inessential…. Models should not be true, but it is important that they are applicable, and whether they are applicable for any given purpose must of course be investigated. This also means that a model is never accepted finally, only on trial.

— Rasch, G. (1960), Probabilistic Models for Some Intelligence and Attainment Tests, Copenhagen: Danmarks Paedagogiske Institut, pp. 37–38; republished in 1980 by University of Chicago Press

In 1947, the mathematician John von Neumann said that "truth … is much too complicated to allow anything but approximations".

In 1942, the French philosopher-poet Paul Valéry said the following.

Ce qui est simple est toujours faux. Ce qui ne l’est pas est inutilisable.

What is simple is always wrong. What is not is unusable.

—Valéry, Paul (1942), Mauvaises pensées et autres, Paris: Éditions Gallimard

In 1939, the founder of statistical process control, Walter Shewhart, said the following.

… no model can ever be theoretically attainable that will completely and uniquely characterize the indefinitely expansible concept of a state of statistical control. What is perhaps even more important, on the basis of a finite portion of the sequence [X₁, X₂, X₃, …]—and we can never have more than a finite portion—we can not reasonably hope to construct a model that will represent exactly any specific characteristic of a particular state of control even though such a state actually exists. Here the situation is much like that in physical science where we find a model of a molecule; any model is always an incomplete though useful picture of the conceived physical thing called a molecule.

— Shewhart, W. A. (1939), Statistical Method From the Viewpoint of Quality Control, U.S. Department of Agriculture, p. 19

In 1923, a related idea was articulated by the artist Pablo Picasso.

We all know that art is not truth. Art is a lie that makes us realize truth, at least the truth that is given us to understand. The artist must know the manner whereby to convince others of the truthfulness of his lies.

— Picasso, Pablo (1923), "Picasso speaks", The Arts, 3: 315–326; reprinted in Barr, Alfred H., Jr. (1939), Picasso: Forty Years of his Art (PDF), Museum of Modern Art, pp. 9–12

Mathematical model

From Wikipedia, the free encyclopedia

A mathematical model is a description of a system using mathematical concepts and language. The process of developing a mathematical model is termed mathematical modeling. Mathematical models are used in the natural sciences (such as physics, biology, earth science, chemistry) and engineering disciplines (such as computer science, electrical engineering), as well as in the social sciences (such as economics, psychology, sociology, political science).

A model may help to explain a system and to study the effects of different components, and to make predictions about behaviour.

Elements of a mathematical model

Mathematical models can take many forms, including dynamical systems, statistical models, differential equations, or game theoretic models. These and other types of models can overlap, with a given model involving a variety of abstract structures. In general, mathematical models may include logical models. In many cases, the quality of a scientific field depends on how well the mathematical models developed on the theoretical side agree with results of repeatable experiments. Lack of agreement between theoretical mathematical models and experimental measurements often leads to important advances as better theories are developed.

In the physical sciences, a traditional mathematical model contains most of the following elements:

Governing equations
Supplementary sub-models
1. Defining equations
2. Constitutive equations
Assumptions and constraints
1. Initial and boundary conditions
2. Classical constraints and kinematic equations

Classifications

Mathematical models are usually composed of relationships and variables. Relationships can be described by operators, such as algebraic operators, functions, differential operators, etc. Variables are abstractions of system parameters of interest, that can be quantified. Several classification criteria can be used for mathematical models according to their structure:

Linear vs. nonlinear: If all the operators in a mathematical model exhibit linearity, the resulting mathematical model is defined as linear. A model is considered to be nonlinear otherwise. The definition of linearity and nonlinearity is dependent on context, and linear models may have nonlinear expressions in them. For example, in a statistical linear model, it is assumed that a relationship is linear in the parameters, but it may be nonlinear in the predictor variables. Similarly, a differential equation is said to be linear if it can be written with linear differential operators, but it can still have nonlinear expressions in it. In a mathematical programming model, if the objective functions and constraints are represented entirely by linear equations, then the model is regarded as a linear model. If one or more of the objective functions or constraints are represented with a nonlinear equation, then the model is known as a nonlinear model.
Nonlinearity, even in fairly simple systems, is often associated with phenomena such as chaos and irreversibility. Although there are exceptions, nonlinear systems and models tend to be more difficult to study than linear ones. A common approach to nonlinear problems is linearization, but this can be problematic if one is trying to study aspects such as irreversibility, which are strongly tied to nonlinearity.
Static vs. dynamic: A dynamic model accounts for time-dependent changes in the state of the system, while a static (or steady-state) model calculates the system in equilibrium, and thus is time-invariant. Dynamic models typically are represented by differential equations or difference equations.
Explicit vs. implicit: If all of the input parameters of the overall model are known, and the output parameters can be calculated by a finite series of computations, the model is said to be explicit. But sometimes it is the output parameters which are known, and the corresponding inputs must be solved for by an iterative procedure, such as Newton's method (if the model is linear) or Broyden's method (if non-linear). In such a case the model is said to be implicit. For example, a jet engine's physical properties such as turbine and nozzle throat areas can be explicitly calculated given a design thermodynamic cycle (air and fuel flow rates, pressures, and temperatures) at a specific flight condition and power setting, but the engine's operating cycles at other flight conditions and power settings cannot be explicitly calculated from the constant physical properties.
Discrete vs. continuous: A discrete model treats objects as discrete, such as the particles in a molecular model or the states in a statistical model; while a continuous model represents the objects in a continuous manner, such as the velocity field of fluid in pipe flows, temperatures and stresses in a solid, and electric field that applies continuously over the entire model due to a point charge.
Deterministic vs. probabilistic (stochastic): A deterministic model is one in which every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables; therefore, a deterministic model always performs the same way for a given set of initial conditions. Conversely, in a stochastic model—usually called a "statistical model"—randomness is present, and variable states are not described by unique values, but rather by probability distributions.
Deductive, inductive, or floating: A deductive model is a logical structure based on a theory. An inductive model arises from empirical findings and generalization from them. The floating model rests on neither theory nor observation, but is merely the invocation of expected structure. Application of mathematics in social sciences outside of economics has been criticized for unfounded models. Application of catastrophe theory in science has been characterized as a floating model.

Significance in the natural sciences

Mathematical models are of great importance in the natural sciences, particularly in physics. Physical theories are almost invariably expressed using mathematical models.

Throughout history, more and more accurate mathematical models have been developed. Newton's laws accurately describe many everyday phenomena, but at certain limits theory of relativity and quantum mechanics must be used. Though even these theories can't model or explain all phenomena themselves or together, such as black holes. It is possible to obtain the less accurate models in appropriate limits, for example relativistic mechanics reduces to Newtonian mechanics at speeds much less than the speed of light. Quantum mechanics reduces to classical physics when the quantum numbers are high. For example, the de Broglie wavelength of a tennis ball is insignificantly small, so classical physics is a good approximation to use in this case.

It is common to use idealized models in physics to simplify things. Massless ropes, point particles, ideal gases and the particle in a box are among the many simplified models used in physics. The laws of physics are represented with simple equations such as Newton's laws, Maxwell's equations and the Schrödinger equation. These laws are a basis for making mathematical models of real situations. Many real situations are very complex and thus modeled approximate on a computer, a model that is computationally feasible to compute is made from the basic laws or from approximate models made from the basic laws. For example, molecules can be modeled by molecular orbital models that are approximate solutions to the Schrödinger equation. In engineering, physics models are often made by mathematical methods such as finite element analysis.

Different mathematical models use different geometries that are not necessarily accurate descriptions of the geometry of the universe. Euclidean geometry is much used in classical physics, while special relativity and general relativity are examples of theories that use geometries which are not Euclidean.

Some applications

Since prehistorical times simple models such as maps and diagrams have been used.

Often when engineers analyze a system to be controlled or optimized, they use a mathematical model. In analysis, engineers can build a descriptive model of the system as a hypothesis of how the system could work, or try to estimate how an unforeseeable event could affect the system. Similarly, in control of a system, engineers can try out different control approaches in simulations.

A mathematical model usually describes a system by a set of variables and a set of equations that establish relationships between the variables. Variables may be of many types; real or integer numbers, boolean values or strings, for example. The variables represent some properties of the system, for example, measured system outputs often in the form of signals, timing data, counters, and event occurrence (yes/no). The actual model is the set of functions that describe the relations between the different variables.

Building blocks

In business and engineering, mathematical models may be used to maximize a certain output. The system under consideration will require certain inputs. The system relating inputs to outputs depends on other variables too: decision variables, state variables, exogenous variables, and random variables.

Decision variables are sometimes known as independent variables. Exogenous variables are sometimes known as parameters or constants. The variables are not independent of each other as the state variables are dependent on the decision, input, random, and exogenous variables. Furthermore, the output variables are dependent on the state of the system (represented by the state variables).

Objectives and constraints of the system and its users can be represented as functions of the output variables or state variables. The objective functions will depend on the perspective of the model's user. Depending on the context, an objective function is also known as an index of performance, as it is some measure of interest to the user. Although there is no limit to the number of objective functions and constraints a model can have, using or optimizing the model becomes more involved (computationally) as the number increases.

For example, economists often apply linear algebra when using input-output models. Complicated mathematical models that have many variables may be consolidated by use of vectors where one symbol represents several variables.

A priori information

To analyse something with a typical "black box approach", only the behavior of the stimulus/response will be accounted for, to infer the (unknown) box. The usual representation of this black box system is a data flow diagram centered in the box.

Mathematical modeling problems are often classified into black box or white box models, according to how much a priori information on the system is available. A black-box model is a system of which there is no a priori information available. A white-box model (also called glass box or clear box) is a system where all necessary information is available. Practically all systems are somewhere between the black-box and white-box models, so this concept is useful only as an intuitive guide for deciding which approach to take.

Usually it is preferable to use as much a priori information as possible to make the model more accurate. Therefore, the white-box models are usually considered easier, because if you have used the information correctly, then the model will behave correctly. Often the a priori information comes in forms of knowing the type of functions relating different variables. For example, if we make a model of how a medicine works in a human system, we know that usually the amount of medicine in the blood is an exponentially decaying function. But we are still left with several unknown parameters; how rapidly does the medicine amount decay, and what is the initial amount of medicine in blood? This example is therefore not a completely white-box model. These parameters have to be estimated through some means before one can use the model.

In black-box models one tries to estimate both the functional form of relations between variables and the numerical parameters in those functions. Using a priori information we could end up, for example, with a set of functions that probably could describe the system adequately. If there is no a priori information we would try to use functions as general as possible to cover all different models. An often used approach for black-box models are neural networks which usually do not make assumptions about incoming data. Alternatively the NARMAX (Nonlinear AutoRegressive Moving Average model with eXogenous inputs) algorithms which were developed as part of nonlinear system identification can be used to select the model terms, determine the model structure, and estimate the unknown parameters in the presence of correlated and nonlinear noise. The advantage of NARMAX models compared to neural networks is that NARMAX produces models that can be written down and related to the underlying process, whereas neural networks produce an approximation that is opaque.

Subjective information

Sometimes it is useful to incorporate subjective information into a mathematical model. This can be done based on intuition, experience, or expert opinion, or based on convenience of mathematical form. Bayesian statistics provides a theoretical framework for incorporating such subjectivity into a rigorous analysis: we specify a prior probability distribution (which can be subjective), and then update this distribution based on empirical data.

An example of when such approach would be necessary is a situation in which an experimenter bends a coin slightly and tosses it once, recording whether it comes up heads, and is then given the task of predicting the probability that the next flip comes up heads. After bending the coin, the true probability that the coin will come up heads is unknown; so the experimenter would need to make a decision (perhaps by looking at the shape of the coin) about what prior distribution to use. Incorporation of such subjective information might be important to get an accurate estimate of the probability.

Complexity

In general, model complexity involves a trade-off between simplicity and accuracy of the model. Occam's razor is a principle particularly relevant to modeling, its essential idea being that among models with roughly equal predictive power, the simplest one is the most desirable. While added complexity usually improves the realism of a model, it can make the model difficult to understand and analyze, and can also pose computational problems, including numerical instability. Thomas Kuhn argues that as science progresses, explanations tend to become more complex before a paradigm shift offers radical simplification.

For example, when modeling the flight of an aircraft, we could embed each mechanical part of the aircraft into our model and would thus acquire an almost white-box model of the system. However, the computational cost of adding such a huge amount of detail would effectively inhibit the usage of such a model. Additionally, the uncertainty would increase due to an overly complex system, because each separate part induces some amount of variance into the model. It is therefore usually appropriate to make some approximations to reduce the model to a sensible size. Engineers often can accept some approximations in order to get a more robust and simple model. For example, Newton's classical mechanics is an approximated model of the real world. Still, Newton's model is quite sufficient for most ordinary-life situations, that is, as long as particle speeds are well below the speed of light, and we study macro-particles only.

Training and tuning

Any model which is not pure white-box contains some parameters that can be used to fit the model to the system it is intended to describe. If the modeling is done by an artificial neural network or other machine learning, the optimization of parameters is called training, while the optimization of model hyperparameters is called tuning and often uses cross-validation. In more conventional modeling through explicitly given mathematical functions, parameters are often determined by curve fitting.

Model evaluation

A crucial part of the modeling process is the evaluation of whether or not a given mathematical model describes a system accurately. This question can be difficult to answer as it involves several different types of evaluation.

Fit to empirical data

Usually the easiest part of model evaluation is checking whether a model fits experimental measurements or other empirical data. In models with parameters, a common approach to test this fit is to split the data into two disjoint subsets: training data and verification data. The training data are used to estimate the model parameters. An accurate model will closely match the verification data even though these data were not used to set the model's parameters. This practice is referred to as cross-validation in statistics.

Defining a metric to measure distances between observed and predicted data is a useful tool of assessing model fit. In statistics, decision theory, and some economic models, a loss function plays a similar role.

While it is rather straightforward to test the appropriateness of parameters, it can be more difficult to test the validity of the general mathematical form of a model. In general, more mathematical tools have been developed to test the fit of statistical models than models involving differential equations. Tools from non-parametric statistics can sometimes be used to evaluate how well the data fit a known distribution or to come up with a general model that makes only minimal assumptions about the model's mathematical form.

Scope of the model

Assessing the scope of a model, that is, determining what situations the model is applicable to, can be less straightforward. If the model was constructed based on a set of data, one must determine for which systems or situations the known data is a "typical" set of data.

The question of whether the model describes well the properties of the system between data points is called interpolation, and the same question for events or data points outside the observed data is called extrapolation.

As an example of the typical limitations of the scope of a model, in evaluating Newtonian classical mechanics, we can note that Newton made his measurements without advanced equipment, so he could not measure properties of particles travelling at speeds close to the speed of light. Likewise, he did not measure the movements of molecules and other small particles, but macro particles only. It is then not surprising that his model does not extrapolate well into these domains, even though his model is quite sufficient for ordinary life physics.

Philosophical considerations

Many types of modeling implicitly involve claims about causality. This is usually (but not always) true of models involving differential equations. As the purpose of modeling is to increase our understanding of the world, the validity of a model rests not only on its fit to empirical observations, but also on its ability to extrapolate to situations or data beyond those originally described in the model. One can think of this as the differentiation between qualitative and quantitative predictions. One can also argue that a model is worthless unless it provides some insight which goes beyond what is already known from direct investigation of the phenomenon being studied.

An example of such criticism is the argument that the mathematical models of optimal foraging theory do not offer insight that goes beyond the common-sense conclusions of evolution and other basic principles of ecology.

Examples

One of the popular examples in computer science is the mathematical models of various machines, an example is the deterministic finite automaton (DFA) which is defined as an abstract mathematical concept, but due to the deterministic nature of a DFA, it is implementable in hardware and software for solving various specific problems. For example, the following is a DFA M with a binary alphabet, which requires that the input contains an even number of 0s.

The state diagram for M

M = (Q, Σ, δ, q₀, F) where

Q = {S₁, S₂},
Σ = {0, 1},
q₀ = S₁,
F = {S₁}, and
δ is defined by the following state transition table:

	0	1
S₁	S₂	S₁
S₂	S₁	S₂

The state S₁ represents that there has been an even number of 0s in the input so far, while S₂ signifies an odd number. A 1 in the input does not change the state of the automaton. When the input ends, the state will show whether the input contained an even number of 0s or not. If the input did contain an even number of 0s, M will finish in state S₁, an accepting state, so the input string will be accepted.

The language recognized by M is the regular language given by the regular expression 1*( 0 (1*) 0 (1*) )*, where "*" is the Kleene star, e.g., 1* denotes any non-negative number (possibly zero) of symbols "1".

Many everyday activities carried out without a thought are uses of mathematical models. A geographical map projection of a region of the earth onto a small, plane surface is a model^[7] which can be used for many purposes such as planning travel.
Another simple activity is predicting the position of a vehicle from its initial position, direction and speed of travel, using the equation that distance traveled is the product of time and speed. This is known as dead reckoning when used more formally. Mathematical modeling in this way does not necessarily require formal mathematics; animals have been shown to use dead reckoning.
Population Growth. A simple (though approximate) model of population growth is the Malthusian growth model. A slightly more realistic and largely used population growth model is the logistic function, and its extensions.
Model of a particle in a potential-field. In this model we consider a particle as being a point of mass which describes a trajectory in space which is modeled by a function giving its coordinates in space as a function of time. The potential field is given by a function $V\!:\mathbb {R} ^{3}\!\rightarrow \mathbb {R}$ and the trajectory, that is a function $\mathbf {r} \!:\mathbb {R} \rightarrow \mathbb {R} ^{3}$ , is the solution of the differential equation:

{\displaystyle -{\frac {\mathrm {d} ^{2}\mathbf {r} (t)}{\mathrm {d} t^{2}}}m={\frac {\partial V[\mathbf {r} (t)]}{\partial x}}\mathbf {\hat {x}} +{\frac {\partial V[\mathbf {r} (t)]}{\partial y}}\mathbf {\hat {y}} +{\frac {\partial V[\mathbf {r} (t)]}{\partial z}}\mathbf {\hat {z}} ,}

that can be written also as:

m{\frac {\mathrm {d} ^{2}\mathbf {r} (t)}{\mathrm {d} t^{2}}}=-\nabla V[\mathbf {r} (t)].

Note this model assumes the particle is a point mass, which is certainly known to be false in many cases in which we use this model; for example, as a model of planetary motion.

Model of rational behavior for a consumer. In this model we assume a consumer faces a choice of n commodities labeled 1,2,...,n each with a market price p₁, p₂,..., p_n. The consumer is assumed to have an ordinal utility function U (ordinal in the sense that only the sign of the differences between two utilities, and not the level of each utility, is meaningful), depending on the amounts of commodities x₁, x₂,..., x_n consumed. The model further assumes that the consumer has a budget M which is used to purchase a vector x₁, x₂,..., x_n in such a way as to maximize U(x₁, x₂,..., x_n). The problem of rational behavior in this model then becomes an optimization problem, that is:

\max U(x_{1},x_{2},\ldots ,x_{n})

subject to:

\sum _{i=1}^{n}p_{i}x_{i}\leq M.

x_{i}\geq 0\;\;\;\forall i\in \{1,2,\ldots ,n\}

This model has been used in a wide variety of economic contexts, such as in general equilibrium theory to show existence and Pareto efficiency of economic equilibria.

Neighbour-sensing model explains the mushroom formation from the initially chaotic fungal network.
Computer science: models in Computer Networks, data models, surface model,...
Mechanics: movement of rocket model,...

A Medley of Potpourri

Search This Blog

Tuesday, October 22, 2019

Spurious relationship

Examples

Hypothesis testing

Detecting spurious relationships

Experiments

Non-experimental statistical analyses

Other relationships

All models are wrong

Quotations of George Box

Discussions

Historical antecedents

Mathematical model

Elements of a mathematical model

Classifications

Significance in the natural sciences

Some applications

Building blocks

A priori information

Subjective information

Complexity

Training and tuning

Model evaluation

Fit to empirical data

Scope of the model

Philosophical considerations

Examples

Cryogenics

Followers

Total Pageviews