
Monday, December 4, 2023

Statistical model

From Wikipedia, the free encyclopedia

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is probabilistic model.

A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).

All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference.

Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6).

The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model: because with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: because with the assumption alone, we cannot calculate the probability of every event.
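To make this concrete, here is a minimal Python sketch (not part of the original article) that enumerates the 36 equally likely outcomes implied by the first assumption and computes the probability of a few illustrative events:

from fractions import Fraction
from itertools import product

# Under the first assumption, each face of each die has probability 1/6,
# so all 36 ordered outcomes are equally likely.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on an outcome (d1, d2)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

print(prob(lambda o: o == (5, 5)))    # 1/36: both dice come up 5
print(prob(lambda o: sum(o) == 7))    # 6/36 = 1/6
print(prob(lambda o: o[0] != o[1]))   # 30/36 = 5/6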

In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.

Formal definition

In mathematical terms, a statistical model is usually thought of as a pair (S, 𝒫), where S is the set of possible observations, i.e. the sample space, and 𝒫 is a set of probability distributions on S.

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose 𝒫 to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution.

Note that we do not require that 𝒫 contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"—hence the saying "all models are wrong".

The set 𝒫 is almost always parameterized: 𝒫 = {P_θ : θ ∈ Θ}. The set Θ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. P_θ1 = P_θ2 ⇒ θ1 = θ2 must hold (in other words, the mapping θ ↦ P_θ must be injective). A parameterization that meets the requirement is said to be identifiable.

An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: heighti = b0 + b1 agei + εi, where b0 is the intercept, b1 is a parameter that age is multiplied by to obtain a prediction of height, εi is the error term, and i identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line (heighti = b0 + b1 agei) cannot be the equation for a model of the data—unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, εi, must be included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the εi. For instance, we might assume that the εi distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: b0, b1, and the variance of the Gaussian distribution.
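As an illustration of this three-parameter model, the following sketch simulates hypothetical (age, height) data from invented values of b0, b1, and σ, and then estimates the parameters by least squares; the numbers are assumptions chosen only for the example:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from heighti = b0 + b1 agei + εi,
# with εi i.i.d. Gaussian with zero mean (the coefficients below are invented).
age = rng.uniform(2, 15, size=200)
true_b0, true_b1, true_sigma = 0.75, 0.055, 0.06      # metres, metres/year, metres
height = true_b0 + true_b1 * age + rng.normal(0.0, true_sigma, size=age.size)

# Estimate the three parameters: b0, b1, and the variance of the Gaussian errors.
X = np.column_stack([np.ones_like(age), age])
(b0_hat, b1_hat), *_ = np.linalg.lstsq(X, height, rcond=None)
residuals = height - (b0_hat + b1_hat * age)
sigma2_hat = residuals.var(ddof=2)                    # adjusts for the 2 fitted coefficients

print(b0_hat, b1_hat, sigma2_hat)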

We can formally specify the model in the form (S, 𝒫) as follows. The sample space, S, of our model comprises the set of all possible pairs (age, height). Each possible value of θ = (b0, b1, σ2) determines a distribution on S; denote that distribution by P_θ. If Θ is the set of all possible values of θ, then 𝒫 = {P_θ : θ ∈ Θ}. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying S and (2) making some assumptions relevant to 𝒫. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify 𝒫—as they are required to do.

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.

Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).

Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".

There are three purposes for a statistical model, according to Konishi & Kitagawa.

  • Predictions
  • Extraction of information
  • Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description. The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.

Dimension of a model

Suppose that we have a statistical model (S, 𝒫) with 𝒫 = {P_θ : θ ∈ Θ}. In notation, we write Θ ⊆ ℝ^k, where k is a positive integer (ℝ denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model. The model is said to be parametric if Θ has finite dimension.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

𝒫 = { P(μ, σ) : μ ∈ ℝ, σ > 0 }.

In this example, the dimension, k, equals 2.

As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)

Although formally θ is a single parameter that has dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, θ = (μ, σ) is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters—the mean μ and the standard deviation σ.

A statistical model is nonparametric if the parameter set Θ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of Θ and n is the number of samples, both semiparametric and nonparametric models have k → ∞ as n → ∞. If k/n → 0 as n → ∞, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".

Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

y = b0 + b1x + b2x2 + ε,    ε ~ 𝒩(0, σ2)

has, nested within it, the linear model

y = b0 + b1x + ε,    ε ~ 𝒩(0, σ2)

—we constrain the parameter b2 to equal 0.

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.

Comparing models

Comparing statistical models is fundamental for much of statistical inference. Indeed, Konishi & Kitagawa (2008, p. 75) state this: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."

Common criteria for comparing models include the following: R2, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
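As a rough illustration of such comparisons, the sketch below fits the nested linear and quadratic models from the previous section to synthetic data and compares them with AIC and a likelihood-ratio test; the data and the helper function are invented for the example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic (x, y) data with a modest quadratic component.
x = np.linspace(-2, 2, 120)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, 0.3, size=x.size)

def fit_gaussian_ols(X, y):
    """Least-squares fit with i.i.d. Gaussian errors; return max log-likelihood and number of parameters."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)                        # MLE of the error variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1.0)
    return loglik, X.shape[1] + 1                          # coefficients + variance

X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])

ll_lin, k_lin = fit_gaussian_ols(X_lin, y)
ll_quad, k_quad = fit_gaussian_ols(X_quad, y)

aic_lin, aic_quad = 2 * k_lin - 2 * ll_lin, 2 * k_quad - 2 * ll_quad
lr_stat = 2 * (ll_quad - ll_lin)                           # likelihood-ratio statistic
p_value = stats.chi2.sf(lr_stat, df=k_quad - k_lin)        # one constrained parameter (b2 = 0)

print(aic_lin, aic_quad, lr_stat, p_value)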

Volatile organic compound

From Wikipedia, the free encyclopedia

Volatile organic compounds (VOCs) are organic compounds that have a high vapor pressure at room temperature. High vapor pressure correlates with a low boiling point, which relates to the number of the sample's molecules in the surrounding air, a trait known as volatility.

VOCs are responsible for the odor of scents and perfumes as well as pollutants. VOCs play an important role in communication between animals and plants, e.g. attractants for pollinators, protection from predation, and even inter-plant interactions. Some VOCs are dangerous to human health or cause harm to the environment. Anthropogenic VOCs are regulated by law, especially indoors, where concentrations are the highest. Most VOCs are not acutely toxic, but may have long-term chronic health effects. Some VOCs have been used in pharmacy, while others are the target of administrative controls because of their recreational use.

Definitions

Diverse definitions of the term VOC are in use. Some examples are presented below.

Canada

Health Canada classifies VOCs as organic compounds that have boiling points roughly in the range of 50 to 250 °C (122 to 482 °F). The emphasis is placed on commonly encountered VOCs that would have an effect on air quality.

European Union

The European Union defines a VOC as "any organic compound as well as the fraction of creosote, having at 293.15 K a vapour pressure of 0,01 kPa or more, or having a corresponding volatility under the particular conditions of use". The VOC Solvents Emissions Directive was the main policy instrument for the reduction of industrial emissions of volatile organic compounds (VOCs) in the European Union. It covers a wide range of solvent-using activities, e.g. printing, surface cleaning, vehicle coating, dry cleaning and manufacture of footwear and pharmaceutical products. The VOC Solvents Emissions Directive requires installations in which such activities are applied to comply either with the emission limit values set out in the Directive or with the requirements of the so-called reduction scheme. Article 13 of The Paints Directive, approved in 2004, amended the original VOC Solvents Emissions Directive and limits the use of organic solvents in decorative paints and varnishes and in vehicle finishing products. The Paints Directive sets out maximum VOC content limit values for paints and varnishes in certain applications. The Solvents Emissions Directive was replaced by the Industrial Emissions Directive from 2013.

China

The People's Republic of China defines VOCs as those compounds that have "originated from automobiles, industrial production and civilian use, burning of all types of fuels, storage and transportation of oils, fitment finish, coating for furniture and machines, cooking oil fume and fine particles (PM 2.5)", and similar sources. The Three-Year Action Plan for Winning the Blue Sky Defence War, released by the State Council in July 2018, creates an action plan to reduce VOC emissions by 10% from 2015 levels by 2020.

India

The Central Pollution Control Board of India released the Air (Prevention and Control of Pollution) Act in 1981, amended in 1987, to address concerns about air pollution in India. While the document does not differentiate between VOCs and other air pollutants, the CPCB monitors "oxides of nitrogen (NOx), sulphur dioxide (SO2), fine particulate matter (PM10) and suspended particulate matter (SPM)".

United States

Thermal oxidizers provide an air pollution abatement option for VOCs from industrial air flows. A thermal oxidizer is an EPA-approved device to treat VOCs.

The definitions of VOCs used for control of precursors of photochemical smog used by the U.S. Environmental Protection Agency (EPA) and state agencies in the US with independent outdoor air pollution regulations include exemptions for VOCs that are determined to be non-reactive, or of low-reactivity in the smog formation process. Prominent is the VOC regulation issued by the South Coast Air Quality Management District in California and by the California Air Resources Board (CARB). However, this specific use of the term VOCs can be misleading, especially when applied to indoor air quality because many chemicals that are not regulated as outdoor air pollution can still be important for indoor air pollution.

Following a public hearing in September 1995, California's ARB uses the term "reactive organic gases" (ROG) to measure organic gases. The CARB revised the definition of "Volatile Organic Compounds" used in their consumer products regulations, based on the committee's findings.

In addition to drinking water, VOCs are regulated in pollutant discharges to surface waters (both directly and via sewage treatment plants) as hazardous waste, but not in non-industrial indoor air. The Occupational Safety and Health Administration (OSHA) regulates VOC exposure in the workplace. Volatile organic compounds that are classified as hazardous materials are regulated by the Pipeline and Hazardous Materials Safety Administration while being transported.

Biologically generated VOCs

Most VOCs in Earth's atmosphere are biogenic, largely emitted by plants.

Major biogenic VOCs
compound          relative contribution    amount emitted (Tg/y)
isoprene          62.2%                    594 ± 34
terpenes          10.9%                    95 ± 3
pinene isomers    5.6%                     48.7 ± 0.8
sesquiterpenes    2.4%                     20 ± 1
methanol          6.4%                     130 ± 4

Biogenic volatile organic compounds (BVOCs) encompass VOCs emitted by plants, animals, or microorganisms, and while extremely diverse, are most commonly terpenoids, alcohols, and carbonyls (methane and carbon monoxide are generally not considered). Not counting methane, biological sources emit an estimated 760 teragrams of carbon per year in the form of VOCs. The majority of VOCs are produced by plants, the main compound being isoprene. Small amounts of VOCs are produced by animals and microbes. Many VOCs are considered secondary metabolites, which often help organisms in defense, such as plant defense against herbivory. The strong odor emitted by many plants consists of green leaf volatiles, a subset of VOCs. Although intended for nearby organisms to detect and respond to, these volatiles can be detected and communicated through wireless electronic transmission, by embedding nanosensors and infrared transmitters into the plant materials themselves.

Emissions are affected by a variety of factors, such as temperature, which determines rates of volatilization and growth, and sunlight, which determines rates of biosynthesis. Emission occurs almost exclusively from the leaves, the stomata in particular. VOCs emitted by terrestrial forests are often oxidized by hydroxyl radicals in the atmosphere; in the absence of NOx pollutants, VOC photochemistry recycles hydroxyl radicals to create a sustainable biosphere-atmosphere balance. Due to recent climate change developments, such as warming and greater UV radiation, BVOC emissions from plants are generally predicted to increase, thus upsetting the biosphere-atmosphere interaction and damaging major ecosystems. A major class of VOCs is the terpene class of compounds, such as myrcene.

Providing a sense of scale, a forest 62,000 square kilometres (24,000 sq mi) in area, the size of the US state of Pennsylvania, is estimated to emit 3,400,000 kilograms (7,500,000 lb) of terpenes on a typical August day during the growing season. Induction of genes producing volatile organic compounds, and the subsequent increase in volatile terpenes, has been achieved in maize using (Z)-3-hexen-1-ol and other plant hormones.

Anthropogenic sources

The handling of petroleum-based fuels is a major source of VOCs.

Anthropogenic sources emit about 142 teragrams (1.42 × 10¹¹ kg) of carbon per year in the form of VOCs.

The major sources of man-made VOCs are:

  • Fossil fuel use and production, e.g. incompletely combusted fossil fuels or unintended evaporation of fuels. The most prevalent VOC is ethane, a relatively inert compound.
  • Solvents used in coatings, paints, and inks. Approximately 12 billion litres of paint are produced annually. Typical solvents include aliphatic hydrocarbons, ethyl acetate, glycol ethers, and acetone. Motivated by cost, environmental concerns, and regulation, the paint and coating industries are increasingly shifting toward aqueous solvents.
  • Compressed aerosol products, mainly butane and propane, are estimated to contribute 1.3 billion tonnes of VOC emissions per year globally.
  • Biofuel use, e.g., cooking oils in Asia and bioethanol in Brazil.
  • Biomass combustion, especially from rain forests. Although combustion principally releases carbon dioxide and water, incomplete combustion affords a variety of VOCs.

Indoor VOCs

Concentrations of VOCs in indoor air may be 2 to 5 times greater than in outdoor air, sometimes far greater. During certain activities, indoor levels of VOCs may reach 1,000 times that of the outside air. Studies have shown that emissions of individual VOC species are not that high in an indoor environment, but the total concentration of all VOCs (TVOC) indoors can be up to five times higher than that of outdoor levels.

New buildings experience particularly high levels of VOC off-gassing indoors because of the abundant new materials (building materials, fittings, surface coverings and treatments such as glues, paints and sealants) exposed to the indoor air, emitting multiple VOC gases. This off-gassing has a multi-exponential decay trend that is discernible over at least two years, with the most volatile compounds decaying with a time-constant of a few days, and the least volatile compounds decaying with a time-constant of a few years.
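As a rough illustration of such a multi-exponential decay, the sketch below combines a fast- and a slow-decaying emission component; the amplitudes and time constants are invented, not measured values:

import numpy as np

# Hypothetical two-component off-gassing model: a fast-decaying pool of very
# volatile compounds and a slow-decaying pool of less volatile ones.
t_days = np.arange(0, 730)                  # two years
fast = 400.0 * np.exp(-t_days / 5.0)        # time constant ~ a few days  (µg/m³, invented)
slow = 150.0 * np.exp(-t_days / 500.0)      # time constant ~ years       (µg/m³, invented)
tvoc = fast + slow

for day in (0, 7, 30, 180, 365, 729):
    print(f"day {day:4d}: TVOC ≈ {tvoc[day]:6.1f} µg/m³")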

New buildings may require intensive ventilation for the first few months, or a bake-out treatment. Existing buildings may be replenished with new VOC sources, such as new furniture, consumer products, and redecoration of indoor surfaces, all of which lead to a continuous background emission of TVOCs and require improved ventilation.

Numerous studies show strong seasonal variations in indoor VOC emissions, with emission rates increasing in summer, largely because the rate of diffusion of VOC species through materials to the surface increases with temperature. Most studies have shown that this leads to generally higher concentrations of TVOCs indoors in summer.

Indoor air quality measurements

Measurement of VOCs from indoor air is done with sorption tubes, e.g. Tenax (for VOCs and SVOCs), DNPH cartridges (for carbonyl compounds), or an air detector. The VOCs adsorb on these materials and are afterwards desorbed either thermally (Tenax) or by elution (DNPH) and then analyzed by GC-MS/FID or HPLC. Reference gas mixtures are required for quality control of these VOC measurements. Furthermore, VOC-emitting products used indoors, e.g. building products and furniture, are investigated in emission test chambers under controlled climatic conditions. For quality control of these measurements, round-robin tests are carried out; reproducibly emitting reference materials are therefore ideally required. Other methods have used proprietary Silcosteel-coated canisters with constant flow inlets to collect samples over several days. These methods are not limited by the adsorbing properties of materials like Tenax.

Regulation of indoor VOC emissions

In most countries, a separate definition of VOCs is used with regard to indoor air quality that comprises each organic chemical compound that can be measured as follows: adsorption from air on Tenax TA, thermal desorption, gas chromatographic separation over a 100% nonpolar column (dimethylpolysiloxane). VOC (volatile organic compounds) are all compounds that appear in the gas chromatogram between and including n-hexane and n-hexadecane. Compounds appearing earlier are called VVOC (very volatile organic compounds); compounds appearing later are called SVOC (semi-volatile organic compounds).

France, Germany (AgBB/DIBt), Belgium, Norway (TEK regulation), and Italy (CAM Edilizia) have enacted regulations to limit VOC emissions from commercial products. European industry has developed numerous voluntary ecolabels and rating systems, such as EMICODE, M1, Blue Angel, GuT (textile floor coverings), Nordic Swan Ecolabel, EU Ecolabel, and Indoor Air Comfort. In the United States, several standards exist; California Standard CDPH Section 01350 is the most common one. These regulations and standards changed the marketplace, leading to an increasing number of low-emitting products.

Health risks

Respiratory, allergic, or immune effects in infants or children are associated with man-made VOCs and other indoor or outdoor air pollutants.

Some VOCs, such as styrene and limonene, can react with nitrogen oxides or with ozone to produce new oxidation products and secondary aerosols, which can cause sensory irritation symptoms. VOCs contribute to the formation of tropospheric ozone and smog.

Health effects include eye, nose, and throat irritation; headaches, loss of coordination, nausea; and damage to the liver, kidney, and central nervous system. Some organics can cause cancer in animals; some are suspected or known to cause cancer in humans. Key signs or symptoms associated with exposure to VOCs include conjunctival irritation, nose and throat discomfort, headache, allergic skin reaction, dyspnea, declines in serum cholinesterase levels, nausea, vomiting, nose bleeding, fatigue, dizziness.

The ability of organic chemicals to cause health effects varies greatly from those that are highly toxic to those with no known health effects. As with other pollutants, the extent and nature of the health effect will depend on many factors including level of exposure and length of time exposed. Eye and respiratory tract irritation, headaches, dizziness, visual disorders, and memory impairment are among the immediate symptoms that some people have experienced soon after exposure to some organics. At present, not much is known about what health effects occur from the levels of organics usually found in homes.

Ingestion

While negligible in comparison to the concentrations found in indoor air, benzene, toluene, and methyl tert-butyl ether (MTBE) have been found in samples of human milk and add to the VOCs that we are exposed to throughout the day. A study notes the difference between VOCs in alveolar breath and inspired air, suggesting that VOCs are ingested, metabolized, and excreted via the extra-pulmonary pathway. VOCs are also ingested by drinking water in varying concentrations. Some VOC concentrations exceeded the EPA's National Primary Drinking Water Regulations and China's National Drinking Water Standards set by the Ministry of Ecology and Environment.

Dermal absorption

The presence of VOCs in the air and in groundwater has prompted more studies. Several studies have been performed to measure the effects of dermal absorption of specific VOCs. Dermal exposure to VOCs like formaldehyde and toluene downregulates antimicrobial peptides on the skin such as cathelicidin LL-37 and human β-defensins 2 and 3. Xylene and formaldehyde worsen allergic inflammation in animal models. Toluene also increases the dysregulation of filaggrin, a key protein in dermal regulation; in experiments on human skin samples, this was confirmed by immunofluorescence (protein loss) and western blotting (mRNA loss). Toluene exposure also decreased the water content of the trans-epidermal layer, leaving the skin's layers more vulnerable.

Limit values for VOC emissions

Limit values for VOC emissions into indoor air are published by AgBB, AFSSET, the California Department of Public Health, and others. These regulations have prompted several companies in the paint and adhesive industries to reduce the VOC levels in their products. VOC labels and certification programs may not properly assess all of the VOCs emitted from a product, including some chemical compounds that may be relevant for indoor air quality. Each ounce of colorant added to tint paint may contain between 5 and 20 grams of VOCs. A dark color, however, could require 5–15 ounces of colorant, adding up to 300 or more grams of VOCs per gallon of paint (e.g. 15 oz × 20 g/oz = 300 g).

VOCs in healthcare settings

VOCs are also found in hospital and health care environments. In these settings, these chemicals are widely used for cleaning, disinfection, and hygiene of the different areas. Thus, health professionals such as nurses, doctors, sanitation staff, etc., may present with adverse health effects such as asthma; however, further evaluation is required to determine the exact levels and determinants that influence the exposure to these compounds.

Studies have shown that the concentration levels of different VOCs such as halogenated and aromatic hydrocarbons differ substantially between areas of the same hospital. However, one of these studies reported that ethanol, isopropanol, ether, and acetone were the main compounds in the interior of the site. Along the same lines, a study conducted in the United States established that nursing assistants are the most exposed to compounds such as ethanol, while medical equipment preparers are most exposed to 2-propanol.

In relation to exposure to VOCs by cleaning and hygiene personnel, a study conducted in 4 hospitals in the United States established that sterilization and disinfection workers are linked to exposures to d-limonene and 2-propanol, while those responsible for cleaning with chlorine-containing products are more likely to have higher levels of exposure to α-pinene and chloroform. Those who perform floor and other surface cleaning tasks (e.g., floor waxing) and who use quaternary ammonium, alcohol, and chlorine-based products are associated with a higher VOC exposure than the two previous groups, that is, they are particularly linked to exposure to acetone, chloroform, α-pinene, 2-propanol or d-limonene.

Other healthcare environments such as nursing and aged care homes have rarely been a subject of study, even though the elderly and vulnerable populations may spend considerable time in these indoor settings, where they might be exposed to VOCs derived from the common use of cleaning agents, sprays, and fresheners. In a study conducted in France, a team of researchers developed an online questionnaire for different social and aged care facilities, asking about cleaning practices, products used, and the frequency of these activities. As a result, more than 200 chemicals were identified, of which 41 are known to have adverse health effects, 37 of them being VOCs. The health effects include skin sensitization, reproductive and organ-specific toxicity, carcinogenicity, mutagenicity, and endocrine-disrupting properties. Furthermore, another study carried out in the same country found a significant association between breathlessness in the elderly population and elevated exposure to VOCs such as toluene and o-xylene, unlike the remainder of the population.

Analytical methods

Sampling

Obtaining samples for analysis is challenging. VOCs, even when at dangerous levels, are dilute, so preconcentration is typically required. Many components of the atmosphere are mutually incompatible, e.g. ozone and organic compounds, peroxyacyl nitrates and many organic compounds. Furthermore, collection of VOCs by condensation in cold traps also accumulates a large amount of water, which generally must be removed selectively, depending on the analytical techniques to be employed. Solid-phase microextraction (SPME) techniques are used to collect VOCs at low concentrations for analysis. As applied to breath analysis, the following modalities are employed for sampling: gas sampling bags, syringes, evacuated steel and glass containers.

Principle and measurement methods

In the U.S., one standard method has been established by the National Institute for Occupational Safety and Health (NIOSH) and another by OSHA. Each method uses a single-component solvent; butanol and hexane cannot, however, be sampled on the same sample matrix using the NIOSH or OSHA method.

VOCs are quantified and identified by two broad techniques. The major technique is gas chromatography (GC). GC instruments allow the separation of gaseous components. When coupled to a flame ionization detector (FID), GCs can detect hydrocarbons at parts-per-trillion levels. Using electron capture detectors, GCs are also effective for organohalides such as chlorocarbons.

The second major technique associated with VOC analysis is mass spectrometry, which is usually coupled with GC, giving the hyphenated technique of GC-MS.

Direct injection mass spectrometry techniques are frequently utilized for the rapid detection and accurate quantification of VOCs. PTR-MS is among the methods that have been used most extensively for the on-line analysis of biogenic and anthropogenic VOCs. PTR-MS instruments based on time-of-flight mass spectrometry have been reported to reach detection limits of 20 pptv after 100 ms and 750 ppqv after 1 min of measurement (signal integration) time. The mass resolution of these devices is between 7000 and 10,500 m/Δm, so it is possible to separate most common isobaric VOCs and quantify them independently.

Chemical fingerprinting and breath analysis

The exhaled human breath contains a few thousand volatile organic compounds and is used in breath biopsy to serve as a VOC biomarker to test for diseases, such as lung cancer. One study has shown that "volatile organic compounds ... are mainly blood borne and therefore enable monitoring of different processes in the body." And it appears that VOC compounds in the body "may be either produced by metabolic processes or inhaled/absorbed from exogenous sources" such as environmental tobacco smoke. Chemical fingerprinting and breath analysis of volatile organic compounds has also been demonstrated with chemical sensor arrays, which utilize pattern recognition for detection of component volatile organics in complex mixtures such as breath gas.

Metrology for VOC measurements

To achieve comparability of VOC measurements, reference standards traceable to SI units are required. For a number of VOCs, gaseous reference standards are available from specialty gas suppliers or national metrology institutes, either in the form of cylinders or dynamic generation methods. However, for many VOCs, such as oxygenated VOCs, monoterpenes, or formaldehyde, no standards are available at the appropriate amount fraction because of the chemical reactivity or adsorption of these molecules. Currently, several national metrology institutes are working on the lacking standard gas mixtures at trace-level concentration, minimising adsorption processes, and improving the zero gas. The ultimate aim is for the traceability and the long-term stability of the standard gases to be in accordance with the data quality objectives (DQO, maximum uncertainty of 20% in this case) required by the WMO/GAW program.

Gaussian process

From Wikipedia, the free encyclopedia

In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

The concept of Gaussian processes is named after Carl Friedrich Gauss because it is based on the notion of the Gaussian distribution (normal distribution). Gaussian processes can be seen as an infinite-dimensional generalization of multivariate normal distributions.

Gaussian processes are useful in statistical modelling, benefiting from properties inherited from the normal distribution. For example, if a random process is modelled as a Gaussian process, the distributions of various derived quantities can be obtained explicitly. Such quantities include the average value of the process over a range of times and the error in estimating the average using sample values at a small set of times. While exact models often scale poorly as the amount of data increases, multiple approximation methods have been developed which often retain good accuracy while drastically reducing computation time.

Definition

A time continuous stochastic process {X_t ; t ∈ T} is Gaussian if and only if for every finite set of indices t_1, …, t_k in the index set T,

(X_{t_1}, …, X_{t_k})

is a multivariate Gaussian random variable. That is the same as saying every linear combination of (X_{t_1}, …, X_{t_k}) has a univariate normal (or Gaussian) distribution.
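The defining property can be illustrated numerically: choosing any finite grid of indices and a covariance function (here a squared-exponential kernel, an arbitrary choice for the example) reduces the process to an ordinary multivariate normal that can be sampled directly:

import numpy as np

rng = np.random.default_rng(2)

def sq_exp(t1, t2, length=1.0):
    """Squared-exponential covariance k(t1, t2) = exp(-(t1 - t2)^2 / (2 * length^2))."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

# Any finite set of indices t_1, ..., t_k yields a k-dimensional Gaussian vector.
t = np.linspace(0.0, 5.0, 50)
K = sq_exp(t, t) + 1e-10 * np.eye(t.size)        # tiny jitter for numerical stability
samples = rng.multivariate_normal(mean=np.zeros(t.size), cov=K, size=3)
print(samples.shape)                             # (3, 50): three sample paths on the grid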

Using characteristic functions of random variables, the Gaussian property can be formulated as follows: {X_t ; t ∈ T} is Gaussian if and only if, for every finite set of indices t_1, …, t_k, there are real-valued σ_ℓj, μ_ℓ with σ_jj > 0 such that the following equality holds for all s_1, …, s_k ∈ ℝ:

E[ exp( i Σ_ℓ s_ℓ X_{t_ℓ} ) ] = exp( −(1/2) Σ_{ℓ,j} σ_ℓj s_ℓ s_j + i Σ_ℓ μ_ℓ s_ℓ ),

where i denotes the imaginary unit such that i² = −1.

The numbers σ_ℓj and μ_ℓ can be shown to be the covariances and means of the variables in the process.

Variance

The variance of a Gaussian process is finite at any time t; formally,

var[X(t)] = E[ |X(t) − E[X(t)]|² ] < ∞   for all t ∈ T.

Stationarity

For general stochastic processes strict-sense stationarity implies wide-sense stationarity but not every wide-sense stationary stochastic process is strict-sense stationary. However, for a Gaussian stochastic process the two concepts are equivalent.

A Gaussian stochastic process is strict-sense stationary if, and only if, it is wide-sense stationary.

Example

There is an explicit representation for stationary Gaussian processes. A simple example of this representation is

X_t = cos(at) ξ_1 + sin(at) ξ_2,

where ξ_1 and ξ_2 are independent random variables with the standard normal distribution.
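A small simulation of this representation (with an arbitrary frequency a) shows that the empirical covariance of X_s and X_{s+h} depends only on the lag h, as stationarity requires:

import numpy as np

rng = np.random.default_rng(3)
a = 2.0                                            # arbitrary frequency for illustration
n_draws = 200_000

xi1 = rng.standard_normal(n_draws)
xi2 = rng.standard_normal(n_draws)

def X(t):
    return xi1 * np.cos(a * t) + xi2 * np.sin(a * t)

# Empirical covariances for different start times s but the same lag h agree,
# and match the theoretical value cos(a * h).
h = 0.7
for s in (0.0, 1.0, 2.5):
    print(s, np.mean(X(s) * X(s + h)))             # each ≈ cos(2.0 * 0.7) ≈ 0.170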

Covariance functions

A key fact of Gaussian processes is that they can be completely defined by their second-order statistics. Thus, if a Gaussian process is assumed to have mean zero, defining the covariance function completely defines the process' behaviour. Importantly the non-negative definiteness of this function enables its spectral decomposition using the Karhunen–Loève expansion. Basic aspects that can be defined through the covariance function are the process' stationarity, isotropy, smoothness and periodicity.

Stationarity refers to the process' behaviour regarding the separation of any two points x and x′. If the process is stationary, the covariance function depends only on x − x′. For example, the Ornstein–Uhlenbeck process is stationary.

If the process depends only on |x − x′|, the Euclidean distance (not the direction) between x and x′, then the process is considered isotropic. A process that is concurrently stationary and isotropic is considered to be homogeneous; in practice these properties reflect the differences (or rather the lack of them) in the behaviour of the process given the location of the observer.

Ultimately Gaussian processes translate as taking priors on functions, and the smoothness of these priors can be induced by the covariance function. If we expect that for "near-by" input points x and x′ their corresponding output points y and y′ are "near-by" also, then the assumption of continuity is present. If we wish to allow for significant displacement then we might choose a rougher covariance function. Extreme examples of this behaviour are the Ornstein–Uhlenbeck covariance function and the squared exponential, where the former is never differentiable and the latter is infinitely differentiable.

Periodicity refers to inducing periodic patterns within the behaviour of the process. Formally, this is achieved by mapping the input x to a two-dimensional vector u(x) = (cos(x), sin(x)).

Usual covariance functions

The effect of choosing different kernels on the prior function distribution of the Gaussian process. Left is a squared exponential kernel. Middle is Brownian. Right is quadratic.

There are a number of common covariance functions:

  • Constant: K_C(x, x′) = C
  • Linear: K_L(x, x′) = xᵀx′
  • White Gaussian noise: K_GN(x, x′) = σ² δ_{x,x′}
  • Squared exponential: K_SE(x, x′) = exp(−|d|² / (2ℓ²))
  • Ornstein–Uhlenbeck: K_OU(x, x′) = exp(−|d| / ℓ)
  • Matérn: K_Matern(x, x′) = (2^(1−ν) / Γ(ν)) · (√(2ν) |d| / ℓ)^ν · K_ν(√(2ν) |d| / ℓ)
  • Periodic: K_P(x, x′) = exp(−(2/ℓ²) sin²(d/2))
  • Rational quadratic: K_RQ(x, x′) = (1 + |d|²)^(−α), α ≥ 0

Here d = x − x′. The parameter ℓ is the characteristic length-scale of the process (practically, "how close" two points x and x′ have to be to influence each other significantly), δ is the Kronecker delta, and σ is the standard deviation of the noise fluctuations. Moreover, K_ν is the modified Bessel function of order ν and Γ(ν) is the gamma function evaluated at ν. Importantly, a complicated covariance function can be defined as a linear combination of other simpler covariance functions in order to incorporate different insights about the data-set at hand.
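For illustration, a few of the covariance functions listed above can be written as short Python functions; the parameter names follow the description above, and the final lines show a linear combination of kernels:

import numpy as np

def sq_exp(x1, x2, length=1.0):
    """Squared exponential: exp(-|d|^2 / (2*l^2)), with d = x1 - x2."""
    return np.exp(-0.5 * ((x1 - x2) / length) ** 2)

def ornstein_uhlenbeck(x1, x2, length=1.0):
    """Ornstein-Uhlenbeck: exp(-|d| / l); continuous but never differentiable."""
    return np.exp(-np.abs(x1 - x2) / length)

def rational_quadratic(x1, x2, alpha=1.0):
    """Rational quadratic: (1 + |d|^2)^(-alpha), alpha >= 0."""
    return (1.0 + (x1 - x2) ** 2) ** (-alpha)

def white_noise(x1, x2, sigma=1.0):
    """White Gaussian noise: sigma^2 times the Kronecker delta."""
    return sigma ** 2 * float(x1 == x2)

# Kernels can be combined linearly, e.g. a smooth trend plus observation noise:
k = lambda x1, x2: sq_exp(x1, x2, length=2.0) + white_noise(x1, x2, sigma=0.1)
print(k(0.3, 0.3), k(0.3, 1.5))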

The inferential results are dependent on the values of the hyperparameters θ (e.g. ℓ and σ) defining the model's behaviour. A popular choice for θ is to provide maximum a posteriori (MAP) estimates of it with some chosen prior. If the prior is very near uniform, this is the same as maximizing the marginal likelihood of the process; the marginalization being done over the observed process values y. This approach is also known as maximum likelihood II, evidence maximization, or empirical Bayes.

Continuity

For a Gaussian process, continuity in probability is equivalent to mean-square continuity, and continuity with probability one is equivalent to sample continuity. The latter implies, but is not implied by, continuity in probability. Continuity in probability holds if and only if the mean and autocovariance are continuous functions. In contrast, sample continuity was challenging even for stationary Gaussian processes (as probably noted first by Andrey Kolmogorov), and more challenging for more general processes. As usual, by a sample continuous process one means a process that admits a sample continuous modification.

Stationary case

For a stationary Gaussian process X = (X_t)_{t ∈ ℝ}, some conditions on its spectrum are sufficient for sample continuity, but fail to be necessary. A necessary and sufficient condition, sometimes called the Dudley–Fernique theorem, involves the function σ defined by

σ(h) = √( E[ (X(t+h) − X(t))² ] )

(the right-hand side does not depend on t due to stationarity). Continuity of X in probability is equivalent to continuity of σ at 0. When convergence of σ(h) to 0 (as h → 0) is too slow, sample continuity of X may fail. Convergence of the following integrals matters:

I(σ) = ∫₀¹ σ(h) / (h √(log(1/h))) dh = ∫₀^∞ 2 σ(e^(−x²)) dx,

these two integrals being equal according to integration by substitution h = e^(−x²), x = √(log(1/h)). The first integrand need not be bounded as h → 0+, thus the integral may converge (I(σ) < ∞) or diverge (I(σ) = ∞). Taking for example σ(e^(−x²)) = 1/x^a for large x, that is, σ(h) = (log(1/h))^(−a/2) for small h, one obtains I(σ) < ∞ when a > 1, and I(σ) = ∞ when 0 < a ≤ 1. In these two cases the function σ is increasing on [0, ∞), but generally it is not. Moreover, the condition

(*)   there exists ε > 0 such that σ is monotone on [0, ε]

does not follow from continuity of σ and the evident relations σ(h) ≥ 0 (for all h) and σ(0) = 0.

Theorem 1 — Let σ be continuous and satisfy (*). Then the condition I(σ) < ∞ is necessary and sufficient for sample continuity of X.

Some history. Sufficiency was announced by Xavier Fernique in 1964, but the first proof was published by Richard M. Dudley in 1967. Necessity was proved by Michael B. Marcus and Lawrence Shepp in 1970.

There exist sample continuous processes X that violate condition (*). An example found by Marcus and Shepp is a random lacunary Fourier series

X_t = Σ_{n=1}^∞ c_n [ ξ_n cos(λ_n t) + η_n sin(λ_n t) ],

where ξ_n, η_n are independent random variables with standard normal distribution; frequencies 0 < λ_1 < λ_2 < … are a fast growing sequence; and coefficients c_n > 0 satisfy Σ_n c_n < ∞. The latter relation implies

E[ Σ_n c_n ( |ξ_n| + |η_n| ) ] = Σ_n c_n E[ |ξ_n| + |η_n| ] = const · Σ_n c_n < ∞,

whence Σ_n c_n ( |ξ_n| + |η_n| ) < ∞ almost surely, which ensures uniform convergence of the Fourier series almost surely, and sample continuity of X.

Autocorrelation of a random lacunary Fourier series

Its autocovariation function

E[ X_t X_{t+h} ] = Σ_{n=1}^∞ c_n² cos(λ_n h)

is nowhere monotone (see the picture), as well as the corresponding function σ,

σ(h) = √( 2 Σ_{n=1}^∞ c_n² sin²(λ_n h / 2) ).

Brownian motion as the integral of Gaussian processes

A Wiener process (also known as Brownian motion) is the integral of a white noise generalized Gaussian process. It is not stationary, but it has stationary increments.

The Ornstein–Uhlenbeck process is a stationary Gaussian process.

The Brownian bridge is (like the Ornstein–Uhlenbeck process) an example of a Gaussian process whose increments are not independent.

The fractional Brownian motion is a Gaussian process whose covariance function is a generalisation of that of the Wiener process.

Driscoll's zero-one law

Driscoll's zero-one law is a result characterizing the sample functions generated by a Gaussian process.

Let f be a mean-zero Gaussian process {f(t) ; t ∈ T} with non-negative definite covariance function K, and let H(R) be a reproducing kernel Hilbert space with positive definite kernel R.

Then

lim_{n→∞} tr[ K_n R_n⁻¹ ] < ∞,

where K_n and R_n are the covariance matrices of all possible pairs of n points, implies

Pr[ f ∈ H(R) ] = 1.

Moreover,

lim_{n→∞} tr[ K_n R_n⁻¹ ] = ∞   implies   Pr[ f ∈ H(R) ] = 0.

This has significant implications when K = R, as

lim_{n→∞} tr[ K_n K_n⁻¹ ] = lim_{n→∞} n = ∞.

As such, almost all sample paths of a mean-zero Gaussian process with positive definite kernel K will lie outside of the Hilbert space H(K).

Linearly constrained Gaussian processes

For many applications of interest some pre-existing knowledge about the system at hand is already given. Consider e.g. the case where the output of the Gaussian process corresponds to a magnetic field; here, the real magnetic field is bound by Maxwell's equations and a way to incorporate this constraint into the Gaussian process formalism would be desirable as this would likely improve the accuracy of the algorithm.

A method on how to incorporate linear constraints into Gaussian processes already exists:

Consider the (vector-valued) output function f(x) which is known to obey the linear constraint (i.e. F_X is a linear operator)

F_X( f(x) ) = 0.

Then the constraint F_X can be fulfilled by choosing f(x) = G_X( g(x) ), where g(x) ~ GP(μ_g, K_g) is modelled as a Gaussian process, and finding G_X such that

F_X( G_X ) = 0.

Given G_X and using the fact that Gaussian processes are closed under linear transformations, the Gaussian process for f obeying the constraint F_X becomes

f(x) = G_X g ~ GP( G_X μ_g, G_X K_g G_Xᵀ ).

Hence, linear constraints can be encoded into the mean and covariance function of a Gaussian process.

Applications

An example of Gaussian Process Regression (prediction) compared with other regression models.

A Gaussian process can be used as a prior probability distribution over functions in Bayesian inference. Given any set of N points in the desired domain of your functions, take a multivariate Gaussian whose covariance matrix parameter is the Gram matrix of your N points with some desired kernel, and sample from that Gaussian. For solution of the multi-output prediction problem, Gaussian process regression for vector-valued functions was developed. In this method, a 'big' covariance is constructed, which describes the correlations between all the input and output variables taken in N points in the desired domain. This approach was elaborated in detail for the matrix-valued Gaussian processes and generalised to processes with 'heavier tails' like Student-t processes.

Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or kriging; extending Gaussian process regression to multiple target variables is known as cokriging. Gaussian processes are thus useful as a powerful non-linear multivariate interpolation tool.

Gaussian processes are also commonly used to tackle numerical analysis problems such as numerical integration, solving differential equations, or optimisation in the field of probabilistic numerics.

Gaussian processes can also be used in the context of mixture of experts models, for example. The underlying rationale of such a learning framework consists in the assumption that a given mapping cannot be well captured by a single Gaussian process model. Instead, the observation space is divided into subsets, each of which is characterized by a different mapping function; each of these is learned via a different Gaussian process component in the postulated mixture.

In the natural sciences, Gaussian processes have found use as probabilistic models of astronomical time series and as predictors of molecular properties.

Gaussian process prediction, or Kriging

Gaussian Process Regression (prediction) with a squared exponential kernel. The left plot shows draws from the prior function distribution, the middle shows draws from the posterior, and the right shows the mean prediction with one standard deviation shaded.

When concerned with a general Gaussian process regression problem (Kriging), it is assumed that for a Gaussian process f observed at coordinates x, the vector of values f(x) is just one sample from a multivariate Gaussian distribution of dimension equal to the number of observed coordinates |x|. Therefore, under the assumption of a zero-mean distribution, f(x) ~ N(0, K(θ, x, x′)), where K(θ, x, x′) is the covariance matrix between all possible pairs (x, x′) for a given set of hyperparameters θ. As such, the log marginal likelihood is:

log p(f(x) | θ, x) = −(1/2) [ f(x)ᵀ K(θ, x, x′)⁻¹ f(x) + log det( K(θ, x, x′) ) + |x| log 2π ],

and maximizing this marginal likelihood towards θ provides the complete specification of the Gaussian process f. One can briefly note at this point that the first term corresponds to a penalty term for a model's failure to fit observed values and the second term to a penalty term that increases proportionally to a model's complexity. Having specified θ, making predictions about unobserved values f(x*) at coordinates x* is then only a matter of drawing samples from the predictive distribution p(y* | x*, f(x), x) = N(y* | A, B), where the posterior mean estimate A is defined as

A = K(θ, x*, x) K(θ, x, x′)⁻¹ f(x)

and the posterior variance estimate B is defined as:

B = K(θ, x*, x*) − K(θ, x*, x) K(θ, x, x′)⁻¹ K(θ, x*, x)ᵀ,

where K(θ, x*, x) is the covariance between the new coordinate of estimation x* and all other observed coordinates x for a given hyperparameter vector θ, K(θ, x, x′) and f(x) are defined as before, and K(θ, x*, x*) is the variance at point x* as dictated by θ. It is important to note that practically the posterior mean estimate A of f(x*) (the "point estimate") is just a linear combination of the observations f(x); in a similar manner the variance of f(x*) is actually independent of the observations f(x). A known bottleneck in Gaussian process prediction is that the computational complexity of inference and likelihood evaluation is cubic in the number of points |x|, and as such can become unfeasible for larger data sets. Works on sparse Gaussian processes, which usually are based on the idea of building a representative set for the given process f, try to circumvent this issue. The kriging method can be used in the latent level of a nonlinear mixed-effects model for a spatial functional prediction: this technique is called latent kriging.
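A minimal numpy sketch of these formulas, using a squared-exponential kernel and synthetic data (both arbitrary choices for the example), computes the log marginal likelihood and the posterior mean A and variance B; a Cholesky factorization replaces the explicit inverse:

import numpy as np

rng = np.random.default_rng(4)

def kernel(x1, x2, length=1.0, sigma_f=1.0):
    """Squared-exponential covariance matrix between coordinate vectors x1 and x2."""
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)

# Synthetic observations of a zero-mean process (illustrative only).
x = np.linspace(0.0, 6.0, 20)
f = np.sin(x) + 0.1 * rng.standard_normal(x.size)
x_star = np.linspace(0.0, 6.0, 100)

K = kernel(x, x) + 1e-6 * np.eye(x.size)         # jitter for numerical stability
L = np.linalg.cholesky(K)                        # the O(n^3) step noted under "Computational issues"
alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))   # K^{-1} f without forming the inverse

# Log marginal likelihood: -1/2 [ f^T K^{-1} f + log det K + n log(2*pi) ]
log_ml = -0.5 * (f @ alpha) - np.log(np.diag(L)).sum() - 0.5 * x.size * np.log(2 * np.pi)

# Posterior mean A = K(x*, x) K^{-1} f and variance B = K(x*, x*) - K(x*, x) K^{-1} K(x, x*)
K_star = kernel(x_star, x)
A = K_star @ alpha
v = np.linalg.solve(L, K_star.T)
B = kernel(x_star, x_star) - v.T @ v

print(log_ml, A.shape, np.diag(B).min())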

Often, the covariance has the form K(θ, x, x′) = σ² K̃(x, x′), where σ² is a scaling parameter. Examples are the Matérn class covariance functions. If this scaling parameter σ² is either known or unknown (i.e. must be marginalized), then the posterior probability p(θ | D), i.e. the probability for the hyperparameters θ given a set of data pairs D of observations of x and f(x), admits an analytical expression.

Bayesian neural networks as Gaussian processes

Bayesian neural networks are a particular type of Bayesian network that results from treating deep learning and artificial neural network models probabilistically, and assigning a prior distribution to their parameters. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. As layer width grows large, many Bayesian neural networks reduce to a Gaussian process with a closed form compositional kernel. This Gaussian process is called the Neural Network Gaussian Process (NNGP). It allows predictions from Bayesian neural networks to be more efficiently evaluated, and provides an analytic tool to understand deep learning models.

Computational issues

In practical applications, Gaussian process models are often evaluated on a grid leading to multivariate normal distributions. Using these models for prediction or parameter estimation using maximum likelihood requires evaluating a multivariate Gaussian density, which involves calculating the determinant and the inverse of the covariance matrix. Both of these operations have cubic computational complexity which means that even for grids of modest sizes, both operations can have a prohibitive computational cost. This drawback led to the development of multiple approximation methods.

Politics of Europe

From Wikipedia, the free encyclopedia ...