A Medley of Potpourri

Tuesday, November 5, 2024

Synthetic data

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Synthetic_data

Synthetic data are artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.

Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated.

Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public; synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.

Usefulness

Synthetic data is generated to meet specific needs or certain conditions that may not be found in the original, real data. One of the hurdles in applying up-to-date machine learning approaches to complex scientific tasks is the scarcity of labeled data, a gap that can be bridged by synthetic data that closely replicates real experimental data. This can be useful when designing many kinds of systems, from simulations based on theoretical values to database processors, and it helps detect and solve unexpected issues such as information processing limitations. Synthetic data is often generated to represent the authentic data and allows a baseline to be set. Another benefit of synthetic data is that it protects the privacy and confidentiality of authentic data while still allowing its use in testing systems.

A science article's abstract, quoted below, describes software that generates synthetic data for testing fraud detection systems. "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment." In defense and military contexts, synthetic data is seen as a potentially valuable tool to develop and improve complex AI systems, particularly in contexts where high-quality real-world data is scarce. At the same time, synthetic data together with the testing approach can give the ability to model

History

Scientific modelling of physical systems, which makes it possible to run simulations that estimate, compute, or generate data points that have not been observed in reality, has a long history that runs concurrent with the history of physics itself. For example, research into the synthesis of audio and voice can be traced back to the 1930s and before, driven forward by developments such as the telephone and audio recording. Digitization gave rise to software synthesizers from the 1970s onwards.

In the context of privacy-preserving statistical analysis, the idea of fully synthetic data was introduced by Rubin in 1993. Rubin originally designed it to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records, thereby preserving the anonymity of the households. Later that year, the idea of partially synthetic data was introduced by Little, who used it to synthesize the sensitive values on the public use file.

A 1993 work fitted a statistical model to 60,000 MNIST digits and then used the model to generate over 1 million examples. Those examples were used to train a LeNet-4 to reach state-of-the-art performance.

In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling. Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. Similarly they came up with the technique of Sequential Regression Multivariate Imputation.

Calculations

Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms".

Synthetic data can be generated through the use of random lines, having different orientations and starting positions. Datasets can get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data the best. This model or equation will be called a synthesizer build. This build can be used to generate more data.

Constructing a synthesizer build involves constructing a statistical model. In a linear regression line example, the original data can be plotted, and a best fit linear line can be created from the data. This line is a synthesizer created from the original data. The next step will be generating more synthetic data from the synthesizer build or from this linear line equation. In this way, the new data can be used for studies and research, and it protects the confidentiality of the original data.
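The linear regression case described above can be sketched in a few lines. This is a minimal illustration only: the "original" data here is a made-up stand-in, the function names are hypothetical, and adding Gaussian noise with the fitted residual spread is one assumed way of sampling from the line, not a method stated in the text.

```python
import numpy as np

def fit_linear_synthesizer(x, y):
    """Fit y = slope * x + intercept and estimate the residual spread."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest power first
    residuals = y - (slope * x + intercept)
    sigma = residuals.std(ddof=2)  # two fitted parameters
    return slope, intercept, sigma

def sample_synthetic(model, x_new, rng):
    """Draw synthetic y values from the fitted line plus Gaussian noise (assumed noise model)."""
    slope, intercept, sigma = model
    return slope * x_new + intercept + rng.normal(0.0, sigma, size=x_new.shape)

rng = np.random.default_rng(0)

# Hypothetical stand-in for the confidential original data.
x_orig = np.linspace(0.0, 10.0, 50)
y_orig = 2.0 * x_orig + 1.0 + rng.normal(0.0, 0.5, size=x_orig.shape)

model = fit_linear_synthesizer(x_orig, y_orig)

# New synthetic points: none of them is a record from the original data.
x_syn = rng.uniform(0.0, 10.0, size=200)
y_syn = sample_synthetic(model, x_syn, rng)
```

Only the three fitted numbers (slope, intercept, spread) are carried over from the original data, which is why the synthetic sample can be shared more freely than the original records.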

David Jensen from the Knowledge Discovery Laboratory explains how to generate synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, Proximity can generate synthetic data having one of several types of graph structure: random graphs that are generated by some random process; lattice graphs having a ring structure; lattice graphs having a grid structure, etc. In all cases, data generation follows the same steps:

Generate the empty graph structure.
Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
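The two steps above can be sketched for the ring-lattice case. This is an illustration under stated assumptions, not the Proximity implementation: the function names, the `autocorr` parameter, and the neighbour-copying sweep used to make values depend on related objects are all hypothetical choices made for the example.

```python
import random

def ring_lattice(n, k=1):
    """Step 1: empty graph structure. Node i is linked to its k nearest neighbours on each side."""
    return {i: sorted({(i + d) % n for d in range(-k, k + 1) if d != 0})
            for i in range(n)}

def generate_attributes(graph, prior, autocorr, sweeps, rng):
    """Step 2: assign attribute values from user-supplied prior probabilities.

    Values are assigned collectively: on each sweep a node either copies a
    random neighbour's value (with probability autocorr) or is redrawn from
    the prior. This update rule is an assumption made for illustration.
    """
    labels = list(prior)
    weights = [prior[label] for label in labels]
    values = {node: rng.choices(labels, weights)[0] for node in graph}
    for _ in range(sweeps):
        for node in graph:
            if rng.random() < autocorr:
                values[node] = values[rng.choice(graph[node])]
            else:
                values[node] = rng.choices(labels, weights)[0]
    return values

rng = random.Random(42)
graph = ring_lattice(20, k=1)
values = generate_attributes(graph, {"A": 0.7, "B": 0.3}, autocorr=0.8, sweeps=5, rng=rng)
```

Raising `autocorr` makes linked nodes more likely to share a value, which is one simple way to produce the auto-correlation the text mentions.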

Applications

Fraud detection and confidentiality systems

Fraud detection and confidentiality systems are tested and trained using synthetic data. Specific algorithms and generators are designed to create realistic data, which then assists in teaching a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data were not used, the software would only be trained to react to the situations provided by the authentic data, and it might not recognize another type of intrusion.

Scientific research

Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing.

Real data can contain information that researchers may not want released, so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual.

Machine learning

Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault. In general, synthetic data has several natural advantages:

once the synthetic environment is ready, it is fast and cheap to produce as much data as needed;
synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand;
the synthetic environment can be modified to improve the model and training;
synthetic data can be used as a substitute for certain real data segments that contain, e.g., sensitive information.

This usage of synthetic data has been proposed for computer vision applications, in particular object detection, where the synthetic environment is a 3D model of the object, and learning to navigate environments by visual information.

At the same time, transfer learning remains a nontrivial problem, and synthetic data has not become ubiquitous yet. Research results indicate that adding a small amount of real data significantly improves transfer learning with synthetic data. Advances in generative adversarial networks (GANs) lead to the natural idea that one can produce data and then use it for training. Since at least 2016, such adversarial training has been successfully used to produce synthetic data of sufficient quality to produce state-of-the-art results in some domains, without even needing to re-mix real data in with the generated synthetic data.

Examples

In 1987, a Navlab autonomous vehicle used 1200 synthetic road images as one approach to training.

In 2021, Microsoft released a database of 100,000 synthetic faces, based on 500 real faces, that claims to "match real data in accuracy".

Optical depth

From Wikipedia, the free encyclopedia

In physics, optical depth or optical thickness is the natural logarithm of the ratio of incident to transmitted radiant power through a material. Thus, the larger the optical depth, the smaller the amount of transmitted radiant power through the material. Spectral optical depth or spectral optical thickness is the natural logarithm of the ratio of incident to transmitted spectral radiant power through a material. Optical depth is dimensionless, and in particular is not a length, though it is a monotonically increasing function of optical path length, and approaches zero as the path length approaches zero. The use of the term "optical density" for optical depth is discouraged.

In chemistry, a closely related quantity called "absorbance" or "decadic absorbance" is used instead of optical depth: the common logarithm of the ratio of incident to transmitted radiant power through a material. It is the optical depth divided by $\ln 10$, because of the different logarithm bases used.

Mathematical definitions

Optical depth

Optical depth of a material, denoted $τ$ , is given by: $τ = \ln (\frac{Φ_{e}^{i}}{Φ_{e}^{t}}) = - \ln T$ where

$Φ_{e}^{i}$ is the radiant flux received by that material;
$Φ_{e}^{t}$ is the radiant flux transmitted by that material;
$T$ is the transmittance of that material.

The absorbance $A$ is related to optical depth by: $τ = A \ln 10$
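As a quick numerical check of these definitions, the sketch below converts between radiant flux, optical depth, transmittance, and absorbance. The function names and the flux values are illustrative assumptions.

```python
import math

def optical_depth(flux_in, flux_out):
    """tau = ln(incident flux / transmitted flux)."""
    return math.log(flux_in / flux_out)

def transmittance(tau):
    """T = exp(-tau), the inverse of tau = -ln T."""
    return math.exp(-tau)

def absorbance(tau):
    """A = tau / ln 10, since tau = A ln 10."""
    return tau / math.log(10)

# Hypothetical example: one tenth of the incident flux is transmitted.
tau = optical_depth(100.0, 10.0)
T = transmittance(tau)
A = absorbance(tau)
```

With a tenfold reduction in flux, the optical depth is ln 10 and the absorbance is 1, which shows directly where the factor of ln 10 between the two quantities comes from.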

Spectral optical depth

Spectral optical depth in frequency and spectral optical depth in wavelength of a material, denoted $τ_{ν}$ and $τ_{λ}$ respectively, are given by: $τ_{ν} = \ln (\frac{Φ_{e, ν}^{i}}{Φ_{e, ν}^{t}}) = - \ln T_{ν},$ $τ_{λ} = \ln (\frac{Φ_{e, λ}^{i}}{Φ_{e, λ}^{t}}) = - \ln T_{λ},$ where

$Φ_{e, ν}^{t}$ is the spectral radiant flux in frequency transmitted by that material;
$Φ_{e, ν}^{i}$ is the spectral radiant flux in frequency received by that material;
$T_{ν}$ is the spectral transmittance in frequency of that material;
$Φ_{e, λ}^{t}$ is the spectral radiant flux in wavelength transmitted by that material;
$Φ_{e, λ}^{i}$ is the spectral radiant flux in wavelength received by that material;
$T_{λ}$ is the spectral transmittance in wavelength of that material.

Spectral absorbance is related to spectral optical depth by: $τ_{ν} = A_{ν} \ln 10,$ $τ_{λ} = A_{λ} \ln 10,$ where

$A_{ν}$ is the spectral absorbance in frequency;
$A_{λ}$ is the spectral absorbance in wavelength.

Relationship with attenuation

Attenuation

Optical depth measures the attenuation of the transmitted radiant power in a material. Attenuation can be caused by absorption, but also reflection, scattering, and other physical processes. Optical depth of a material is approximately equal to its attenuation when both the absorbance is much less than 1 and the emittance of that material (not to be confused with radiant exitance or emissivity) is much less than the optical depth: $Φ_{e}^{t} + Φ_{e}^{\mathrm{att}} = Φ_{e}^{i} + Φ_{e}^{e},$ $T + \mathrm{ATT} = 1 + E,$ where

Φ_e^t is the radiant power transmitted by that material;
Φ_e^att is the radiant power attenuated by that material;
Φ_eⁱ is the radiant power received by that material;
Φ_e^e is the radiant power emitted by that material;
T = Φ_e^t/Φ_eⁱ is the transmittance of that material;
ATT = Φ_e^att/Φ_eⁱ is the attenuation of that material;
E = Φ_e^e/Φ_eⁱ is the emittance of that material,

and according to the Beer–Lambert law, $T = e^{- τ},$ so: $\mathrm{ATT} = 1 - e^{- τ} + E \approx τ + E \approx τ,$ if $τ \ll 1$ and $E \ll τ$.

Attenuation coefficient

Optical depth of a material is also related to its attenuation coefficient by: $τ = \int_{0}^{l} α (z) d z,$ where

l is the thickness of that material through which the light travels;
α(z) is the attenuation coefficient or Napierian attenuation coefficient of that material at z,

and if α(z) is uniform along the path, the attenuation is said to be a linear attenuation and the relation becomes: $τ = α l$

Sometimes the relation is given using the attenuation cross section of the material, that is its attenuation coefficient divided by its number density: $τ = \int_{0}^{l} σ n (z) d z,$ where

σ is the attenuation cross section of that material;
n(z) is the number density of that material at z,

and if $n$ is uniform along the path, i.e., $n (z) \equiv N$ , the relation becomes: $τ = σ N l$
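The path integral above can be evaluated numerically when α(z) is not uniform. The sketch below uses a plain trapezoidal rule; the function name, the step count, and the example profiles are assumptions chosen for illustration. For the cross-section form, α(z) can be taken as σ·n(z).

```python
import math

def optical_depth_along_path(alpha, length, steps=1000):
    """Trapezoidal estimate of tau = integral from 0 to length of alpha(z) dz."""
    dz = length / steps
    total = 0.5 * (alpha(0.0) + alpha(length))
    for i in range(1, steps):
        total += alpha(i * dz)
    return total * dz

# Uniform medium: the integral reduces to tau = alpha * l.
tau_uniform = optical_depth_along_path(lambda z: 0.2, 5.0)

# Hypothetical non-uniform medium: cross section times an exponentially
# decaying number density, alpha(z) = sigma * n(z).
sigma = 0.5
tau_decay = optical_depth_along_path(lambda z: sigma * math.exp(-z / 2.0), 4.0)
```

The uniform case recovers τ = αl, and the decaying profile can be compared against its closed form, σ·2·(1 − e⁻²).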

Applications

Atomic physics

In atomic physics, the spectral optical depth of a cloud of atoms can be calculated from the quantum-mechanical properties of the atoms. It is given by $τ_{ν} = \frac{d^{2} n ν}{2 c ℏ ε_{0} σ γ}$ where

d is the transition dipole moment;
n is the number of atoms;
ν is the frequency of the beam;
c is the speed of light;
ħ is the reduced Planck constant;
ε₀ is the vacuum permittivity;
σ is the cross section of the beam;
γ is the natural linewidth of the transition.

Atmospheric sciences

In atmospheric sciences, one often refers to the optical depth of the atmosphere as corresponding to the vertical path from Earth's surface to outer space; at other times the optical path is from the observer's altitude to outer space. The optical depth for a slant path is τ = mτ′, where τ′ refers to a vertical path, m is called the relative airmass, and for a plane-parallel atmosphere it is determined as m = sec θ where θ is the zenith angle corresponding to the given path. Therefore, $T = e^{- τ} = e^{- m τ'}.$ The optical depth of the atmosphere can be divided into several components, ascribed to Rayleigh scattering, aerosols, and gaseous absorption. The optical depth of the atmosphere can be measured with a Sun photometer.
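The slant-path relation can be illustrated with a short sketch for the plane-parallel case. The function names and the vertical optical depth of 0.3 are hypothetical; m = sec θ becomes inaccurate near the horizon, where the plane-parallel assumption breaks down.

```python
import math

def relative_airmass(zenith_deg):
    """Plane-parallel relative airmass, m = sec(theta)."""
    return 1.0 / math.cos(math.radians(zenith_deg))

def slant_transmittance(tau_vertical, zenith_deg):
    """T = exp(-m * tau'), with tau' the vertical optical depth."""
    return math.exp(-relative_airmass(zenith_deg) * tau_vertical)

tau_vertical = 0.3  # hypothetical vertical optical depth
T_overhead = slant_transmittance(tau_vertical, 0.0)
T_slanted = slant_transmittance(tau_vertical, 60.0)
```

At a zenith angle of 60° the airmass is 2, so the slant transmittance is the square of the overhead value.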

The optical depth with respect to the height within the atmosphere is given by $τ (z) = k_{a} w_{1} ρ_{0} H e^{- z / H}$ and it follows that the total atmospheric optical depth is given by $τ (0) = k_{a} w_{1} ρ_{0} H$

In both equations:

k_a is the absorption coefficient
w₁ is the mixing ratio
ρ₀ is the density of air at sea level
H is the scale height of the atmosphere
z is the height in question
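The height dependence can be recovered in one step. As a sketch, assume that air density falls off exponentially with the scale height, and that the absorption coefficient and mixing ratio are constant with height (these assumptions are not stated explicitly in the text above). Integrating the absorber along the vertical path from z upward gives:

```latex
\begin{aligned}
ρ(z) &= ρ_{0} e^{-z/H}, \\
τ(z) &= \int_{z}^{\infty} k_{a} w_{1} ρ(z') \, dz'
      = k_{a} w_{1} ρ_{0} \int_{z}^{\infty} e^{-z'/H} \, dz'
      = k_{a} w_{1} ρ_{0} H e^{-z/H}.
\end{aligned}
```

Setting z = 0 gives the total atmospheric optical depth quoted above.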

The optical depth of a plane parallel cloud layer is given by $τ = Q_{e} {[\frac{9 π L^{2} H N}{16 ρ_{l}^{2}}]}^{1 / 3}$ where:

Q_e is the extinction efficiency
L is the liquid water path
H is the geometrical thickness
N is the concentration of droplets
ρ_l is the density of liquid water

So, with a fixed depth and total liquid water path, $τ \propto N^{1 / 3}$ .
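The cube-root scaling can be checked numerically. The sketch below evaluates the cloud-layer formula; the function name, the SI units, and the input values are assumptions chosen for illustration rather than measured cloud properties.

```python
import math

def cloud_optical_depth(q_e, lwp, thickness, n_droplets, rho_l=1000.0):
    """tau = Q_e * (9 pi L^2 H N / (16 rho_l^2))^(1/3).

    Assumed SI units: lwp in kg/m^2, thickness in m,
    n_droplets in m^-3, rho_l in kg/m^3.
    """
    inner = 9.0 * math.pi * lwp ** 2 * thickness * n_droplets / (16.0 * rho_l ** 2)
    return q_e * inner ** (1.0 / 3.0)

# Hypothetical cloud layer; only the droplet concentration differs.
tau_base = cloud_optical_depth(2.0, 0.1, 300.0, 1.0e8)
tau_doubled = cloud_optical_depth(2.0, 0.1, 300.0, 2.0e8)
ratio = tau_doubled / tau_base
```

Doubling the droplet concentration at fixed thickness and liquid water path raises the optical depth by a factor of 2^(1/3), about 26%.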

Astronomy

In astronomy, the photosphere of a star is defined as the surface where its optical depth is 2/3. This means that each photon emitted at the photosphere suffers an average of less than one scattering before it reaches the observer. At the temperature at optical depth 2/3, the energy emitted by the star (the original derivation is for the Sun) matches the observed total energy emitted.

Note that the optical depth of a given medium will be different for different colors (wavelengths) of light.

For planetary rings, the optical depth is the (negative logarithm of the) proportion of light blocked by the ring when it lies between the source and the observer. This is usually obtained by observation of stellar occultations.

Monday, November 4, 2024

Crystal optics

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Crystal_optics

Crystal optics is the branch of optics that describes the behaviour of light in anisotropic media, that is, media (such as crystals) in which light behaves differently depending on which direction the light is propagating. The index of refraction depends on both composition and crystal structure and can be calculated using the Gladstone–Dale relation. Crystals are often naturally anisotropic, and in some media (such as liquid crystals) it is possible to induce anisotropy by applying an external electric field.

Isotropic media

Typical transparent media such as glasses are isotropic, which means that light behaves the same way no matter which direction it is travelling in the medium. In terms of Maxwell's equations in a dielectric, this gives a relationship between the electric displacement field D and the electric field E:

$D = ε_{0} E + P$

where ε₀ is the permittivity of free space and P is the electric polarization (the vector field corresponding to electric dipole moments present in the medium). Physically, the polarization field can be regarded as the response of the medium to the electric field of the light.

Electric susceptibility

In an isotropic and linear medium, this polarization field P is proportional and parallel to the electric field E:

$P = χ ε_{0} E$

where χ is the electric susceptibility of the medium. The relation between D and E is thus:

$D = ε_{0} E + χ ε_{0} E = ε_{0} (1 + χ) E = ε E$

where

$ε = ε_{0} (1 + χ)$

is the permittivity of the medium. The value 1+χ is called the relative permittivity of the medium, and is related to the refractive index n, for non-magnetic media, by

$n = \sqrt{1 + χ}$

Anisotropic media

In an anisotropic medium, such as a crystal, the polarisation field P is not necessarily aligned with the electric field of the light E. In a physical picture, this can be thought of as the dipoles induced in the medium by the electric field having certain preferred directions, related to the physical structure of the crystal. This can be written as:

$P = ε_{0} χ E.$

Here χ is not a number as before but a tensor of rank 2, the electric susceptibility tensor. In terms of components in 3 dimensions:

$(\begin{matrix} P_{x} \\ P_{y} \\ P_{z} \end{matrix}) = ε_{0} (\begin{matrix} χ_{x x} & χ_{x y} & χ_{x z} \\ χ_{y x} & χ_{y y} & χ_{y z} \\ χ_{z x} & χ_{z y} & χ_{z z} \end{matrix}) (\begin{matrix} E_{x} \\ E_{y} \\ E_{z} \end{matrix})$

or using the summation convention:

$P_{i} = ε_{0} \sum_{j \in \{x, y, z\}} χ_{i j} E_{j}.$

Since χ is a tensor, P is not necessarily colinear with E.

In nonmagnetic and transparent materials, χ_ij = χ_ji, i.e. the χ tensor is real and symmetric. In accordance with the spectral theorem, it is thus possible to diagonalise the tensor by choosing the appropriate set of coordinate axes, zeroing all components of the tensor except χ_xx, χ_yy and χ_zz. This gives the set of relations:

$P_{x} = ε_{0} χ_{x x} E_{x}$

$P_{y} = ε_{0} χ_{y y} E_{y}$

$P_{z} = ε_{0} χ_{z z} E_{z}$

The directions x, y and z are in this case known as the principal axes of the medium. Note that these axes will be orthogonal if all entries in the χ tensor are real, corresponding to a case in which the refractive index is real in all directions.

It follows that D and E are also related by a tensor:

$D = ε_{0} E + P = ε_{0} E + ε_{0} χ E = ε_{0} (I + χ) E = ε_{0} ε E.$

Here ε is known as the relative permittivity tensor or dielectric tensor. Consequently, the refractive index of the medium must also be a tensor. Consider a light wave propagating along the z principal axis, polarised such that the electric field of the wave is parallel to the x-axis. The wave experiences a susceptibility χ_xx and a permittivity ε_xx. The refractive index is thus:

$n_{x x} = (1 + χ_{x x})^{1 / 2} = (ε_{x x})^{1 / 2}.$

For a wave polarised in the y direction:

$n_{y y} = (1 + χ_{y y})^{1 / 2} = (ε_{y y})^{1 / 2}.$

Thus these waves will see two different refractive indices and travel at different speeds. This phenomenon is known as birefringence and occurs in some common crystals such as calcite and quartz.

If χ_xx = χ_yy ≠ χ_zz, the crystal is known as uniaxial. (See Optic axis of a crystal.) If χ_xx ≠ χ_yy and χ_yy ≠ χ_zz the crystal is called biaxial. A uniaxial crystal exhibits two refractive indices, an "ordinary" index (n_o) for light polarised in the x or y directions, and an "extraordinary" index (n_e) for polarisation in the z direction. A uniaxial crystal is "positive" if n_e > n_o and "negative" if n_e < n_o. Light polarised at some angle to the axes will experience a different phase velocity for different polarization components, and cannot be described by a single index of refraction. This is often depicted as an index ellipsoid.
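The diagonalisation and the uniaxial/biaxial classification can be sketched numerically. The code below assumes a real, symmetric susceptibility tensor and uses NumPy's `numpy.linalg.eigh` to find the principal axes; the function names, the tolerance, and the calcite-like index values are illustrative assumptions.

```python
import numpy as np

def principal_indices(chi):
    """Diagonalise a real symmetric susceptibility tensor.

    Returns the principal refractive indices n = sqrt(1 + chi), in ascending
    order, and the principal axes as the columns of a matrix.
    """
    chi = np.asarray(chi, dtype=float)
    if not np.allclose(chi, chi.T):
        raise ValueError("susceptibility tensor must be symmetric")
    chi_principal, axes = np.linalg.eigh(chi)
    return np.sqrt(1.0 + chi_principal), axes

def classify(indices, tol=1e-6):
    """Classify a crystal from its three principal indices (sorted ascending)."""
    a, b, c = sorted(indices)
    if b - a < tol and c - b < tol:
        return "isotropic"
    if b - a < tol:
        return "uniaxial positive"  # n_o = a = b, n_e = c > n_o
    if c - b < tol:
        return "uniaxial negative"  # n_o = b = c, n_e = a < n_o
    return "biaxial"

# Illustrative, calcite-like values (assumed): n_o = 1.658, n_e = 1.486.
n_o, n_e = 1.658, 1.486
chi_diag = np.diag([n_o ** 2 - 1.0, n_o ** 2 - 1.0, n_e ** 2 - 1.0])

# Express the tensor in laboratory axes rotated 30 degrees about x.
t = np.radians(30.0)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(t), -np.sin(t)],
              [0.0, np.sin(t), np.cos(t)]])
chi_lab = R @ chi_diag @ R.T

indices, axes = principal_indices(chi_lab)
kind = classify(indices)
```

Although the rotated tensor has off-diagonal components, diagonalising it recovers the same two distinct indices, so the classification does not depend on the choice of laboratory axes.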

Other effects

Certain nonlinear optical phenomena such as the electro-optic effect cause a variation of a medium's permittivity tensor when an external electric field is applied, proportional (to lowest order) to the strength of the field. This causes a rotation of the principal axes of the medium and alters the behaviour of light travelling through it; the effect can be used to produce light modulators.

In response to a magnetic field, some materials can have a dielectric tensor that is complex-Hermitian; this is called a gyro-magnetic or magneto-optic effect. In this case, the principal axes are complex-valued vectors, corresponding to elliptically polarized light, and time-reversal symmetry can be broken. This can be used to design optical isolators, for example.

A dielectric tensor that is not Hermitian gives rise to complex eigenvalues, which corresponds to a material with gain or absorption at a particular frequency.
