A Medley of Potpourri

Sunday, December 3, 2023

Structural bioinformatics

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Structural_bioinformatics

Structural bioinformatics is the branch of bioinformatics that is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structures such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology. The main objective of structural bioinformatics is the creation of new methods of analysing and manipulating biological macromolecular data in order to solve problems in biology and generate new knowledge.

Introduction

Protein structure

The structure of a protein is directly related to its function. The presence of certain chemical groups in specific locations allows proteins to act as enzymes, catalyzing several chemical reactions. In general, protein structures are classified into four levels: primary (sequences), secondary (local conformation of the polypeptide chain), tertiary (three-dimensional structure of the protein fold), and quaternary (association of multiple polypeptide structures). Structural bioinformatics mainly addresses interactions among structures taking into consideration their space coordinates. Thus, the primary structure is better analyzed in traditional branches of bioinformatics. However, the sequence implies restrictions that allow the formation of conserved local conformations of the polypeptide chain, such as alpha-helix, beta-sheets, and loops (secondary structure). Also, weak interactions (such as hydrogen bonds) stabilize the protein fold. Interactions could be intrachain, i.e., when occurring between parts of the same protein monomer (tertiary structure), or interchain, i.e., when occurring between different structures (quaternary structure). Finally, the topological arrangement of interactions, whether strong or weak, and entanglements is being studied in the field of structural bioinformatics, utilizing frameworks such as circuit topology.

Structure visualization

Structural visualization of bacteriophage T4 lysozyme (PDB ID: 2LZM). (A) Cartoon; (B) Lines; (C) Surface; (D) Sticks.

Protein structure visualization is an important issue for structural bioinformatics. It allows users to observe static or dynamic representations of the molecules, also allowing the detection of interactions that may be used to make inferences about molecular mechanisms. The most common types of visualization are:

Cartoon: this type of protein visualization highlights the secondary structure differences. In general, α-helix is represented as a type of screw, β-strands as arrows, and loops as lines.
Lines: each amino acid residue is represented by thin lines, which allows a low cost for graphic rendering.
Surface: in this visualization, the external shape of the molecule is shown.
Sticks: each covalent bond between amino acid atoms is represented as a stick. This type of visualization is most used to visualize interactions between amino acids...

DNA structure

The classic DNA duplex structure was initially described by Watson and Crick (with contributions from Rosalind Franklin). The DNA molecule is composed of three components: a phosphate group, a pentose, and a nitrogenous base (adenine, thymine, cytosine, or guanine). The DNA double helix structure is stabilized by hydrogen bonds formed between base pairs: adenine with thymine (A-T) and cytosine with guanine (C-G). Many structural bioinformatics studies have focused on understanding interactions between DNA and small molecules, which have been the target of several drug design studies.

Interactions

Interactions are contacts established between parts of molecules at different levels. They are responsible for stabilizing protein structures and perform a varied range of activities. In biochemistry, interactions are characterized by the proximity of atom groups or molecular regions that exert an effect upon one another, such as electrostatic forces, hydrogen bonding, and the hydrophobic effect. Proteins can perform several types of interactions, such as protein-protein interactions (PPI), protein-peptide interactions, protein-ligand interactions (PLI), and protein-DNA interactions.

Calculating contacts

Calculating contacts is an important task in structural bioinformatics. It underlies the correct prediction of protein structure and folding, thermodynamic stability, protein-protein and protein-ligand interactions, docking and molecular dynamics analyses, and so on.

Traditionally, computational methods have used a threshold distance between atoms (also called a cutoff) to detect possible interactions. This detection is performed based on the Euclidean distance and angles between atoms of given types. However, most of the methods based on simple Euclidean distance cannot detect occluded contacts. Hence, cutoff-free methods, such as Delaunay triangulation, have gained prominence in recent years. In addition, a combination of criteria, for example, physicochemical properties, distance, geometry, and angles, has been used to improve contact determination.
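
The cutoff-based approach can be illustrated with a minimal sketch. The function name `find_contacts`, the toy coordinates, and the default cutoff of 5 Å are assumptions chosen for illustration; the sketch uses distance only and ignores atom types and angles.

```python
import math
from itertools import combinations

def find_contacts(atoms, cutoff=5.0):
    """Return index pairs (i, j) of atoms whose Euclidean distance is <= cutoff (in Å).

    `atoms` is a list of (x, y, z) tuples. This is a plain all-pairs search,
    so it scales quadratically with the number of atoms.
    """
    contacts = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if math.dist(a, b) <= cutoff:
            contacts.append((i, j))
    return contacts

# Hypothetical toy coordinates in Å, not taken from any real structure.
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 9.0, 0.0)]
print(find_contacts(atoms, cutoff=5.0))
```

Because only distance is tested, two atoms separated by a third atom are still reported, which is the occlusion problem mentioned above.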

Protein Data Bank (PDB)

The Protein Data Bank (PDB) is a database of 3D structure data for large biological molecules, such as proteins, DNA, and RNA. PDB is managed by an international organization called the Worldwide Protein Data Bank (wwPDB), which is composed of several local organizations, such as PDBe, PDBj, RCSB, and BMRB. They are responsible for keeping copies of PDB data available on the internet at no charge. The number of structures available in the PDB has increased each year; they are typically obtained by X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy.

Data format

The PDB format (.pdb) is the legacy textual file format used by the Protein Data Bank to store information about the three-dimensional structures of macromolecules. Because of restrictions in the design of the format, it cannot represent large structures containing more than 62 chains or 99999 atom records.

The PDBx/mmCIF (macromolecular Crystallographic Information File) is a standard text file format for representing crystallographic information. Since 2014, the PDBx/mmCIF file format (.cif) has replaced the PDB format as the standard for PDB archive distribution. While the PDB format contains a set of records identified by a keyword of up to six characters, the PDBx/mmCIF format uses a structure based on keys and values, where the key is a name that identifies some feature and the value is the variable information.
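
As a rough illustration of the record-based legacy format, the sketch below reads one coordinate record by fixed column position. The function name `parse_atom_record` and the selection of fields are assumptions for illustration; the column positions follow the usual ATOM/HETATM layout of the legacy format, and real files contain cases (alternate locations, insertion codes, and so on) that this sketch ignores.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM line of the legacy PDB format.

    Fields are read by fixed column position (0-based Python slices).
    Returns None for any other record type.
    """
    record = line[0:6].strip()            # record keyword, up to six characters
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),        # atom serial number (columns 7-11)
        "name": line[12:16].strip(),      # atom name, e.g. "CA" (columns 13-16)
        "res_name": line[17:20].strip(),  # residue name (columns 18-20)
        "chain": line[21],                # chain identifier (column 22)
        "res_seq": int(line[22:26]),      # residue sequence number (columns 23-26)
        "x": float(line[30:38]),          # orthogonal coordinates in Å
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }
```

The five-column serial field and the single-character chain field are where the atom and chain limits mentioned above come from.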

Other structural databases

In addition to the Protein Data Bank (PDB), there are several databases of protein structures and other macromolecules. Examples include:

MMDB: Experimentally determined three-dimensional structures of biomolecules derived from Protein Data Bank (PDB).
Nucleic acid Data Base (NDB): Experimentally determined information about nucleic acids (DNA, RNA).
Structural Classification of Proteins (SCOP): Comprehensive description of the structural and evolutionary relationships between structurally known proteins.
TOPOFIT-DB: Protein structural alignments based on the TOPOFIT method.
Electron Density Server (EDS): Electron-density maps and statistics about the fit of crystal structures and their maps.
CASP Prediction Center: Community-wide, worldwide experiment for protein structure prediction (CASP).
PISCES server for creating non-redundant lists of proteins: Generates PDB list by sequence identity and structural quality criteria.
The Structural Biology Knowledgebase: Tools to aid in protein research design.
ProtCID (The Protein Common Interface Database): Database of similar protein-protein interfaces in crystal structures of homologous proteins.
AlphaFold: AlphaFold Protein Structure Database.

Structure comparison

Structural alignment

Structural alignment is a method for comparison between 3D structures based on their shape and conformation. It could be used to infer the evolutionary relationship among a set of proteins even with low sequence similarity. Structural alignment implies superimposing a 3D structure over a second one, rotating and translating atoms in corresponding positions (in general, using the C_α atoms or even the backbone heavy atoms C, N, O, and C_α). Usually, the alignment quality is evaluated based on the root-mean-square deviation (RMSD) of atomic positions, i.e., the average distance between atoms after superimposition:

\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} δ_{i}^{2}}

where δ_i is the distance between atom i and either a reference atom corresponding in the other structure or the mean coordinate of the N equivalent atoms. In general, the RMSD outcome is measured in Ångström (Å) unit, which is equivalent to 10⁻¹⁰ m. The nearer to zero the RMSD value, the more similar are the structures.
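
A minimal sketch of the formula above, assuming the two structures are already superimposed and their atoms are listed in corresponding order. The function name `rmsd` is chosen for illustration, and the superposition step itself is not shown.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) tuples.

    Atom i of `coords_a` is compared with atom i of `coords_b`;
    the result has the same unit as the input coordinates (typically Å).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

For example, shifting every atom of a structure by 1 Å along one axis gives an RMSD of exactly 1 Å, since every δ_i equals 1.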

Graph-based structural signatures

Structural signatures, also called fingerprints, are macromolecule pattern representations that can be used to infer similarities and differences. Comparing a large set of proteins using RMSD is still a challenge due to the high computational cost of structural alignments. Structural signatures based on graph distance patterns among atom pairs have been used to determine protein identifying vectors and to detect non-trivial information. Furthermore, linear algebra and machine learning can be used for clustering protein signatures, detecting protein-ligand interactions, predicting ΔΔG, and proposing mutations based on Euclidean distance.

Structure prediction

The atomic structures of molecules can be obtained by several methods, such as X-ray crystallography (XRC), NMR spectroscopy, and 3D electron microscopy; however, these processes can be costly, and some structures, such as those of membrane proteins, are hard to determine. Hence, it is necessary to use computational approaches for determining 3D structures of macromolecules. The structure prediction methods are classified into comparative modeling and de novo modeling.

Comparative modeling

Comparative modeling, also known as homology modeling, corresponds to the methodology to construct three-dimensional structures from an amino acid sequence of a target protein and a template with known structure. The literature has described that evolutionarily related proteins tend to present a conserved three-dimensional structure. In addition, sequences of distantly related proteins with identity lower than 20% can present different folds.

De novo modeling

In structural bioinformatics, de novo modeling, also known as ab initio modeling, refers to approaches for obtaining three-dimensional structures from sequences without the need for a homologous known 3D structure. Despite the new algorithms and methods proposed in recent years, de novo protein structure prediction is still considered one of the remaining outstanding issues in modern science.

Structure validation

After structure modeling, an additional step of structure validation is necessary, since many comparative and de novo modeling algorithms and tools use heuristics to try to assemble the 3D structure, which can generate many errors. Some validation strategies consist of calculating energy scores and comparing them with those of experimentally determined structures. For example, the DOPE score is an energy score used by the MODELLER tool for determining the best model.

Another validation strategy is calculating the φ and ψ backbone dihedral angles of all residues and constructing a Ramachandran plot. The side chains of amino acids and the nature of interactions in the backbone restrict these two angles, and thus the allowed conformations can be visualized on the Ramachandran plot. A large number of amino acids located in non-permissive regions of the chart is an indication of a low-quality model.
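
The dihedral angles used in a Ramachandran plot can be computed from four consecutive backbone atoms. The sketch below is one common way to do this with plain vector algebra; the function name `dihedral` and the helper names are assumptions for illustration. For φ of residue i the four atoms are C(i-1), N(i), Cα(i), C(i); for ψ they are N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points (x, y, z)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)              # central bond, the rotation axis
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

Computing one (φ, ψ) pair per residue and plotting ψ against φ gives the Ramachandran plot described above.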

Prediction tools

A list with commonly used software tools for protein structure prediction, including comparative modeling, protein threading, de novo protein structure prediction, and secondary structure prediction is available in the list of protein structure prediction software.

Molecular docking

Molecular docking (also referred to simply as docking) is a method used to predict the orientation coordinates of a molecule (ligand) when bound to another one (receptor or target). The binding is mostly through non-covalent interactions, although covalently linked binding can also be studied. Molecular docking aims to predict possible poses (binding modes) of the ligand when it interacts with specific regions on the receptor. Docking tools use force fields to estimate a score for ranking the poses that favor the best interactions between the two molecules.

In general, docking protocols are used to predict the interactions between small molecules and proteins. However, docking also can be used to detect associations and binding modes among proteins, peptides, DNA or RNA molecules, carbohydrates, and other macromolecules.

Virtual screening

Virtual screening (VS) is a computational approach used for fast screening of large compound libraries for drug discovery. Usually, virtual screening uses docking algorithms to rank small molecules with the highest affinity to a target receptor.

In recent times, several tools have been used to evaluate the use of virtual screening in the process of discovering new drugs. However, problems such as missing information, inaccurate understanding of drug-like molecular properties, weak scoring functions, or insufficient docking strategies hinder the docking process. Hence, the literature has described that it is still not considered a mature technology.

Molecular dynamics

Molecular dynamics (MD) is a computational method for simulating interactions between molecules and their atoms during a given period of time. This method allows the observation of the behavior of molecules and their interactions, considering the system as a whole. To calculate the behavior of the systems and, thus, determine the trajectories, an MD can use Newton's equation of motion, in addition to using molecular mechanics methods to estimate the forces that occur between particles (force fields).
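
The idea of integrating Newton's equation of motion with forces from a force field can be sketched for a single particle in one dimension. The velocity Verlet scheme and the harmonic "force field" used here are assumptions chosen for illustration; the text above does not prescribe a particular integrator or potential, and the function names are hypothetical.

```python
def harmonic_force(x, k=1.0):
    """Toy force field: a harmonic spring, F = -k * x."""
    return -k * x

def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Integrate Newton's equation of motion m * a = F(x) for one particle in 1D.

    Returns the trajectory as a list of (position, velocity) pairs,
    including the initial state.
    """
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory

# Usage: a unit-mass particle released at x = 1 oscillates around the origin.
trajectory = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01, n_steps=1000)
```

A real MD engine applies the same kind of update to every atom in three dimensions, with forces summed over bonded and non-bonded terms of the chosen force field.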

Applications

Informatics approaches used in structural bioinformatics are:

Selection of target - Potential targets are identified by comparing them with databases of known structures and sequences. The importance of a target can be decided on the basis of published literature. A target can also be selected on the basis of its protein domain. Protein domains are building blocks that can be rearranged to form new proteins. They can be studied in isolation initially.
Tracking X-ray crystallography trials - X-ray crystallography can be used to reveal the three-dimensional structure of a protein. However, in order to use X-rays for studying protein crystals, pure protein crystals must be formed, which can take many trials. This leads to a need for tracking the conditions and results of trials. Furthermore, supervised machine learning algorithms can be used on the stored data to identify conditions that might increase the yield of pure crystals.
Analysis of X-ray crystallographic data - The diffraction pattern obtained as a result of bombarding electrons with X-rays is the Fourier transform of the electron density distribution. There is a need for algorithms that can deconvolve the Fourier transform with partial information (due to missing phase information, as the detectors can only measure the amplitude of diffracted X-rays, and not the phase shifts). Extrapolation techniques such as multiwavelength anomalous dispersion can be used to generate an electron density map, which uses the location of selenium atoms as a reference to determine the rest of the structure. A standard ball-and-stick model is generated from the electron density map.
Analysis of NMR spectroscopy data - Nuclear magnetic resonance spectroscopy experiments produce two (or higher) dimensional data, with each peak corresponding to a chemical group within the sample. Optimization methods are used to convert spectra into three-dimensional structures.
Correlating structural information with functional information - Structural studies can be used as a probe for structure-function relationships.

Tools

Distance criteria for contact definition
Type	Maximum distance
Hydrogen bond	3.9 Å
Hydrophobic interaction	5 Å
Ionic interaction	6 Å
Aromatic stacking	6 Å

Software	Description
I-TASSER	Predicting three-dimensional structure model of protein molecules from amino acid sequences.
MOE	Molecular Operating Environment (MOE) is an extensive platform including structural modeling for proteins, protein families and antibodies
SBL	The Structural Bioinformatics Library: end-user applications and advanced algorithms
BALLView	Molecular modeling and visualization
STING	Visualization and analysis
PyMOL	Viewer and modeling
VMD	Viewer, molecular dynamics
KiNG	An open-source Java kinemage viewer
STRIDE	Determination of secondary structure from coordinates
DSSP	Algorithm assigning a secondary structure to the amino acids of a protein
MolProbity	Structure-validation web server
PROCHECK	A structure-validation web service
CheShift	A protein structure-validation on-line application
3D-mol.js	A molecular viewer for web applications developed using Javascript
PROPKA	Rapid prediction of protein pKa values based on empirical structure/function relationships
CARA	Computer Aided Resonance Assignment
Docking Server	A molecular docking web server
StarBiochem	A java protein viewer, features direct search of protein databank
SPADE	The structural proteomics application development environment
PocketSuite	A web portal for various web servers for binding site-level analysis. PocketSuite is divided into: PocketDepth (binding site prediction), PocketMatch (binding site comparison), PocketAlign (binding site alignment), and PocketAnnotate (binding site annotation).
MSL	An open-source C++ molecular modeling software library for the implementation of structural analysis, prediction and design methods
PSSpred	Protein secondary structure prediction
Proteus	Webtool for suggesting mutation pairs
SDM	A server for predicting effects of mutations on protein stability

Information field theory

From Wikipedia, the free encyclopedia

https://en.wikipedia.org/wiki/Information_field_theory

Information field theory (IFT) is a Bayesian statistical field theory relating to signal reconstruction, cosmography, and other related areas. IFT summarizes the information available on a physical field using Bayesian probabilities. It uses computational techniques developed for quantum field theory and statistical field theory to handle the infinite number of degrees of freedom of a field and to derive algorithms for the calculation of field expectation values. For example, the posterior expectation value of a field generated by a known Gaussian process and measured by a linear device with known Gaussian noise statistics is given by a generalized Wiener filter applied to the measured data. IFT extends such known filter formula to situations with nonlinear physics, nonlinear devices, non-Gaussian field or noise statistics, dependence of the noise statistics on the field values, and partly unknown parameters of measurement. For this it uses Feynman diagrams, renormalisation flow equations, and other methods from mathematical physics.

Motivation

Fields play an important role in science, technology, and economy. They describe the spatial variations of a quantity, like the air temperature, as a function of position. Knowing the configuration of a field can be of large value. Measurements of fields, however, can never provide the precise field configuration with certainty. Physical fields have an infinite number of degrees of freedom, but the data generated by any measurement device is always finite, providing only a finite number of constraints on the field. Thus, an unambiguous deduction of such a field from measurement data alone is impossible and only probabilistic inference remains as a means to make statements about the field. Fortunately, physical fields exhibit correlations and often follow known physical laws. Such information is best fused into the field inference in order to overcome the mismatch of field degrees of freedom to measurement points. To handle this, an information theory for fields is needed, and that is what information field theory is.

Concepts

Bayesian inference

$s (x)$ is a field value at a location $x \in Ω$ in a space $Ω$ . The prior knowledge about the unknown signal field $s$ is encoded in the probability distribution $P (s)$ . The data $d$ provides additional information on $s$ via the likelihood $P (d | s)$ that gets incorporated into the posterior probability

P (s | d) = \frac{P (d | s) P (s)}{P (d)}

according to Bayes theorem.

Information Hamiltonian

In IFT Bayes theorem is usually rewritten in the language of a statistical field theory,

P (s | d) = \frac{P (d, s)}{P (d)} \equiv \frac{e^{- H (d, s)}}{Z (d)},

with the information Hamiltonian defined as

H (d, s) \equiv - \ln P (d, s) = - \ln P (d | s) - \ln P (s) \equiv H (d | s) + H (s),

the negative logarithm of the joint probability of data and signal and with the partition function being

Z (d) \equiv P (d) = \int D s P (d, s) .

This reformulation of Bayes theorem permits the usage of methods of mathematical physics developed for the treatment of statistical field theories and quantum field theories.

Fields

As fields have an infinite number of degrees of freedom, the definition of probabilities over spaces of field configurations has subtleties. Identifying physical fields as elements of function spaces provides the problem that no Lebesgue measure is defined over the latter and therefore probability densities can not be defined there. However, physical fields have much more regularity than most elements of function spaces, as they are continuous and smooth at most of their locations. Therefore less general, but sufficiently flexible constructions can be used to handle the infinite number of degrees of freedom of a field.

A pragmatic approach is to regard the field to be discretized in terms of pixels. Each pixel carries a single field value that is assumed to be constant within the pixel volume. All statements about the continuous field have then to be cast into its pixel representation. This way, one deals with finite dimensional field spaces, over which probability densities are well definable.

In order for this description to be a proper field theory, it is further required that the pixel resolution $Δ x$ can always be refined, while expectation values of the discretized field $s_{Δ x}$ converge to finite values:

⟨ f (s) ⟩_{(s | d)} \equiv \lim_{Δ x \to 0} \int d s_{Δ x} f (s_{Δ x}) P (s_{Δ x}) .

Path integrals

If this limit exists, one can talk about the field configuration space integral or path integral

⟨ f (s) ⟩_{(s | d)} \equiv \int D s f (s) P (s) .

irrespective of the resolution at which it might be evaluated numerically.

Gaussian prior

The simplest prior for a field is that of a zero mean Gaussian probability distribution

P (s) = G (s, S) \equiv \frac{1}{\sqrt{| 2 π S |}} e^{- \frac{1}{2} s^{†} S^{- 1} s} .

The determinant in the denominator might be ill-defined in the continuum limit $Δ x \to 0$; however, all that is necessary for IFT to be consistent is that this determinant can be estimated for any finite resolution field representation with $Δ x > 0$ and that this permits the calculation of convergent expectation values.

A Gaussian probability distribution requires the specification of the field two point correlation function $S \equiv ⟨ s s^{†} ⟩_{(s)}$ with coefficients

S_{x y} \equiv ⟨ s (x) \bar{s (y)} ⟩_{(s)}

and a scalar product for continuous fields

a^{†} b \equiv \int_{Ω} d x \bar{a (x)} b (x),

with respect to which the inverse signal field covariance $S^{- 1}$ is constructed, i.e.

(S^{- 1} S)_{x y} \equiv \int_{Ω} d z (S^{- 1})_{x z} S_{z y} = 1_{x y} \equiv δ (x - y) .

The corresponding prior information Hamiltonian reads

H (s) = - \ln G (s, S) = \frac{1}{2} s^{†} S^{- 1} s + \frac{1}{2} \ln | 2 π S | .

Measurement equation

The measurement data $d$ was generated with the likelihood $P (d | s)$ . In case the instrument was linear, a measurement equation of the form

d = R s + n

can be given, in which $R$ is the instrument response, which describes how the data on average reacts to the signal, and $n$ is the noise, simply the difference between data $d$ and linear signal response $R s$. It is essential to note that the response translates the infinite dimensional signal vector into the finite dimensional data space. In components this reads

d_{i} = \int_{Ω} d x R_{i x} s_{x} + n_{i},

where a vector component notation was also introduced for signal and data vectors.

If the noise follows a signal independent zero mean Gaussian statistics with covariance $N$ , $P (n | s) = G (n, N),$ then the likelihood is Gaussian as well,

P (d | s) = G (d - R s, N),

and the likelihood information Hamiltonian is

H (d | s) = - \ln G (d - R s, N) = \frac{1}{2} (d - R s)^{†} N^{- 1} (d - R s) + \frac{1}{2} \ln | 2 π N | .

A linear measurement of a Gaussian signal, subject to Gaussian and signal-independent noise leads to a free IFT.

Free theory

Free Hamiltonian

The joint information Hamiltonian of the Gaussian scenario described above is

\begin{aligned}
H (d, s) & = H (d | s) + H (s) \\
& \hat{=} \frac{1}{2} (d - R s)^{†} N^{- 1} (d - R s) + \frac{1}{2} s^{†} S^{- 1} s \\
& \hat{=} \frac{1}{2} \left[ s^{†} \underbrace{(S^{- 1} + R^{†} N^{- 1} R)}_{D^{- 1}} s - s^{†} \underbrace{R^{†} N^{- 1} d}_{j} - \underbrace{d^{†} N^{- 1} R}_{j^{†}} s \right] \\
& \equiv \frac{1}{2} \left[ s^{†} D^{- 1} s - s^{†} j - j^{†} s \right] \\
& = \frac{1}{2} \left[ s^{†} D^{- 1} s - s^{†} D^{- 1} \underbrace{D j}_{m} - \underbrace{j^{†} D}_{m^{†}} D^{- 1} s \right] \\
& \hat{=} \frac{1}{2} (s - m)^{†} D^{- 1} (s - m),
\end{aligned}

where $\hat{=}$ denotes equality up to irrelevant constants, which, in this case, means expressions that are independent of $s$. From this it is clear that the posterior must be a Gaussian with mean $m$ and variance $D$,

P (s | d) \propto e^{- H (d, s)} \propto e^{- \frac{1}{2} (s - m)^{†} D^{- 1} (s - m)} \propto G (s - m, D)

where equality between the right and left hand sides holds as both distributions are normalized,

\int D s P (s | d) = 1 = \int D s G (s - m, D)

Generalized Wiener filter

The posterior mean

m = D j = (S^{- 1} + R^{†} N^{- 1} R)^{- 1} R^{†} N^{- 1} d

is also known as the generalized Wiener filter solution and the uncertainty covariance

D = (S^{- 1} + R^{†} N^{- 1} R)^{- 1}

as the Wiener variance.

In IFT, $j = R^{†} N^{- 1} d$ is called the information source, as it acts as a source term to excite the field (knowledge), and $D$ the information propagator, as it propagates information from one location to another in

m_{x} = \int_{Ω} d y D_{x y} j_{y} .
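
A minimal numerical sketch of the generalized Wiener filter, under strong simplifying assumptions that are not part of the text above: the field is pixelized, the response $R$ is the identity, and both $S$ and $N$ are diagonal, so that every pixel can be treated independently. The function name `wiener_filter_diagonal` is hypothetical.

```python
def wiener_filter_diagonal(data, signal_var, noise_var):
    """Posterior mean m = D j and variance D for R = identity, diagonal S and N.

    `data`, `signal_var`, and `noise_var` are equally long lists holding,
    per pixel, the datum d, the prior variance S, and the noise variance N.
    """
    mean, variance = [], []
    for d, s, n in zip(data, signal_var, noise_var):
        D = 1.0 / (1.0 / s + 1.0 / n)   # D = (S^-1 + R† N^-1 R)^-1 with R = 1
        j = d / n                       # information source j = R† N^-1 d
        mean.append(D * j)              # posterior mean m = D j
        variance.append(D)
    return mean, variance

# Usage: a larger prior variance moves the estimate closer to the datum.
m, D = wiener_filter_diagonal([2.0, 2.0], [1.0, 3.0], [1.0, 1.0])
```

In this diagonal case the filter reduces to the familiar weighting $m = \frac{S}{S + N} d$ per pixel; with a non-trivial response or correlated fields, $D$ couples different locations and a full matrix inversion or an iterative solver is needed.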

Interacting theory

Interacting Hamiltonian

If any of the assumptions that lead to the free theory is violated, IFT becomes an interacting theory, with terms that are of higher than quadratic order in the signal field. This happens when the signal or the noise are not following Gaussian statistics, when the response is non-linear, when the noise depends on the signal, or when response or covariances are uncertain.

In this case, the information Hamiltonian might be expandable in a Taylor-Fréchet series,

H (d, s) = \underbrace{\frac{1}{2} s^{†} D^{- 1} s - j^{†} s + H_{0}}_{= H_{free} (d, s)} + \underbrace{\sum_{n = 3}^{\infty} \frac{1}{n!} Λ_{x_{1} \ldots x_{n}}^{(n)} s_{x_{1}} \ldots s_{x_{n}}}_{= H_{int} (d, s)},

where $H_{free} (d, s)$ is the free Hamiltonian, which alone would lead to a Gaussian posterior, and $H_{int} (d, s)$ is the interacting Hamiltonian, which encodes non-Gaussian corrections. The first and second order Taylor coefficients are often identified with the (negative) information source $- j$ and information propagator $D$, respectively. The higher coefficients $Λ_{x_{1} \ldots x_{n}}^{(n)}$ are associated with non-linear self-interactions.

Classical field

The classical field $s_{cl}$ minimizes the information Hamiltonian,

\left. \frac{\partial H (d, s)}{\partial s} \right|_{s = s_{cl}} = 0,

and therefore maximizes the posterior:

\left. \frac{\partial P (s | d)}{\partial s} \right|_{s = s_{cl}} = \left. \frac{\partial}{\partial s} \frac{e^{- H (d, s)}}{Z (d)} \right|_{s = s_{cl}} = - P (s_{cl} | d) \underbrace{\left. \frac{\partial H (d, s)}{\partial s} \right|_{s = s_{cl}}}_{= 0} = 0

The classical field $s_{cl}$ is therefore the maximum a posteriori estimator of the field inference problem.

Critical filter

The Wiener filter problem requires the two point correlation $S \equiv ⟨ s s^{†} ⟩_{(s)}$ of a field to be known. If it is unknown, it has to be inferred along with the field itself. This requires the specification of a hyperprior $P (S)$ . Often, statistical homogeneity (translation invariance) can be assumed, implying that $S$ is diagonal in Fourier space (for $Ω = R^{u}$ being a $u$ dimensional Cartesian space). In this case, only the Fourier space power spectrum $P_{s} (\vec{k})$ needs to be inferred. Given a further assumption of statistical isotropy, this spectrum depends only on the length $k = | \vec{k} |$ of the Fourier vector $\vec{k}$ and only a one dimensional spectrum $P_{s} (k)$ has to be determined. The prior field covariance reads then in Fourier space coordinates $S_{\vec{k} \vec{q}} = (2 π)^{u} δ (\vec{k} - \vec{q}) P_{s} (k)$ .

If the prior on $P_{s} (k)$ is flat, the joint probability of data and spectrum is

\begin{aligned}
P (d, P_{s}) & = \int D s P (d, s, P_{s}) \\
& = \int D s P (d | s, P_{s}) P (s | P_{s}) P (P_{s}) \\
& \propto \int D s G (d - R s, N) G (s, S) \\
& \propto \frac{1}{| S |^{\frac{1}{2}}} \int D s \exp \left[ - \frac{1}{2} (s^{†} D^{- 1} s - j^{†} s - s^{†} j) \right] \\
& \propto \frac{| D |^{\frac{1}{2}}}{| S |^{\frac{1}{2}}} \exp \left[ \frac{1}{2} j^{†} D j \right],
\end{aligned}

where the notation of the information propagator $D = (S^{- 1} + R^{†} N^{- 1} R)^{- 1}$ and source $j = R^{†} N^{- 1} d$ of the Wiener filter problem was used again. The corresponding information Hamiltonian is

H (d, P_{s}) \hat{=} \frac{1}{2} [\ln | S D^{- 1} | - j^{†} D j] = \frac{1}{2} \mathrm{Tr} [\ln (S D^{- 1}) - j j^{†} D],

where $\hat{=}$ denotes equality up to irrelevant constants (here: constant with respect to $P_{s}$). Minimizing this with respect to $P_{s}$, in order to get its maximum a posteriori power spectrum estimator, yields

\begin{aligned}
\frac{\partial H (d, P_{s})}{\partial P_{s} (k)} & = \frac{1}{2} \mathrm{Tr} \left[ D S^{- 1} \frac{\partial (S D^{- 1})}{\partial P_{s} (k)} - j j^{†} \frac{\partial D}{\partial P_{s} (k)} \right] \\
& = \frac{1}{2} \mathrm{Tr} \left[ D S^{- 1} \frac{\partial (1 + S R^{†} N^{- 1} R)}{\partial P_{s} (k)} + j j^{†} D \frac{\partial D^{- 1}}{\partial P_{s} (k)} D \right] \\
& = \frac{1}{2} \mathrm{Tr} \left[ D S^{- 1} \frac{\partial S}{\partial P_{s} (k)} R^{†} N^{- 1} R + m m^{†} \frac{\partial S^{- 1}}{\partial P_{s} (k)} \right] \\
& = \frac{1}{2} \mathrm{Tr} \left[ \left( R^{†} N^{- 1} R D S^{- 1} - S^{- 1} m m^{†} S^{- 1} \right) \frac{\partial S}{\partial P_{s} (k)} \right] \\
& = \frac{1}{2} \int \left( \frac{d q}{2 π} \right)^{u} \int \left( \frac{d q'}{2 π} \right)^{u} \left( (D^{- 1} - S^{- 1}) D S^{- 1} - S^{- 1} m m^{†} S^{- 1} \right)_{\vec{q} \vec{q}'} \frac{\partial (2 π)^{u} δ (\vec{q} - \vec{q}') P_{s} (q)}{\partial P_{s} (k)} \\
& = \frac{1}{2} \int \left( \frac{d q}{2 π} \right)^{u} \left( S^{- 1} - S^{- 1} D S^{- 1} - S^{- 1} m m^{†} S^{- 1} \right)_{\vec{q} \vec{q}} δ (k - q) \\
& = \frac{1}{2} \mathrm{Tr} \left\{ S^{- 1} \left[ S - (D + m m^{†}) \right] S^{- 1} P_{k} \right\} \\
& = \frac{\mathrm{Tr} [P_{k}]}{2 P_{s} (k)} - \frac{\mathrm{Tr} [(D + m m^{†}) P_{k}]}{2 [P_{s} (k)]^{2}} = 0,
\end{aligned}

where the Wiener filter mean $m = D j$ and the spectral band projector

(P_{k})_{\vec{q} {\vec{q}}^{'}} \equiv (2 π)^{u} δ (\vec{q} - {\vec{q}}^{'}) δ (| \vec{q} | - k)

were introduced. The latter commutes with $S^{- 1}$, since

(S^{- 1})_{\vec{k} \vec{q}} = (2 π)^{u} δ (\vec{k} - \vec{q}) [P_{s} (k)]^{- 1}

is diagonal in Fourier space. The maximum a posteriori estimator for the power spectrum is therefore

P_{s} (k) = \frac{\mathrm{Tr} [(m m^{†} + D) P_{k}]}{\mathrm{Tr} [P_{k}]} .

It has to be calculated iteratively, as $m = D j$ and $D = (S^{- 1} + R^{†} N^{- 1} R)^{- 1}$ both depend on $P_{s}$ themselves. In an empirical Bayes approach, the estimated $P_{s}$ would be taken as given. As a consequence, the posterior mean estimate for the signal field is the corresponding $m$ and its uncertainty the corresponding $D$ in the empirical Bayes approximation.
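The iteration can be sketched in a few lines of Python for a deliberately simple setting. The sketch assumes a one-dimensional periodic field, a unit response $R = 1$, and white noise $N = σ^{2} 1$, so that $D$ and $m$ are diagonal in Fourier space. The function names, the band layout, and the toy data are illustrative assumptions and not part of any IFT library.

```python
import numpy as np

def wiener_step(d_hat, p_modes, sigma2):
    """Wiener filter per Fourier mode, assuming R = 1 and N = sigma2 * 1."""
    # D = (S^-1 + N^-1)^-1, written so that it stays finite for P_s = 0.
    D = p_modes * sigma2 / (p_modes + sigma2)
    # m = D j with the source j = N^-1 d.
    m_hat = D * d_hat / sigma2
    return m_hat, D

def critical_filter(d, sigma2, bands, delta=1.0, n_iter=500):
    """Iterate P_s(k) = Tr[(m m^dagger + delta D) P_k] / Tr[P_k] band by band."""
    d_hat = np.fft.fft(d, norm="ortho")            # unitary transform
    n_bands = int(bands.max()) + 1
    band_size = np.bincount(bands, minlength=n_bands)   # Tr[P_k]
    p_band = np.ones(n_bands)                      # initial spectrum guess
    for _ in range(n_iter):
        m_hat, D = wiener_step(d_hat, p_band[bands], sigma2)
        numerator = np.bincount(
            bands, weights=np.abs(m_hat) ** 2 + delta * D, minlength=n_bands
        )
        p_band = numerator / band_size
    # Final reconstruction with the converged spectrum.
    m_hat, D = wiener_step(d_hat, p_band[bands], sigma2)
    m = np.fft.ifft(m_hat, norm="ortho").real
    return p_band, m, D

# Toy data (hypothetical): a single cosine mode on 8 pixels, no noise draw.
n = 8
x = np.arange(n)
d = 2.0 * np.cos(2.0 * np.pi * x / n)
bands = np.minimum(x, n - x)      # modes k and n - k share the band |k|
p_band, m, D = critical_filter(d, sigma2=1.0, bands=bands)
print(p_band)
```

In this setting, the $δ = 1$ fixed point of a band with mean data power above the noise level is that power minus $σ^{2}$, while bands below the noise level decay towards zero, although only slowly with the number of iterations.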

The resulting non-linear filter is called the critical filter. The generalization of the power spectrum estimation formula as

P_{s} (k) = \frac{\mathrm{Tr} [(m m^{†} + δ D) P_{k}]}{\mathrm{Tr} [P_{k}]}

exhibits a perception threshold for $δ < 1$, meaning that the data variance in a Fourier band has to exceed the expected noise level by a certain threshold before the signal reconstruction $m$ becomes non-zero for this band. Whenever the data variance exceeds this threshold slightly, the signal reconstruction jumps to a finite excitation level, similar to a first-order phase transition in thermodynamic systems. For the filter with $δ = 1$, perception of the signal starts continuously as soon as the data variance exceeds the noise level. The disappearance of the discontinuous perception at $δ = 1$ is similar to a thermodynamic system going through a critical point. Hence the name critical filter.
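The threshold can be made explicit in a simplified setting, worked out here as an illustration rather than taken from the text above. Assume a unit response $R = 1$ and white noise $N = σ^{2} 1$, and write $P$ for $P_{s} (k)$ and $a$ for the band-averaged data power (both symbols are introduced only for this sketch). Then $m = P d / (P + σ^{2})$ and $D = P σ^{2} / (P + σ^{2})$ within the band, and the fixed-point condition becomes:

```latex
\begin{aligned}
P & = \frac{P^{2} a}{(P + σ^{2})^{2}} + δ \frac{P σ^{2}}{P + σ^{2}} \\
\Rightarrow \quad 0 & = P^{2} - [a - (2 - δ) σ^{2}] P + (1 - δ) σ^{4} \quad (P \neq 0) \\
δ = 1 : \quad P & = a - σ^{2} \quad \text{for } a > σ^{2} \\
δ < 1 : \quad P & > 0 \text{ exists only if } a \geq (1 + \sqrt{1 - δ})^{2} σ^{2}, \text{ with } P = \sqrt{1 - δ} \, σ^{2} \text{ at the threshold.}
\end{aligned}
```

Under these assumptions, the non-zero solution for $δ = 1$ grows continuously from zero as $a$ passes $σ^{2}$, whereas for $δ < 1$ it first appears at a finite value, which is the jump described above.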

The critical filter, extensions thereof to non-linear measurements, and the inclusion of non-flat spectrum priors have permitted the application of IFT to real-world signal inference problems, for which the signal covariance is usually unknown a priori.

IFT application examples

The generalized Wiener filter, which emerges in free IFT, is widely used in signal processing. Algorithms explicitly based on IFT were derived for a number of applications. Many of them are implemented using the Numerical Information Field Theory (NIFTy) library.

D³PO is a code for Denoising, Deconvolving, and Decomposing Photon Observations. It reconstructs images from individual photon count events, taking into account the Poisson statistics of the counts and an instrument response function. It splits the sky emission into an image of diffuse emission and one of point sources, exploiting the different correlation structure and statistics of the two components for their separation. D³PO has been applied to data from the Fermi and the RXTE satellites.
RESOLVE is a Bayesian algorithm for aperture synthesis imaging in radio astronomy. RESOLVE is similar to D³PO, but it assumes a Gaussian likelihood and a Fourier space response function. It has been applied to data from the Very Large Array.
PySESA is a Python framework for Spatially Explicit Spectral Analysis of point clouds and geospatial data.

Advanced theory

Many techniques from quantum field theory, such as Feynman diagrams, effective actions, and the field operator formalism, can be used to tackle IFT problems.

Feynman diagrams

In case the interaction coefficients $Λ^{(n)}$ in a Taylor-Fréchet expansion of the information Hamiltonian

H (d, s) = \underbrace{\frac{1}{2} s^{†} D^{- 1} s - j^{†} s + H_{0}}_{= H_{free} (d, s)} + \underbrace{\sum_{n = 3}^{\infty} \frac{1}{n!} Λ_{x_{1} \dots x_{n}}^{(n)} s_{x_{1}} \dots s_{x_{n}}}_{= H_{int} (d, s)},

are small, the log partition function, or negative Helmholtz free energy,

\ln Z (d) = \ln \int D s e^{- H (d, s)} = \sum_{c \in C} c

can be expanded asymptotically in terms of these coefficients. The free Hamiltonian specifies the mean $m = D j$ and variance $D$ of the Gaussian distribution $G (s - m, D)$ over which the expansion is integrated. This leads to a sum over the set $C$ of all connected Feynman diagrams. From the log partition function, any connected moment of the field can be calculated via

⟨ s_{x_{1}} \dots s_{x_{n}} ⟩_{(s | d)}^{c} = \frac{\partial^{n} \ln Z}{\partial j_{x_{1}} \dots \partial j_{x_{n}}} .

The small expansion parameters needed for such a diagrammatic expansion to converge exist for nearly Gaussian signal fields, where the weak non-Gaussianity of the field statistics leads to small interaction coefficients $Λ^{(n)}$. For example, the statistics of the Cosmic Microwave Background are nearly Gaussian, with small non-Gaussianities believed to have been seeded during the inflationary epoch in the early Universe.
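The relation between derivatives of $\ln Z$ with respect to the source $j$ and the connected moments can be checked numerically on a toy problem. The following sketch assumes a single real field value $s$ with a quartic Hamiltonian; the coupling value, the grid, and all function names are hypothetical choices made for illustration. No diagrams are evaluated; the integrals are simply approximated on a grid.

```python
import numpy as np

# Hypothetical toy model: H(s) = s^2 / (2 D) - j s + lam * s^4 / 24.
GRID = np.linspace(-12.0, 12.0, 48001)
STEP = GRID[1] - GRID[0]

def hamiltonian(s, j, D=1.0, lam=0.5):
    return s**2 / (2.0 * D) - j * s + lam * s**4 / 24.0

def log_partition(j):
    """ln Z(j), with the integral over s approximated by a Riemann sum."""
    return np.log(np.sum(np.exp(-hamiltonian(GRID, j))) * STEP)

def posterior_mean_and_variance(j):
    """First two connected moments, computed directly from the posterior."""
    weights = np.exp(-hamiltonian(GRID, j))
    weights /= weights.sum()
    mean = np.sum(weights * GRID)
    variance = np.sum(weights * (GRID - mean) ** 2)
    return mean, variance

j0, h = 0.8, 1e-3
# Central finite differences of ln Z with respect to the source j.
first_derivative = (log_partition(j0 + h) - log_partition(j0 - h)) / (2.0 * h)
second_derivative = (
    log_partition(j0 + h) - 2.0 * log_partition(j0) + log_partition(j0 - h)
) / h**2

mean, variance = posterior_mean_and_variance(j0)
print(first_derivative, mean)        # first connected moment
print(second_derivative, variance)   # second connected moment
```

The two printed pairs should agree up to the finite-difference error, since $\ln Z$ acts as the generating function of the connected moments.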

Effective action

To obtain numerically stable schemes for IFT problems, a field functional is needed whose minimum provides the posterior mean field. Such a functional is given by the effective action, or Gibbs free energy, of a field. The Gibbs free energy $G$ can be constructed from the Helmholtz free energy via a Legendre transformation. In IFT, it is given by the difference of the internal information energy

U = ⟨ H (d, s) ⟩_{P^{'} (s | d^{'})}

and the Shannon entropy

S = - \int D s P^{'} (s | d^{'}) \ln P^{'} (s | d^{'})

for temperature $T = 1$, where a Gaussian posterior approximation $P^{'} (s | d^{'}) = G (s - m, D)$ is used with the approximate data $d^{'} = (m, D)$ containing the mean and the dispersion of the field.

The Gibbs free energy is then

\begin{aligned}
G (m, D) & = U (m, D) - T S (m, D) \\
& = ⟨ H (d, s) + \ln P^{'} (s | d^{'}) ⟩_{P^{'} (s | d^{'})} \\
& = \int D s P^{'} (s | d^{'}) \ln \frac{P^{'} (s | d^{'})}{P (d, s)} \\
& = \int D s P^{'} (s | d^{'}) \ln \frac{P^{'} (s | d^{'})}{P (s | d) P (d)} \\
& = \int D s P^{'} (s | d^{'}) \ln \frac{P^{'} (s | d^{'})}{P (s | d)} - \ln P (d) \\
& = KL (P^{'} (s | d^{'}) \| P (s | d)) - \ln Z (d),
\end{aligned}

the Kullback-Leibler divergence $KL (P^{'}, P)$ between the approximate and the exact posterior plus the Helmholtz free energy. As the latter does not depend on the approximate data $d^{'} = (m, D)$, minimizing the Gibbs free energy is equivalent to minimizing the Kullback-Leibler divergence between approximate and exact posterior. Thus, the effective action approach of IFT is equivalent to variational Bayesian methods, which also minimize the Kullback-Leibler divergence between approximate and exact posteriors.

Minimizing the Gibbs free energy approximately provides the posterior mean field

⟨ s ⟩_{(s | d)} = \int D s s P (s | d),

whereas minimizing the information Hamiltonian provides the maximum a posteriori field. As the latter is known to over-fit noise, the former is usually a better field estimator.
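The difference between the two estimators can be illustrated with a one-dimensional toy problem. The sketch below assumes a single real field value with the hypothetical Hamiltonian $H (s) = \frac{1}{2} s^{2} - j s + \frac{λ}{24} s^{4}$, for which the Gaussian averages in $U (m, D)$ are known in closed form. The constants, the fixed-point iteration, and the function names are illustrative assumptions, not a prescribed IFT algorithm.

```python
import numpy as np

J, LAM = 1.0, 1.0   # hypothetical source and quartic coupling

def hamiltonian(s):
    return 0.5 * s**2 - J * s + LAM * s**4 / 24.0

def gibbs_free_energy(m, D):
    """G = U - S at T = 1 for the Gaussian approximation G(s - m, D)."""
    # Gaussian moments: <s^2> = m^2 + D, <s^4> = m^4 + 6 m^2 D + 3 D^2.
    internal_energy = (0.5 * (m**2 + D) - J * m
                       + LAM * (m**4 + 6.0 * m**2 * D + 3.0 * D**2) / 24.0)
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * D)
    return internal_energy - entropy

def gibbs_gradient(m, D):
    dG_dm = m - J + LAM * (m**3 + 3.0 * m * D) / 6.0
    dG_dD = 0.5 + LAM * (m**2 + D) / 4.0 - 0.5 / D
    return dG_dm, dG_dD

def minimize_gibbs(n_iter=200):
    """Fixed-point iteration on the stationarity conditions of G."""
    m, D = J, 1.0
    for _ in range(n_iter):
        m, D = (J / (1.0 + LAM * (m**2 + 3.0 * D) / 6.0),
                1.0 / (1.0 + LAM * (m**2 + D) / 2.0))
    return m, D

def map_estimate(n_iter=50):
    """Newton iteration for the minimum of the information Hamiltonian."""
    s = J
    for _ in range(n_iter):
        s -= (s + LAM * s**3 / 6.0 - J) / (1.0 + LAM * s**2 / 2.0)
    return s

# Reference values from a brute-force grid evaluation of the posterior.
grid = np.linspace(-12.0, 12.0, 48001)
weights = np.exp(-hamiltonian(grid))
log_Z = np.log(weights.sum() * (grid[1] - grid[0]))
exact_mean = np.sum(weights * grid) / weights.sum()

m_gibbs, D_gibbs = minimize_gibbs()
s_map = map_estimate()
print("Gibbs mean:", m_gibbs, "MAP:", s_map, "exact mean:", exact_mean)
print("KL estimate:", gibbs_free_energy(m_gibbs, D_gibbs) + log_Z)
```

The last line uses $G = KL - \ln Z$, so adding the numerically computed $\ln Z$ to the minimal Gibbs free energy gives the remaining Kullback-Leibler divergence of the Gaussian approximation.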

Operator formalism

The calculation of the Gibbs free energy requires the calculation of Gaussian integrals over an information Hamiltonian, since the internal information energy is

U (m, D) = ⟨ H (d, s) ⟩_{P^{'} (s | d^{'})} = \int D s H (d, s) G (s - m, D) .

Such integrals can be calculated via a field operator formalism, in which

O_{m} = m + D \frac{d}{d m}

is the field operator. This generates the field expression $s$ within the integral if applied to the Gaussian distribution function,

\begin{aligned}
O_{m} G (s - m, D) & = (m + D \frac{d}{d m}) \frac{1}{| 2 π D |^{\frac{1}{2}}} \exp [- \frac{1}{2} (s - m)^{†} D^{- 1} (s - m)] \\
& = (m + D D^{- 1} (s - m)) \frac{1}{| 2 π D |^{\frac{1}{2}}} \exp [- \frac{1}{2} (s - m)^{†} D^{- 1} (s - m)] \\
& = s G (s - m, D),
\end{aligned}

and any higher power of the field if applied several times,

\begin{aligned} (O_{m})^{n} G (s - m, D) & = s^{n} G (s - m, D) . \end{aligned}

If the information Hamiltonian is analytic, all its terms can be generated via the field operator,

H (d, O_{m}) G (s - m, D) = H (d, s) G (s - m, D) .

As the field operator does not depend on the field $s$ itself, it can be pulled out of the path integral of the internal information energy construction,

U (m, D) = \int D s H (d, O_{m}) G (s - m, D) = H (d, O_{m}) \int D s G (s - m, D) = H (d, O_{m}) 1_{m},

where $1_{m} = 1$ should be regarded as a functional that always returns the value $1$ irrespective of the value of its input $m$. The resulting expression can be calculated by commuting the mean field annihilator $D \frac{d}{d m}$ to the right of the expression, where it vanishes since $\frac{d}{d m} 1_{m} = 0$. The commutator of the mean field annihilator $D \frac{d}{d m}$ with the mean field is

[D \frac{d}{d m}, m] = D \frac{d}{d m} m - m D \frac{d}{d m} = D + m D \frac{d}{d m} - m D \frac{d}{d m} = D .

Using the field operator formalism, the Gibbs free energy can be calculated, which permits the (approximate) inference of the posterior mean field via a numerically robust functional minimization.
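For a single real field value, the action of the field operator can be checked with polynomial arithmetic. The sketch below assumes the scalar case, where $m$ and $D$ are plain numbers, applies $O_{m} = m + D \frac{d}{d m}$ repeatedly to the constant functional $1_{m}$, and compares the result with a Gauss-Hermite quadrature of $\int d s \, s^{n} G (s - m, D)$. The function names are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial.hermite_e import hermegauss

def apply_field_operator(coeffs, D):
    """Apply O_m = m + D d/dm to a polynomial in m (coefficients low to high)."""
    times_m = P.polymulx(coeffs)          # multiplication by m
    derivative = D * P.polyder(coeffs)    # D d/dm
    return P.polyadd(times_m, derivative)

def moment_via_operator(n, m, D):
    """Evaluate (O_m)^n 1_m at the given mean m."""
    coeffs = np.array([1.0])              # the constant functional 1_m
    for _ in range(n):
        coeffs = apply_field_operator(coeffs, D)
    return P.polyval(m, coeffs)

def moment_via_quadrature(n, m, D, order=40):
    """Gaussian average of s^n under G(s - m, D) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(order)    # weight function exp(-x^2 / 2)
    s = m + np.sqrt(D) * nodes
    return np.sum(weights * s**n) / np.sqrt(2.0 * np.pi)

m, D = 0.7, 0.3
for n in range(1, 5):
    print(n, moment_via_operator(n, m, D), moment_via_quadrature(n, m, D))
```

For $n = 4$ the operator route yields the polynomial $m^{4} + 6 m^{2} D + 3 D^{2}$, the familiar fourth moment of a Gaussian, without performing any integral.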

History

The book of Norbert Wiener might be regarded as one of the first works on field inference. The usage of path integrals for field inference was proposed by a number of authors, e.g. Edmund Bertschinger or William Bialek and A. Zee. The connection of field theory and Bayesian reasoning was made explicit by Jörg Lemm. The term information field theory was coined by Torsten Enßlin. See the latter reference for more information on the history of IFT.


Romance (love)