
Thursday, November 15, 2018

Bayesian programming

From Wikipedia, the free encyclopedia

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available. Bayes' theorem is the central concept behind this programming approach: it allows the probability of a hypothesis to be updated in the light of observed evidence, so that conclusions can be drawn even under incomplete and uncertain information.

Edwin T. Jaynes proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book Probability Theory: The Logic of Science he developed this theory and proposed what he called “the robot,” which was not a physical device, but an inference engine to automate probabilistic reasoning—a kind of Prolog for probability instead of logic. Bayesian programming is a formal and concrete implementation of this "robot".

Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian Programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs.

Formalism

A Bayesian program is a means of specifying a family of probability distributions.

The constituent elements of a Bayesian program are presented below:
  1. A program is constructed from a description and a question.
  2. A description is constructed using some specification (π) as given by the programmer and an identification or learning process for the parameters not completely specified by the specification, using a data set (δ).
  3. A specification is constructed from a set of pertinent variables, a decomposition and a set of forms.
  4. Forms are either parametric forms or questions to other Bayesian programs.
  5. A question specifies which probability distribution has to be computed.

Description

The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables {X1, X2, …, XN} given a set of experimental data δ and some specification π. This joint distribution is denoted as: P(X1 ∧ X2 ∧ ⋯ ∧ XN | δ ∧ π).

To specify preliminary knowledge π, the programmer must undertake the following:
  1. Define the set of relevant variables on which the joint distribution is defined.
  2. Decompose the joint distribution (break it into relevant independent or conditional probabilities).
  3. Define the forms of each of the distributions (e.g., for each variable, one of the list of probability distributions).

Decomposition

Given a partition of {X1, …, XN} containing K subsets, K variables L1, …, LK are defined, each corresponding to one of these subsets. Each variable Lk is obtained as the conjunction of the variables belonging to the kth subset. Recursive application of Bayes' theorem leads to:

P(X1 ∧ ⋯ ∧ XN | δ ∧ π) = P(L1 | δ ∧ π) × P(L2 | L1 ∧ δ ∧ π) × ⋯ × P(LK | LK−1 ∧ ⋯ ∧ L1 ∧ δ ∧ π)

Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable Lk is defined by choosing some variables Xn among the variables appearing in the conjunction Lk−1 ∧ ⋯ ∧ L1, labelling Rk as the conjunction of these chosen variables and setting:

P(Lk | Lk−1 ∧ ⋯ ∧ L1 ∧ δ ∧ π) = P(Lk | Rk ∧ δ ∧ π)

We then obtain:

P(X1 ∧ ⋯ ∧ XN | δ ∧ π) = P(L1 | δ ∧ π) × P(L2 | R2 ∧ δ ∧ π) × ⋯ × P(LK | RK ∧ δ ∧ π)

Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule.

This ensures that each variable appears at most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions.

Forms

Each distribution P(Lk | Rk ∧ δ ∧ π) appearing in the product is then associated with either a parametric form (i.e., a function fμ(Lk)) or a question to another Bayesian program.

When it is a form fμ(Lk), in general, μ is a vector of parameters that may depend on Rk or δ or both. Learning takes place when some of these parameters are computed using the data set δ.

An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. P(Lk | Rk ∧ δ ∧ π) is then obtained by inferences done by another Bayesian program defined by its own specification and data. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.

Question

Given a description (i.e., P(X1 ∧ ⋯ ∧ XN | δ ∧ π)), a question is obtained by partitioning {X1, …, XN} into three sets: the searched variables, the known variables and the free variables.

The three variables Searched, Known and Free are defined as the conjunction of the variables belonging to these sets.

A question is defined as the set of distributions:

P(Searched | Known ∧ δ ∧ π)

made of as many "instantiated questions" as the cardinal of Known, each instantiated question being the distribution:

P(Searched | known ∧ δ ∧ π)

where known is a particular value of Known.

Inference

Given the joint distribution P(X1 ∧ ⋯ ∧ XN | δ ∧ π), it is always possible to compute any possible question using the following general inference:

P(Searched | known ∧ δ ∧ π)
= Σ_Free P(Searched ∧ Free | known ∧ δ ∧ π)
= [Σ_Free P(Searched ∧ Free ∧ known | δ ∧ π)] / P(known | δ ∧ π)
= [Σ_Free P(Searched ∧ Free ∧ known | δ ∧ π)] / [Σ_{Free ∧ Searched} P(Searched ∧ Free ∧ known | δ ∧ π)]

where the first equality results from the marginalization rule, the second results from Bayes' theorem and the third corresponds to a second application of marginalization. The denominator is a normalization term and can be replaced by a constant 1/Z.

Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing P(Searched | known ∧ δ ∧ π) exhaustively and exactly is too great in almost all cases.

Replacing the joint distribution by its decomposition we get:

P(Searched | known ∧ δ ∧ π) = (1/Z) × Σ_Free ∏_k P(Lk | Rk ∧ δ ∧ π)

which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower-dimension distributions.
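The general inference rule above can be sketched in a few lines of Python. This is an illustrative brute-force implementation under simplifying assumptions: the joint distribution is given extensionally as a dictionary over boolean assignments, and the free variables are summed out before normalizing. The function name `infer` and the data layout are mine, not part of any standard library.

```python
def infer(joint, searched, known):
    """Answer P(Searched | known) by brute-force marginalization.

    joint    : dict mapping full assignments (tuples of bools) to probabilities
    searched : list of indices of the searched variables
    known    : dict {index: value} fixing the known variables
    Free variables are everything else; they are summed out.
    """
    scores = {}
    for assignment, p in joint.items():
        # Keep only assignments consistent with the known values.
        if all(assignment[i] == v for i, v in known.items()):
            # Marginalize: accumulate probability per searched-variable value.
            key = tuple(assignment[i] for i in searched)
            scores[key] = scores.get(key, 0.0) + p
    z = sum(scores.values())  # the normalization constant Z
    return {k: v / z for k, v in scores.items()}
```

For a two-variable joint over (A, B), `infer(joint, searched=[0], known={1: True})` returns P(A | B = true). The exponential cost of this enumeration is exactly why the decomposition into lower-dimension factors matters in practice.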

Example

Bayesian spam detection

The purpose of Bayesian spam filtering is to eliminate junk e-mails.

The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words.
Using these words without taking the order into account is commonly called a bag of words model.

The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters whenever the user disagrees with its decision. It will hence adapt to the user's criteria to differentiate between non-spam and spam. It will improve its results as it encounters more and more classified e-mails.

Variables

The variables necessary to write this program are as follows:
  1. Spam: a binary variable, false if the e-mail is not spam and true otherwise.
  2. W0, W1, …, WN−1: N binary variables. Wn is true if the nth word of the dictionary is present in the text.
These N + 1 binary variables sum up all the information about an e-mail.

Decomposition

Starting from the joint distribution and applying Bayes' theorem recursively, we obtain:

P(Spam ∧ W0 ∧ ⋯ ∧ WN−1) = P(Spam) × P(W0 | Spam) × P(W1 | W0 ∧ Spam) × ⋯ × P(WN−1 | WN−2 ∧ ⋯ ∧ W0 ∧ Spam)
This is an exact mathematical expression.

It can be drastically simplified by assuming that the probability of appearance of a word, knowing the nature of the text (spam or not), is independent of the appearance of the other words. For instance, the programmer can assume that:

P(Wn | Wn−1 ∧ ⋯ ∧ W0 ∧ Spam) = P(Wn | Spam)

to finally obtain:

P(Spam ∧ W0 ∧ ⋯ ∧ WN−1) = P(Spam) × ∏_{n=0}^{N−1} P(Wn | Spam)

This kind of assumption is known as the naive Bayes assumption, and it makes this spam filter a naive Bayes model. It is "naive" in the sense that the independence between words is clearly not completely true. For instance, it completely neglects that the appearance of pairs of words may be more significant than isolated appearances. However, the programmer may assume this hypothesis and may develop the model and the associated inferences to test how reliable and efficient it is.

Parametric forms

To be able to compute the joint distribution, the programmer must now specify the distributions appearing in the decomposition:
  1. P(Spam) is a prior defined, for instance, by P([Spam = true]) = 0.25 and P([Spam = false]) = 0.75.
  2. Each of the N forms P(Wn | Spam) may be specified using the Laplace rule of succession (a pseudocount-based smoothing technique to counter the zero-frequency problem of words never seen before):

P([Wn = true] | [Spam = false]) = (1 + a_f^n) / (2 + a_f)
P([Wn = true] | [Spam = true]) = (1 + a_t^n) / (2 + a_t)

where a_f^n stands for the number of appearances of the nth word in non-spam e-mails and a_f stands for the total number of non-spam e-mails. Similarly, a_t^n stands for the number of appearances of the nth word in spam e-mails and a_t stands for the total number of spam e-mails.

Identification

The N forms P(Wn | Spam) are not yet completely specified because the parameters a_f^n, a_f, a_t^n and a_t have no values yet.

The identification of these parameters could be done either by batch processing a series of classified e-mails or by an incremental updating of the parameters using the user's classifications of the e-mails as they arrive.

Both methods could be combined: the system could start with initial standard values of these parameters issued from a generic database, then some incremental learning customizes the classifier to each individual user.

Question

The question asked to the program is: "what is the probability for a given text to be spam, knowing which words appear and don't appear in this text?" It can be formalized by:

P(Spam | w0 ∧ ⋯ ∧ wN−1)

which can be computed as follows:

P(Spam | w0 ∧ ⋯ ∧ wN−1) = [P(Spam) × ∏_{n=0}^{N−1} P(wn | Spam)] / [Σ_Spam P(Spam) × ∏_{n=0}^{N−1} P(wn | Spam)]

The denominator is a normalization constant. It is not necessary to compute it to decide if we are dealing with spam. For instance, an easy trick is to compute the ratio:

P([Spam = true] | w0 ∧ ⋯ ∧ wN−1) / P([Spam = false] | w0 ∧ ⋯ ∧ wN−1)
= [P([Spam = true]) / P([Spam = false])] × ∏_{n=0}^{N−1} [P(wn | [Spam = true]) / P(wn | [Spam = false])]

This computation is faster and easier because it requires only 2N products.
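The whole filter (variables, naive Bayes decomposition, Laplace forms, incremental identification and the spam question) can be sketched as follows. This is an illustrative implementation, not a reference one: the class name `SpamFilter`, the default prior of 0.25 and the log-space evaluation of the ratio trick are my choices.

```python
import math

class SpamFilter:
    """Naive Bayes spam filter with Laplace-rule-of-succession smoothing."""

    def __init__(self, n_words, prior_spam=0.25):
        self.n = n_words
        self.prior_spam = prior_spam  # P([Spam = true]), an assumed prior
        self.a_t = 0                  # total spam e-mails seen
        self.a_f = 0                  # total non-spam e-mails seen
        self.a_t_n = [0] * n_words    # appearances of word n in spam e-mails
        self.a_f_n = [0] * n_words    # appearances of word n in non-spam e-mails

    def learn(self, words_present, is_spam):
        """Incremental identification: update the counts from one classified e-mail."""
        if is_spam:
            self.a_t += 1
            for n, present in enumerate(words_present):
                self.a_t_n[n] += 1 if present else 0
        else:
            self.a_f += 1
            for n, present in enumerate(words_present):
                self.a_f_n[n] += 1 if present else 0

    def p_word(self, n, present, spam):
        """Laplace rule: P([Wn = true] | Spam) = (1 + count) / (2 + total)."""
        count, total = (self.a_t_n[n], self.a_t) if spam else (self.a_f_n[n], self.a_f)
        p_true = (1 + count) / (2 + total)
        return p_true if present else 1 - p_true

    def p_spam(self, words_present):
        """Answer the question P(Spam | w0 ... wN-1) via the ratio trick,
        computed in log space for numerical stability."""
        log_ratio = math.log(self.prior_spam) - math.log(1 - self.prior_spam)
        for n, present in enumerate(words_present):
            log_ratio += math.log(self.p_word(n, present, True))
            log_ratio -= math.log(self.p_word(n, present, False))
        return 1 / (1 + math.exp(-log_ratio))
```

Training on a handful of labeled e-mails (bit vectors over the dictionary) and then calling `p_spam` on a new bit vector returns the posterior spam probability; note the 2N products of the ratio trick appear here as 2N log additions.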

Bayesian program

The Bayesian spam filter program is completely defined by its description (the variables Spam and W0, …, WN−1, the naive Bayes decomposition, the parametric forms given above, and the identification of the parameters from the data set δ) together with the question P(Spam | w0 ∧ ⋯ ∧ wN−1).

Bayesian filter, Kalman filter and hidden Markov model

Bayesian filters (often called Recursive Bayesian estimation) are generic probabilistic models for time evolving processes. Numerous models are particular instances of this generic approach, for instance: the Kalman filter or the Hidden Markov model (HMM).

Variables

  • Variables S^0, …, S^T are a time series of state variables considered to be on a time horizon ranging from 0 to T.
  • Variables O^0, …, O^T are a time series of observation variables on the same horizon.

Decomposition

The decomposition is based:
  • on P(S^t | S^{t−1}), called the system model, transition model or dynamic model, which formalizes the transition from the state at time t − 1 to the state at time t;
  • on P(O^t | S^t), called the observation model, which expresses what can be observed at time t when the system is in state S^t;
  • on an initial state at time 0: P(S^0 ∧ O^0).

The joint distribution then decomposes as:

P(S^0 ∧ ⋯ ∧ S^T ∧ O^0 ∧ ⋯ ∧ O^T) = P(S^0 ∧ O^0) × ∏_{t=1}^{T} [P(S^t | S^{t−1}) × P(O^t | S^t)]

Parametrical forms

The parametrical forms are not constrained and different choices lead to different well-known models: see Kalman filters and Hidden Markov models just below.

Question

The typical question for such models is: what is the probability distribution for the state at time t + k, knowing the observations from instant 0 to t, i.e. P(S^{t+k} | O^0 ∧ ⋯ ∧ O^t)?
The most common case is Bayesian filtering, where k = 0: search for the present state, knowing past observations.

However it is also possible to do prediction (k > 0), to extrapolate a future state from past observations, or to do smoothing (k < 0), to recover a past state from observations made either before or after that instant.
More complicated questions may also be asked, as shown below in the HMM section.

Bayesian filters have a very interesting recursive property, which contributes greatly to their attractiveness. P(S^t | O^0 ∧ ⋯ ∧ O^t) may be computed simply from P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1}) with the following formula:

P(S^t | O^0 ∧ ⋯ ∧ O^t) ∝ P(O^t | S^t) × Σ_{S^{t−1}} [P(S^t | S^{t−1}) × P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1})]
Another interesting point of view for this equation is to consider that there are two phases: a prediction phase and an estimation phase:
  • During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:

    P(S^t | O^0 ∧ ⋯ ∧ O^{t−1}) = Σ_{S^{t−1}} [P(S^t | S^{t−1}) × P(S^{t−1} | O^0 ∧ ⋯ ∧ O^{t−1})]

  • During the estimation phase, the prediction is either confirmed or invalidated using the last observation:

    P(S^t | O^0 ∧ ⋯ ∧ O^t) ∝ P(O^t | S^t) × P(S^t | O^0 ∧ ⋯ ∧ O^{t−1})
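The two phases above can be sketched for a discrete state space, where the transition and observation models are probability matrices. This is a minimal illustration; the function name and matrix layout are my own conventions.

```python
def bayes_filter_step(belief, observation, transition, obs_model):
    """One recursive step of a discrete Bayesian filter.

    belief[i]        = P(S(t-1) = i | O(0) ... O(t-1))
    transition[i][j] = P(S(t) = j | S(t-1) = i)
    obs_model[j][o]  = P(O(t) = o | S(t) = j)
    Returns the updated belief P(S(t) | O(0) ... O(t)).
    """
    n = len(belief)
    # Prediction phase: push the previous belief through the dynamic model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Estimation phase: weight the prediction by the last observation.
    unnorm = [obs_model[j][observation] * predicted[j] for j in range(n)]
    z = sum(unnorm)  # normalization constant
    return [u / z for u in unnorm]
```

Calling this function once per time step, feeding each output belief back in as the next input, is exactly the recursion described above.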

Bayesian program

Kalman filter

The very well-known Kalman filters are a special case of Bayesian filters.

They are defined by the following Bayesian program:
  • Variables are continuous.
  • The transition model and the observation model are both specified using Gaussian laws with means that are linear functions of the conditioning variables.
With these hypotheses and by using the recursive formula, it is possible to solve the inference problem analytically to answer the usual question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their everyday applications.

When there are no obvious linear transition and observation models, it is still often possible, using a first-order Taylor's expansion, to treat these models as locally linear. This generalization is commonly called the extended Kalman filter.
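A scalar (one-dimensional) case makes the Gaussian hypotheses concrete. The sketch below is illustrative: the parameter names (a, c for the linear coefficients, q, r for the process and measurement noise variances) are conventional but chosen by me, and the default values are arbitrary.

```python
def kalman_step(mean, var, z, a=1.0, c=1.0, q=0.01, r=0.1):
    """One step of a scalar Kalman filter.

    Dynamic model:     S(t) = a * S(t-1) + noise,  noise ~ N(0, q)
    Observation model: O(t) = c * S(t)   + noise,  noise ~ N(0, r)
    (mean, var) is the Gaussian belief over the previous state;
    z is the new observation. Returns the updated (mean, var).
    """
    # Prediction phase: propagate the belief through the linear dynamic model.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Estimation phase: correct with the observation via the Kalman gain.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (z - c * mean_pred)
    var_new = (1 - gain * c) * var_pred
    return mean_new, var_new
```

Because everything stays Gaussian, the belief is fully captured by two numbers, which is what makes the analytical solution so efficient; the extended Kalman filter replaces a and c with the local derivatives of nonlinear models.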

Hidden Markov model

Hidden Markov models (HMMs) are another very popular specialization of Bayesian filters.

They are defined by the following Bayesian program:
  • Variables are treated as being discrete.
  • The transition model P(S^t | S^{t−1}) and the observation model P(O^t | S^t) are both specified using probability matrices.
  • The question most frequently asked of HMMs is: what is the most probable series of states that leads to the present state, knowing the past observations?

This particular question may be answered with a specific and very efficient algorithm called the Viterbi algorithm.
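A compact sketch of the Viterbi algorithm for the discrete HMM above follows. It is illustrative rather than optimized (a production version would work in log space to avoid underflow on long sequences); the argument layout mirrors the probability matrices of the model.

```python
def viterbi(obs, prior, transition, obs_model):
    """Most probable state sequence of a discrete HMM, given observations.

    obs              : list of observation indices O(0) ... O(T)
    prior[i]         = P(S(0) = i)
    transition[i][j] = P(S(t) = j | S(t-1) = i)
    obs_model[i][o]  = P(O(t) = o | S(t) = i)
    """
    n = len(prior)
    # delta[i]: probability of the best path ending in state i so far.
    delta = [prior[i] * obs_model[i][obs[0]] for i in range(n)]
    backptr = []  # back-pointers to reconstruct the best path
    for o in obs[1:]:
        step = [max((delta[i] * transition[i][j], i) for i in range(n))
                for j in range(n)]
        backptr.append([best_i for _, best_i in step])
        delta = [best_p * obs_model[j][o] for j, (best_p, _) in enumerate(step)]
    # Backtrack from the most probable final state.
    path = [max(range(n), key=lambda j: delta[j])]
    for back in reversed(backptr):
        path.append(back[path[-1]])
    return list(reversed(path))
```

Dynamic programming keeps the cost at O(T × n²) instead of the nᵀ cost of enumerating all state sequences, which is what makes the algorithm "very efficient".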

The Baum–Welch algorithm, an expectation–maximization procedure that learns the transition and observation matrices from observation sequences, has been developed for HMMs.

Applications

Academic applications

Since 2000, Bayesian programming has been used to develop both robotics applications and life sciences models.

Robotics

In robotics, Bayesian programming was applied to autonomous robotics, robotic CAD systems, advanced driver-assistance systems, robotic arm control, mobile robotics, human-robot interaction, human-vehicle interaction (Bayesian autonomous driver models), video game avatar programming and training, and real-time strategy games (AI).

Life sciences

In life sciences, Bayesian programming was used in vision to reconstruct shape from motion, to model visuo-vestibular interaction and to study saccadic eye movements; in speech perception and control to study early speech acquisition and the emergence of articulatory-acoustic systems; and to model handwriting perception and control.

Pattern recognition

Bayesian program learning has potential applications in voice recognition and synthesis, image recognition and natural language processing. It employs the principles of compositionality (building abstract representations from parts), causality (building complexity from parts) and learning to learn (using previously recognized concepts to ease the creation of new concepts).

Possibility theories

The comparison between probabilistic approaches (not only Bayesian programming) and possibility theories continues to be debated.

Possibility theories, for instance fuzzy sets, fuzzy logic and possibility theory, are alternatives to probability for modelling uncertainty. Their proponents argue that probability is insufficient or inconvenient to model certain aspects of incomplete/uncertain knowledge.

The defense of probability is mainly based on Cox's theorem, which starts from four postulates concerning rational reasoning in the presence of uncertainty. It demonstrates that the only mathematical framework that satisfies these postulates is probability theory. The argument is that any approach other than probability necessarily infringes at least one of these postulates, and the debate turns on the value of that infringement.

Probabilistic programming

The purpose of probabilistic programming is to unify the scope of classical programming languages with probabilistic modeling (especially Bayesian networks) to deal with uncertainty while profiting from the programming languages' expressiveness to encode complexity.

Extended classical programming languages include logical languages, as proposed in Probabilistic Horn Abduction, Independent Choice Logic, PRISM, and ProbLog, which proposes an extension of Prolog.

There are also extensions of functional programming languages (essentially Lisp and Scheme) such as IBAL or CHURCH. The underlying programming languages can be object-oriented, as in BLOG and FACTORIE, or more standard ones, as in CES and FIGARO.

The purpose of Bayesian programming is different. Jaynes' precept of "probability as logic" argues that probability is an extension of and an alternative to logic above which a complete theory of rationality, computation and programming can be rebuilt. Bayesian programming attempts to replace classical languages with a programming approach based on probability that considers incompleteness and uncertainty.

The precise comparison between the semantics and power of expression of Bayesian and probabilistic programming is an open question.

Radiation And The Value Of A Human Life


The 63rd HPS meeting in Cleveland brought together the group of scientists most responsible for protecting us from the harms of radiation. Credit: Casper Sun

The 63rd Annual Meeting of the Health Physics Society wrapped up last week in the great city of Cleveland and there was a palpable air of excitement that things might change.

HPS represents the group of scientists most responsible for protecting us from the harms of radiation. They know we have ridiculously low rad limits that can cause more harm than good, and also unnecessarily cost us hundreds of billions of dollars that could be used to save actual lives.

The venue for this meeting was fitting since Ohio is experiencing some of the national insanity of prematurely closing nuclear plants for no good reason other than unjustified fear of radiation and the intransigence of lawmakers and anti-nuclear groups to value the best low-carbon, safest energy source we have to fight global warming. Even in light of all climate change experts’ call for more nuclear, from Jim Hansen on down.


The assumption of LNT: any radiation dose, no matter how small, will cause harm. However, small doses of radiation, < 10 rem (cSv)/yr, appear to be easily handled by cellular repair mechanisms that evolved as a normal adaptive response with the emergence of the eukaryotic cell 2.3 billion years ago. It would be odd indeed if the upper end of Earth background radiation was not near the threshold for significant radiation-induced biological effects. Credit: Conca

This fear originated around 1959, when the world adopted a singularly foolish hypothesis for the negative biological effects of radiation, called the Linear No-Threshold hypothesis, or LNT and its resultant policy, ALARA (As Low As Reasonably Achievable).

LNT assumes, in contrast to almost all data on living organisms, that any radiation is bad and there is no threshold of radioactivity below which there is no risk, even Earth background radiation levels. Following ALARA means that we should protect everyone from all radiation, making doses as low as we possibly can, even if it costs billions.

Indeed, we do spend billions of dollars each year protecting against what was once background levels. It’s right out of a Road Warrior movie. No wonder the fear of radiation took over the worldview. Science fiction is much more fun to study than real science.

The Plenary Session at the HPS meeting was devoted to just these issues. Yours truly started off with a discussion of what these extra unnecessary costs might be. About $500 billion is a good estimate for America, over a trillion dollars for the world.

Dr. Roger Coates, President of the International Radiation Protection Association in London, followed with a discussion of ‘Prudence and the Hidden Burden of Conservatism’, outlining the strong support for developing a more practical and pragmatic approach to radiation protection.

While prudence is a noble concept that we use in our everyday lives to keep us from harm, the combined impact of accumulated layers of conservatism in deriving radiation limits has resulted in limits that are hundreds of times below the radiation doses that we should worry about, and at a great cost. Always choosing the worst-case scenario isn’t conservative, it’s just wrong. That’s why we developed statistics in the first place.

Dr. Antone Brooks gave the Landauer Lecture, titled 'Ya, But What If?', which probed the fear that permeates society with ridiculous radiation scenarios like:

‘Ya, but what if the fallout from bomb testing in the 1950s and 60s created a cancer epidemic in Utah?’ No, Utah has the lowest cancer rate in the Nation and Washington County, where the fallout was the highest, has the second lowest cancer rate in the state. And is where Dr. Brooks grew up.

Or, ‘Ya, but what if radon is the second leading cause of lung cancer?’ No, radon alone is not very effective in producing lung cancer.  Radon combined with cigarette smoke increases lung cancer, but it’s the smoking that dominates that risk.  If you have radon in your home just stop smoking and put a fan in the basement.

DOE could save billions, and clean up more sites, if society adopted a more reasonable limit for radiation, still well below background levels, but where no harm has ever been observed. Credit: DOE
 
Most of the fears and questions surrounding radiation have been answered very well over the years and all studies point to the need to have a reasonable threshold for radiation below which we don’t have to worry about health effects. And we don’t have to spend billions protecting against a phantom menace.

Unless you, the reader, are in a boat out in the middle of the Pacific Ocean, you’re getting a radiation dose between 200 and 1,000 mrem/year (2 to 10 mSv/yr) in the United States, just from background sources such as rock, dirt, potato chips and cosmic rays (EPA Rad Limits) and the radioactive isotopes of uranium, thorium, radon and potassium that are in them.

(Yes, potato chips are the most radioactive food, but if you eat ten bags a day, it’s the salt and fat that will kill you, not the radiation!)

Some places in the world have background doses ten times higher than us, but there have never been any observable health effects from these doses. Ever. Anywhere.

On the other hand, regulations require nuclear waste disposal systems and clean-up standards to meet release criteria of less than 4 mrem/yr (0.04 mSv/yr) to downgradient drinking water supplies. Moving from Cleveland to western Colorado will give you an extra 100 mrem/yr (1 mSv/yr). Should we make moving to Colorado against the law? Yes, according to these regulations. No, according to common sense.

There are different types of radiation and different biological effects of each and it doesn’t matter whether the radiation is natural or man-made, they’re the same. The measure of dose (rem or Sv) takes all that into account. However, we have made our regulations act as if they are different which is totally unscientific.

It is not possible to see statistical evidence of public health risks at exposures less than 10,000 mrem/yr (100 mSv/yr) because any risk is well below the noise level of all other risks faced by humans or the environment. That’s why we will never see any deaths, or even excess cancers, from Fukushima radiation, but 1,600 people died in the days following the accident from the frantic forced evacuations that resulted mainly from fear.

It is useful to note that the debate surrounding LNT in the 1950s was all wrapped up in the Cold War and was used by China and the USSR to stop above-ground nuclear bomb tests by America, so I guess that was a good thing. LNT sounded like a nice conservative idea at the time, but little did we know the collateral damage that would follow, such as fear of radiological medical diagnostics and treatments that save millions of lives each year.

The underlying problem with LNT is that life originated on Earth in a much higher radiation background than exists today, long before we split the atom. When the eukaryotic cell emerged over 2 billion years ago (the type of cell that makes up all higher life forms including humans), background radiation was between 1,000 and 20,000 mrem/year (10 and 200 mSv/yr).

In order to survive and thrive, these cells developed very efficient mechanisms for repairing radiation damage at or below these levels. These same mechanisms were essential to life when oxygen first entered the atmosphere about the same time. Free oxygen is thousands of times more deadly than radiation and we have only recently understood the importance of anti-oxidants in molecular biology and in our food for overall health.

Radiation is one of the weakest mutagenic and cytogenic agents on Earth. That’s why it takes so much radiation to hurt anyone.

Health physicists get a little frustrated that, on the one hand, they know that there is no real risk from any radiation below background, but are required to keep to limits a hundred times lower. The public often asks, ‘if it’s so safe, why are you working so hard to keep it so low?’

Good question. And the answer is ‘Because we’re told to. It’s the rules.’

So what are the costs of regulating radiation doses to such absurdly low levels?

A better question might be – How much do we consider the value of a human life to be?

It depends on how you view it, and who is paying (1, 2, 3, 4):

$7 million is the value of a human life according to EPA.

$316,000 is the average paid out in health care over a life in America.

$129,000 is the average historic legal value of a human life in America.

$12,420 is the death benefit to families of deceased soldiers, although circumstances in combat can increase that.

$45 million is the value of a single healthy human body when chopped up and sold on the black market for body parts.

$2.5 billion is the amount we spend to save a single theoretical human life based on LNT, although it is doubtful we have saved any lives at these levels.

$100 is the cost to save a human life by immunizing against measles, diphtheria, and pertussis in sub-Saharan Africa.

So, we could save 25 million lives in Africa for the cost of saving one theoretical life from low-levels of radioactivity. This is nuts. And it creates an ethical dilemma we have not yet faced up to.

The presence or absence of a threshold dose for radiation is a societal decision that society has been left out of. That needs to change. We have real problems that need to be solved. Yes, we need to deal with radiation doses above, say, 5 rem, 10 rem, whatever level you want to draw as the threshold. But there’s always a threshold. No different than, mercury, cadmium, or lead. Everything is in the environment at some level, but the old adage that dose makes the poison is quite true.

Life on Earth has dealt with this issue for billions of years, it certainly will for the next billion or two. One part per billion of lead is no big deal. One part per million…now that’s a problem. It’s why we have a 15 part per billion threshold for Pb in drinking water.

Life has easily dealt with radiation levels up to 10 rem/year for over 3 billion years. It’s not a problem. 100 rem/year…now that’s a problem.  We need a threshold, or we spin ourselves into nonsense.

Like we always do.

Note: another scientific meeting on LNT is scheduled for this coming September 30 to October 3 in the Tri-Cities, Washington.


Dr. James Conca is an expert on energy, nuclear and dirty bombs, a planetary geologist, and a professional speaker. Follow him on Twitter @jimconca and see his book at Amazon.com
