Wednesday, October 23, 2019

Quantum network

From Wikipedia, the free encyclopedia
 
Quantum networks form an important element of quantum computing and quantum communication systems. Quantum networks facilitate the transmission of information in the form of quantum bits, also called qubits, between physically separated quantum processors. A quantum processor is a small quantum computer able to perform quantum logic gates on a certain number of qubits. Quantum networks work in a similar way to classical networks. The main difference, as detailed in later sections, is that quantum networking, like quantum computing, is better at solving certain kinds of problems, such as modeling quantum systems.

Basics

Quantum networks for computation

Networked quantum computing, or distributed quantum computing, works by linking multiple quantum processors through a quantum network that sends qubits between them. Doing this creates a quantum computing cluster and therefore more computing potential. Less powerful computers can be linked in this way to create one more powerful processor, analogous to connecting several classical computers to form a computer cluster in classical computing. As in classical computing, this system is scalable: more and more quantum computers can be added to the network. Currently, quantum processors are separated only by short distances.

Quantum networks for communication

In the realm of quantum communication, one wants to send qubits from one quantum processor to another over long distances. This way, local quantum networks can be interconnected into a quantum internet. A quantum internet supports many applications, which derive their power from the fact that by creating quantum entangled qubits, information can be transmitted between the remote quantum processors. Most applications of a quantum internet require only very modest quantum processors. For most quantum internet protocols, such as quantum key distribution in quantum cryptography, it is sufficient if these processors are capable of preparing and measuring only a single qubit at a time. This is in contrast to quantum computing, where interesting applications can only be realized if the (combined) quantum processors can easily simulate more qubits than a classical computer (around 60). Quantum internet applications require only small quantum processors, often just a single qubit, because quantum entanglement can already be realized between just two qubits. A simulation of an entangled quantum system on a classical computer cannot simultaneously provide the same security and speed.

Overview of the elements of a quantum network

The basic structure of a quantum network and more generally a quantum internet is analogous to a classical network. First, we have end nodes on which applications are ultimately run. These end nodes are quantum processors of at least one qubit. Some applications of a quantum internet require quantum processors of several qubits as well as a quantum memory at the end nodes.

Second, to transport qubits from one node to another, we need communication lines. For the purpose of quantum communication, standard telecom fibers can be used. For networked quantum computing, in which quantum processors are linked at short distances, different wavelengths are chosen depending on the exact hardware platform of the quantum processor.

Third, to make maximum use of communication infrastructure, one requires optical switches capable of delivering qubits to the intended quantum processor. These switches need to preserve quantum coherence, which makes them more challenging to realize than standard optical switches.

Finally, one requires a quantum repeater to transport qubits over long distances. Repeaters appear in-between end nodes. Since qubits cannot be copied, classical signal amplification is not possible. By necessity, a quantum repeater works in a fundamentally different way than a classical repeater.

Elements of a quantum network

End nodes: quantum processors

End nodes can both receive and emit information. Telecommunication lasers and parametric down-conversion combined with photodetectors can be used for quantum key distribution. In this case, the end nodes can in many cases be very simple devices consisting only of beamsplitters and photodetectors. 

However, for many protocols more sophisticated end nodes are desirable. These systems provide advanced processing capabilities and can also be used as quantum repeaters. Their chief advantage is that they can store and retransmit quantum information without disrupting the underlying quantum state. The quantum state being stored can either be the relative spin of an electron in a magnetic field or the energy state of an electron. They can also perform quantum logic gates.

One way of realizing such end nodes is by using color centers in diamond, such as the nitrogen-vacancy center. This system forms a small quantum processor featuring several qubits. NV centers can be utilized at room temperature. Small-scale quantum algorithms and quantum error correction have already been demonstrated in this system, as well as the ability to entangle two remote quantum processors and perform deterministic quantum teleportation.

Another possible platform is quantum processors based on ion traps, which utilize radio-frequency magnetic fields and lasers. In a multispecies trapped-ion node network, photons entangled with a parent atom are used to entangle different nodes. Also, cavity quantum electrodynamics (cavity QED) is one possible method of doing this. In cavity QED, photonic quantum states can be transferred to and from atomic quantum states stored in single atoms contained in optical cavities. This allows for the transfer of quantum states between single atoms using optical fiber, in addition to the creation of remote entanglement between distant atoms.

Communication lines: physical layer

Over long distances, the primary method of operating quantum networks is to use optical networks and photon-based qubits. This is due to optical networks having a reduced chance of decoherence. Optical networks have the advantage of being able to re-use existing optical fiber. Alternatively, free space networks can be implemented that transmit quantum information through the atmosphere or through a vacuum.

Fiber optic networks

Optical networks using existing telecommunication fiber can be implemented using hardware similar to existing telecommunication equipment. This fiber can be either single-mode or multi-mode, with multi-mode allowing for more precise communication. At the sender, a single photon source can be created by heavily attenuating a standard telecommunication laser such that the mean number of photons per pulse is less than 1. For receiving, an avalanche photodetector can be used. Various methods of phase or polarization control can be used such as interferometers and beam splitters. In the case of entanglement based protocols, entangled photons can be generated through spontaneous parametric down-conversion. In both cases, the telecom fiber can be multiplexed to send non-quantum timing and control signals.
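As a rough illustration of the attenuated-laser approach (a back-of-the-envelope sketch, not taken from the article; the mean photon number of 0.1 is an assumed example value), the photon-number statistics of a weak coherent pulse are Poissonian, so one can estimate how rarely a pulse carries more than one photon:

```python
import math

# Illustrative sketch: Poisson photon-number statistics of a heavily attenuated
# laser pulse.  The mean photon number mu = 0.1 is an assumed example value.
def poisson(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

mu = 0.1
p_empty = poisson(0, mu)              # ~0.905: most pulses carry no photon at all
p_single = poisson(1, mu)             # ~0.090: the useful single-photon pulses
p_multi = 1 - p_empty - p_single      # ~0.005: multi-photon pulses, a security risk
print(f"P(0)={p_empty:.3f}  P(1)={p_single:.3f}  P(>1)={p_multi:.4f}")
```

With a mean of 0.1 photons per pulse, only about one pulse in two hundred contains more than one photon, which is why heavy attenuation approximates a single-photon source.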

Free space networks

Free space quantum networks operate similarly to fiber optic networks but rely on line of sight between the communicating parties instead of a fiber optic connection. Free space networks can typically support higher transmission rates than fiber optic networks and do not have to account for the polarization scrambling caused by optical fiber. However, over long distances, free space communication is subject to an increased chance of environmental disturbance on the photons.

Importantly, free space communication is also possible from a satellite to the ground. A quantum satellite capable of entanglement distribution over a distance of 1,203 km has been demonstrated. The experimental exchange of single photons from a global navigation satellite system at a slant distance of 20,000 km has also been reported. These satellites can play an important role in linking smaller ground-based networks over larger distances.

Repeaters

Long distance communication is hindered by the effects of signal loss and decoherence inherent to most transport mediums such as optical fiber. In classical communication, amplifiers can be used to boost the signal during transmission, but in a quantum network amplifiers cannot be used since qubits cannot be copied – known as the no-cloning theorem. That is, to implement an amplifier, the complete state of the flying qubit would need to be determined, something which is both unwanted and impossible.

Trusted repeaters

An intermediary step which allows the testing of communication infrastructure is the use of trusted repeaters. Importantly, a trusted repeater cannot be used to transmit qubits over long distances. Instead, a trusted repeater can only be used to perform quantum key distribution with the additional assumption that the repeater is trusted. Consider two end nodes A and B, and a trusted repeater R in the middle. A and R now perform quantum key distribution to generate a key k_AR. Similarly, R and B run quantum key distribution to generate a key k_RB. A and B can now obtain a key k_AB between themselves as follows: A sends k_AB to R encrypted with the key k_AR. R decrypts to obtain k_AB. R then re-encrypts k_AB using the key k_RB and sends it to B. B decrypts to obtain k_AB. A and B now share the key k_AB. The key k_AB is secure from an outside eavesdropper, but clearly the repeater R also knows k_AB. This means that any subsequent communication between A and B does not provide end to end security, but is only secure as long as A and B trust the repeater R.
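The key-relay procedure above can be sketched in a few lines of code (a minimal illustration assuming one-time-pad encryption on each hop; the key names and the xor helper are this sketch's own, and the QKD links themselves are abstracted into randomly generated keys):

```python
import secrets

# Minimal sketch of the trusted-repeater key relay described above.  The QKD links
# are abstracted into pre-shared random keys; the names k_ar, k_rb, k_ab are ours.
def xor(a, b):                        # one-time-pad encryption/decryption
    return bytes(x ^ y for x, y in zip(a, b))

k_ar = secrets.token_bytes(32)        # key established by QKD between A and R
k_rb = secrets.token_bytes(32)        # key established by QKD between R and B
k_ab = secrets.token_bytes(32)        # end-to-end key that A wants to share with B

ciphertext_ar = xor(k_ab, k_ar)       # A -> R: k_ab encrypted with k_ar
k_ab_at_r = xor(ciphertext_ar, k_ar)  # R decrypts -- and so necessarily learns k_ab
ciphertext_rb = xor(k_ab_at_r, k_rb)  # R -> B: re-encrypted with k_rb
k_ab_at_b = xor(ciphertext_rb, k_rb)  # B decrypts

assert k_ab_at_b == k_ab              # A and B now share the key -- but so does R
```

The final assertion makes the trust assumption explicit: the relay works, but the repeater necessarily holds the end-to-end key in the clear.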

Quantum repeaters

Diagram for quantum teleportation of a photon
 
A true quantum repeater allows the end to end generation of quantum entanglement, and thus - by using quantum teleportation - the end to end transmission of qubits. In quantum key distribution protocols one can test for such entanglement. This means that when making encryption keys, the sender and receiver are secure even if they do not trust the quantum repeater. Any other application of a quantum internet also requires the end to end transmission of qubits, and thus a quantum repeater.
Quantum repeaters allow entanglement to be established between distant nodes without physically sending an entangled qubit the entire distance.

In this case, the quantum network consists of many short distance links of perhaps tens or hundreds of kilometers. In the simplest case of a single repeater, two pairs of entangled qubits are established: qubits A and R_A located at the sender and the repeater, and a second pair R_B and B located at the repeater and the receiver. These initial entangled qubits can be easily created, for example through parametric down conversion, with one qubit physically transmitted to an adjacent node. At this point, the repeater can perform a Bell measurement on the qubits R_A and R_B, thus teleporting the quantum state of R_A onto B. This has the effect of "swapping" the entanglement such that A and B are now entangled at a distance twice that of the initial entangled pairs. It can be seen that a network of such repeaters can be used linearly or in a hierarchical fashion to establish entanglement over great distances.
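A small statevector calculation can illustrate the swapping step (an illustrative NumPy sketch, not a description of any particular hardware; the qubit labels follow the paragraph above, and the projection assumes the repeater happens to obtain the Φ+ measurement outcome):

```python
import numpy as np

# Illustrative NumPy sketch of entanglement swapping.  Qubit order: A, R_A, R_B, B,
# with |Phi+> = (|00> + |11>)/sqrt(2).  The projection keeps only the Bell-measurement
# outcome that needs no corrective operation.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

# Two independent Bell pairs: (A, R_A) and (R_B, B).
psi = np.kron(bell, bell).reshape(2, 2, 2, 2)          # indices (a, r_a, r_b, b)

# Bell measurement at the repeater: project qubits R_A and R_B onto |Phi+>.
bell_mat = bell.reshape(2, 2)                          # amplitudes indexed (r_a, r_b)
psi_ab = np.einsum('ij,aijb->ab', bell_mat.conj(), psi)
psi_ab /= np.linalg.norm(psi_ab)                       # renormalise the outer qubits

# A and B are now maximally entangled although no qubit travelled the full distance.
fidelity = abs(np.vdot(bell, psi_ab.reshape(4))) ** 2
print(f"Fidelity of (A, B) with |Phi+>: {fidelity:.3f}")   # 1.000
```

In practice the repeater obtains one of four Bell outcomes and communicates the result classically so the end nodes can apply a corrective operation; the sketch keeps only the outcome that needs no correction.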

Hardware platforms suitable as end nodes above can also function as quantum repeaters. However, there are also hardware platforms specific only to the task of acting as a repeater, without the capabilities of performing quantum gates.

Error correction

Error correction can be used in quantum repeaters. Due to technological limitations, however, the applicability is limited to very short distances as quantum error correction schemes capable of protecting qubits over long distances would require an extremely large amount of qubits and hence extremely large quantum computers.

Errors in communication can be broadly classified into two types: Loss errors (due to optical fiber/environment) and operation errors (such as depolarization, dephasing etc.). While redundancy can be used to detect and correct classical errors, redundant qubits cannot be created due to the no-cloning theorem. As a result, other types of error correction must be introduced such as the Shor code or one of a number of more general and efficient codes. All of these codes work by distributing the quantum information across multiple entangled qubits so that operation errors as well as loss errors can be corrected.
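As a concrete, much simplified example of distributing quantum information across several qubits (a sketch of the 3-qubit bit-flip code, a simple building block of the Shor code mentioned above; the amplitudes and the choice of flipped qubit are arbitrary illustration values):

```python
import numpy as np
from functools import reduce

# Illustrative sketch of the 3-qubit bit-flip code: one logical qubit is spread over
# three physical qubits, and a single bit-flip error is located by parity checks and
# undone.  Amplitudes are arbitrary.
def kron(*ops):
    return reduce(np.kron, ops)

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

a, b = 0.6, 0.8                                   # encode a|0> + b|1>  ->  a|000> + b|111>
encoded = a * kron(ket0, ket0, ket0) + b * kron(ket1, ket1, ket1)

noisy = kron(I2, X, I2) @ encoded                 # bit-flip error on the middle qubit

# Syndrome: the parities Z1Z2 and Z2Z3 are +/-1 here and together locate the flip.
s12 = int(round(np.vdot(noisy, kron(Z, Z, I2) @ noisy).real))
s23 = int(round(np.vdot(noisy, kron(I2, Z, Z) @ noisy).real))
recovery = {(1, 1): kron(I2, I2, I2),             # no error
            (-1, 1): kron(X, I2, I2),             # first qubit flipped
            (-1, -1): kron(I2, X, I2),            # middle qubit flipped
            (1, -1): kron(I2, I2, X)}[(s12, s23)] # last qubit flipped

print(np.allclose(recovery @ noisy, encoded))     # True: the logical state is restored
```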

In addition to quantum error correction, classical error correction can be employed by quantum networks in special cases such as quantum key distribution. In these cases, the goal of the quantum communication is to securely transmit a string of classical bits. Traditional error correction codes such as Hamming codes can be applied to the bit string before encoding and transmission on the quantum network.

Entanglement purification

Quantum decoherence can occur when one qubit from a maximally entangled Bell state is transmitted across a quantum network. Entanglement purification allows for the creation of nearly maximally entangled qubits from a large number of arbitrary weakly entangled qubits, and thus provides additional protection against errors. Entanglement purification (also known as entanglement distillation) has already been demonstrated in nitrogen-vacancy centers in diamond.

Applications

A quantum internet supports numerous applications, enabled by quantum entanglement. In general, quantum entanglement is well suited for tasks that require coordination, synchronization or privacy. 

Examples of such applications include quantum key distribution, clock synchronization, protocols for distributed system problems such as leader election or byzantine agreement, extending the baseline of telescopes, as well as position verification, secure identification and two-party cryptography in the noisy-storage model. A quantum internet also enables secure access to a quantum computer in the cloud. Specifically, a quantum internet enables very simple quantum devices to connect to a remote quantum computer in such a way that computations can be performed there without the quantum computer finding out what this computation actually is.

Secure communications

In any form of communication, the largest issue has always been keeping the exchange private. From the couriers who carried letters between ancient battle commanders to the secure radio links in use today, the goal is to ensure that what the sender transmits reaches the receiver intact and unread. This is an area in which quantum networks particularly excel. By applying a quantum operator of the user's choosing to the information, it can be sent to the receiver without any chance of an eavesdropper accurately recording it without either the sender or receiver knowing. This works because a listener who tries to intercept the transmission inevitably changes the information in the process, tipping their hand to the parties they are attacking. Secondly, without the proper quantum operator to decode the information, the eavesdropper corrupts the transmission without being able to use it themselves.
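This eavesdropper-detection argument can be made concrete with a toy simulation in the spirit of the BB84 protocol (an illustrative sketch, not the article's own construction; the intercept-resend model and the roughly 25% error rate it produces are standard, but the function and parameter names are assumptions of this example):

```python
import random

# Illustrative BB84-style sketch of eavesdropper detection (intercept-resend model).
def error_rate(n_rounds, eavesdrop, rng):
    errors = sifted = 0
    for _ in range(n_rounds):
        bit = rng.randint(0, 1)
        alice_basis = rng.randint(0, 1)        # 0 = rectilinear, 1 = diagonal
        value, basis = bit, alice_basis

        if eavesdrop:                          # Eve measures and resends every photon
            eve_basis = rng.randint(0, 1)
            if eve_basis != basis:             # wrong basis -> her outcome is random
                value = rng.randint(0, 1)
            basis = eve_basis

        bob_basis = rng.randint(0, 1)
        if bob_basis != basis:                 # Bob in the wrong basis -> random outcome
            value = rng.randint(0, 1)

        if bob_basis == alice_basis:           # keep only matching-basis rounds
            sifted += 1
            errors += int(value != bit)
    return errors / sifted

rng = random.Random(7)
print(f"error rate without Eve: {error_rate(20_000, False, rng):.3f}")   # ~0.000
print(f"error rate with Eve:    {error_rate(20_000, True, rng):.3f}")    # ~0.250
```

When the eavesdropper measures and resends every photon, about a quarter of the sifted bits disagree, which the legitimate parties detect by comparing a random subset of their key.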

Jamming protection

Quantum networks can also be used to protect communications against jamming. One technique is frequency-hopping spread spectrum, a method currently used by the United States Army, in which the user hops from frequency to frequency many times a second so that it is hard for an attacker to keep up and successfully attack the user. Direct-sequence spread spectrum can be used by applying a quantum operator to the system and then freely transmitting the information over the frequencies, because an attacker cannot read the information without knowing the key (a quantum operator). These two techniques can be used together to produce a more secure communications system.

Frequency-hopping spread spectrum

Frequency-hopping spread spectrum (FHSS) is a method of protecting information transfer in which the user switches from one frequency to another hundreds of times a second. For this method to work, one computer is set as the main computer and regulates when the other computers switch frequencies and how often. By switching frequencies hundreds of times a second, a user can be assured that any would-be attacker will have an extremely hard time both trying to read the data and trying to jam the frequency.
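A classical toy sketch of the idea (illustrative only; the shared seed, channel list and hop count are made-up values) shows how sender and receiver can derive the same hop sequence without ever transmitting it:

```python
import random

# Classical toy sketch of frequency hopping: sender and receiver derive the same hop
# sequence from a shared secret seed.  Seed, channel list and hop count are invented.
SHARED_SEED = "shared-secret"
CHANNELS = list(range(2400, 2480))             # e.g. 80 notional 1 MHz channels

def hop_sequence(seed, n_hops):
    rng = random.Random(seed)                  # deterministic PRNG -> identical output
    return [rng.choice(CHANNELS) for _ in range(n_hops)]

sender_hops = hop_sequence(SHARED_SEED, 10)
receiver_hops = hop_sequence(SHARED_SEED, 10)
assert sender_hops == receiver_hops            # both parties hop in lock-step
print(sender_hops)
```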

Direct-sequence spread spectrum

Direct-sequence spread spectrum (DSSS) is a method of protecting information transfer in which the user applies a predetermined quantum operator to the information being sent, so that only the sender and the receiver can decipher the information using that operator. This makes it difficult for a potential listener to eavesdrop, because without the operator they cannot determine the information. At the same time, if a listener does try to decode the sent information, doing so will change it, which immediately tells the receiver that someone is listening.
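A classical sketch of direct-sequence spreading (illustrative only; the chip length and seed are arbitrary, and the quantum-operator aspect described above is not modeled) shows how a shared spreading code lets the receiver recover the data while an outsider sees only pseudo-random chips:

```python
import random

# Classical toy sketch of direct-sequence spreading: each data bit is expanded into
# several "chips" by XOR with a pseudo-random chipping code shared by both parties.
CHIPS_PER_BIT = 8
rng = random.Random("shared-chip-seed")            # hypothetical shared secret
code = [rng.randint(0, 1) for _ in range(CHIPS_PER_BIT)]

def spread(bits):
    return [b ^ c for b in bits for c in code]

def despread(chips):
    bits = []
    for i in range(0, len(chips), CHIPS_PER_BIT):
        block = chips[i:i + CHIPS_PER_BIT]
        votes = [ch ^ c for ch, c in zip(block, code)]   # undo the code, then majority vote
        bits.append(1 if sum(votes) > CHIPS_PER_BIT // 2 else 0)
    return bits

message = [1, 0, 1, 1]
assert despread(spread(message)) == message        # receiver with the code recovers the data
```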

Jamming

When using any computer to communicate with another computer, security is the name of the game. "Attackers" are people who want to receive information that was not intended for them, or who want to stop the proper receiver of the transmission from receiving their information. Quantum networks are particularly useful in this area, as there are many different types of jamming techniques found in both classical and quantum systems.
Spot jamming
Spot jamming is a process wherein an attacker fully attacks one frequency at a time. For this method to be successful the attacker must send their transmission with more power than the original sender. By doing this the attacker will essentially overpower the original sender's message. The problem with this method is that it takes a tremendous amount of power to overpower a transmission as stated. Another issue with this method is that the original sender can easily switch to another frequency and if the original sender is using frequency-hopping spread spectrum the user will switch frequencies automatically with little hindrance to the original sender.
Sweep jamming
Sweep jamming is similar to spot jamming except that the attacker switches from one frequency to another in rapid succession. In this method the attacker is still attacking by sending a much more powerful signal at the same time as the original sender. The advantage of this method over spot jamming is that sweep jamming has a much larger chance of disrupting the sender's frequency while costing the same amount of energy as spot jamming.
Barrage jamming
Barrage jamming is when an attacker attacks many frequencies at one time, though as the range of frequencies grows the ability to jam any one of them decreases. By attacking several frequencies at a time, the attacker increases the chance of hitting one of the sender's frequencies. The main problem with this method is that the attacker's power is spread across many frequencies at once, so it is possible that the attacker hits the sender's frequency without affecting it, due to the low power of the jamming signal on that frequency.

Current status

Quantum internet

At present, there is no network connecting quantum processors, or quantum repeaters deployed outside a lab.

Quantum key distribution networks

Several test networks have been deployed that are tailored to the task of quantum key distribution either at short distances (but connecting many users), or over larger distances by relying on trusted repeaters. These networks do not yet allow for the end to end transmission of qubits or the end to end creation of entanglement between far away nodes.

Major quantum network projects and QKD protocols implemented

Quantum network                      Start  BB84  BBM92  E91  DPS  COW
DARPA Quantum Network                2001   Yes   No     No   No   No
SECOQC QKD network in Vienna         2003   Yes   Yes    No   No   Yes
Tokyo QKD network                    2009   Yes   Yes    No   Yes  No
Hierarchical network in Wuhu, China  2009   Yes   No     No   No   No
Geneva area network (SwissQuantum)   2010   Yes   No     No   No   Yes
DARPA Quantum Network
Starting in the early 2000s, DARPA began sponsorship of a quantum network development project with the aim of implementing secure communication. The DARPA Quantum Network became operational within the BBN Technologies laboratory in late 2003 and was expanded further in 2004 to include nodes at Harvard and Boston Universities. The network consists of multiple physical layers, including fiber optics supporting phase-modulated lasers and entangled photons as well as free-space links.
SECOQC Vienna QKD network
From 2003 to 2008 the Secure Communication based on Quantum Cryptography (SECOQC) project developed a collaborative network between a number of European institutions. The architecture chosen for the SECOQC project is a trusted repeater architecture which consists of point-to-point quantum links between devices where long distance communication is accomplished through the use of repeaters.
Chinese hierarchical network
In May 2009, a hierarchical quantum network was demonstrated in Wuhu, China. The hierarchical network consists of a backbone network of four nodes connecting a number of subnets. The backbone nodes are connected through an optical switching quantum router. Nodes within each subnet are also connected through an optical switch and are connected to the backbone network through a trusted relay.
Geneva area network (SwissQuantum)
The SwissQuantum network, developed and tested between 2009 and 2011, linked facilities at CERN with the University of Geneva and hepia in Geneva. The SwissQuantum program focused on transitioning the technologies developed in the SECOQC and other research quantum networks into a production environment, in particular their integration with existing telecommunication networks and their reliability and robustness.
Tokyo QKD network
In 2010, a number of organizations from Japan and the European Union set up and tested the Tokyo QKD network. The Tokyo network built upon existing QKD technologies and adopted a SECOQC-like network architecture. For the first time, one-time-pad encryption was implemented at high enough data rates to support popular end-user applications such as secure voice and video conferencing. Previous large-scale QKD networks typically used classical encryption algorithms such as AES for high-rate data transfer and used the quantum-derived keys for low-rate data or for regularly re-keying the classical encryption algorithms.
Beijing-Shanghai Trunk Line
In September 2017, a 2000-km quantum key distribution network between Beijing and Shanghai, China, was officially opened. This trunk line will serve as a backbone connecting quantum networks in Beijing, Shanghai, Jinan in Shandong province and Hefei in Anhui province. During the opening ceremony, two employees from the Bank of Communications completed a transaction from Shanghai to Beijing using the network. The State Grid Corporation of China is also developing a managing application for the link. The line uses 32 trusted nodes as repeaters. A quantum telecommunication network has also been put into service in Wuhan, capital of central China's Hubei Province, which will be connected to the trunk. Other similar city quantum networks along the Yangtze River are planned to follow.

Friendship paradox

From Wikipedia, the free encyclopedia
 
The friendship paradox is the phenomenon first observed by the sociologist Scott L. Feld in 1991 that most people have fewer friends than their friends have, on average. It can be explained as a form of sampling bias in which people with greater numbers of friends have an increased likelihood of being observed among one's own friends. In contradiction to this, most people believe that they have more friends than their friends have.

The same observation can be applied more generally to social networks defined by other relations than friendship: for instance, most people's sexual partners have had (on the average) a greater number of sexual partners than they have.

Mathematical explanation

In spite of its apparently paradoxical nature, the phenomenon is real, and can be explained as a consequence of the general mathematical properties of social networks. The mathematics behind this are directly related to the arithmetic-geometric mean inequality and the Cauchy–Schwarz inequality.

Formally, Feld assumes that a social network is represented by an undirected graph G = (V, E), where the set V of vertices corresponds to the people in the social network, and the set E of edges corresponds to the friendship relation between pairs of people. That is, he assumes that friendship is a symmetric relation: if X is a friend of Y, then Y is a friend of X. He models the average number of friends of a person in the social network as the average of the degrees of the vertices in the graph. That is, if vertex v has d(v) edges touching it (representing a person who has d(v) friends), then the average number μ of friends of a random person in the graph is

$$\mu = \frac{\sum_{v \in V} d(v)}{|V|} = \frac{2|E|}{|V|}.$$

The average number of friends that a typical friend has can be modeled by choosing a random person (who has at least one friend), and then calculating how many friends their friends have on average. This amounts to choosing, uniformly at random, an edge of the graph (representing a pair of friends) and an endpoint of that edge (one of the friends), and again calculating the degree of the selected endpoint. The probability of a certain vertex v being chosen is

$$\frac{d(v)}{|E|} \cdot \frac{1}{2}.$$

The first factor corresponds to how likely it is that the chosen edge contains the vertex, which increases when the vertex has more friends. The halving factor simply comes from the fact that each edge has two vertices. So the expected value of the number of friends of a (randomly chosen) friend is

$$\sum_{v \in V} \frac{d(v)}{2|E|} \cdot d(v) = \frac{\sum_{v \in V} d(v)^2}{2|E|}.$$

We know from the definition of variance that

$$\frac{\sum_{v \in V} d(v)^2}{|V|} = \mu^2 + \sigma^2,$$

where σ² is the variance of the degrees in the graph. This allows us to compute the desired expected value, using 2|E| = μ|V|:

$$\frac{\sum_{v \in V} d(v)^2}{2|E|} = \frac{\mu^2 + \sigma^2}{\mu} = \mu + \frac{\sigma^2}{\mu}.$$

For a graph that has vertices of varying degrees (as is typical for social networks), both μ and σ² are positive, which implies that the average degree of a friend is strictly greater than the average degree of a random node.

Another way of understanding how the first term arises is as follows. For each friendship (u, v), the node u counts v as a friend, and v has d(v) friends; since v is named as a friend by each of its d(v) neighbors, its degree d(v) is counted d(v) times, which gives the d(v)² term. Adding this over all friendships in the network, from both u's and v's perspectives, gives the numerator. The denominator is the total number of such friend mentions, which is twice the number of edges in the network (one from u's perspective and the other from v's).
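A small numerical check of the argument (an illustrative five-person graph, not data from Feld's paper) compares the average degree with the average degree of a friend:

```python
# Small numerical check on an illustrative five-person friendship graph.
friends = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

degrees = {v: len(nbrs) for v, nbrs in friends.items()}
mu = sum(degrees.values()) / len(degrees)              # average number of friends

# Average degree of a friend: each vertex v is counted once per friendship it is in,
# i.e. d(v) times -- the sum of d(v)^2 over 2|E| from the derivation above.
friend_degrees = [degrees[v] for nbrs in friends.values() for v in nbrs]
mu_friend = sum(friend_degrees) / len(friend_degrees)

print(mu, mu_friend)                                   # 2.0 < 2.4
```

Here μ = 2 while the average friend has 2.4 friends, matching μ + σ²/μ with σ² = 0.8.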

After this analysis, Feld goes on to make some more qualitative assumptions about the statistical correlation between the number of friends that two friends have, based on theories of social networks such as assortative mixing, and he analyzes what these assumptions imply about the number of people whose friends have more friends than they do. Based on this analysis, he concludes that in real social networks, most people are likely to have fewer friends than the average of their friends' numbers of friends. However, this conclusion is not a mathematical certainty; there exist undirected graphs (such as the graph formed by removing a single edge from a large complete graph) that are unlikely to arise as social networks but in which most vertices have higher degree than the average of their neighbors' degrees.

Applications

The analysis of the friendship paradox implies that the friends of randomly selected individuals are likely to have higher than average centrality. This observation has been used as a way to forecast and slow the course of epidemics, by using this random selection process to choose individuals to immunize or monitor for infection while avoiding the need for a complex computation of the centrality of all nodes in the network.
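A toy comparison (not the cited epidemiological studies themselves; the random-graph size and edge probability are arbitrary) shows why targeting a random friend of a random person reaches better-connected individuals:

```python
import random

# Toy comparison: immunise random people versus a random friend of random people,
# on an arbitrary Erdos-Renyi-style random graph.
rng = random.Random(0)
n, p = 1_000, 0.006                                    # ~6 friends per person on average
neighbors = {v: set() for v in range(n)}
for u in range(n):
    for v in range(u + 1, n):
        if rng.random() < p:
            neighbors[u].add(v)
            neighbors[v].add(u)

def mean_degree(people):
    return sum(len(neighbors[v]) for v in people) / len(people)

with_friends = [v for v in range(n) if neighbors[v]]
random_people = rng.sample(with_friends, 200)
their_friends = [rng.choice(sorted(neighbors[v])) for v in random_people]

print(f"average degree of random people:        {mean_degree(random_people):.2f}")
print(f"average degree of their random friends: {mean_degree(their_friends):.2f}")  # higher
```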

A study in 2010 by Christakis and Fowler showed that flu outbreaks can be detected almost two weeks earlier than with traditional surveillance measures by using the friendship paradox to monitor the infection in a social network. They found that using the friendship paradox to analyze the health of central friends is "an ideal way to predict outbreaks, but detailed information doesn't exist for most groups, and to produce it would be time-consuming and costly."

The "generalized friendship paradox" states that the friendship paradox applies to other characteristics as well. For example, one's co-authors are on average likely to be more prominent, with more publications, more citations and more collaborators, or one's followers on Twitter have more followers. The same effect has also been demonstrated for Subjective Well-Being by Bollen et al (2017), who used a large-scale Twitter network and longitudinal data on subjective well-being for each individual in the network to demonstrate that both a Friendship and a "happiness" paradox can occur in online social networks.

Tuesday, October 22, 2019

Sampling bias

From Wikipedia, the free encyclopedia

In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower sampling probability than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.

Medical sources sometimes refer to sampling bias as ascertainment bias. Ascertainment bias has basically the same definition, but is still sometimes classified as a separate type of bias.

Distinction from selection bias

Sampling bias is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias. A distinction, albeit not universally accepted, of sampling bias is that it undermines the external validity of a test (the ability of its results to be generalized to the entire population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

However, selection bias and sampling bias are often used synonymously.

Types

  • Selection from a specific real area. For example, a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts. A sample is also biased if certain members are underrepresented or overrepresented relative to others in the population. For example, a "man on the street" interview which selects people who walk by a certain location is going to have an overrepresentation of healthy individuals who are more likely to be out of the home than individuals with a chronic illness. This may be an extreme form of biased sampling, because certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected).
  • Self-selection bias, which is possible whenever the group of people being studied has any form of control over whether to participate (as current standards of human-subject research ethics require for many real-time and some longitudinal forms of study). Participants' decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample. For example, people who have strong opinions or substantial knowledge may be more willing to spend time answering a survey than those who do not. Another example is online and phone-in polls, which are biased samples because the respondents are self-selected. Those individuals who are highly motivated to respond, typically individuals who have strong opinions, are overrepresented, and individuals that are indifferent or apathetic are less likely to respond. This often leads to a polarization of responses with extreme perspectives being given a disproportionate weight in the summary. As a result, these types of polls are regarded as unscientific.
  • Pre-screening of trial participants, or advertising for volunteers within particular groups. For example, a study to "prove" that smoking does not affect fitness might recruit at the local fitness center, but advertise for smokers during the advanced aerobics class, and for non-smokers during the weight loss sessions.
  • Exclusion bias results from exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated into the study area (this may occur when newcomers are not available in a register used to identify the source population). Excluding subjects who move out of the study area during follow-up is rather equivalent to dropout or nonresponse, a selection bias in that it affects the internal validity of the study.
  • Healthy user bias, when the study population is likely healthier than the general population. For example, someone in poor health is unlikely to have a job as manual laborer.
  • Berkson's fallacy, when the study population is selected from a hospital and so is less healthy than the general population. This can result in a spurious negative correlation between diseases: a hospital patient without diabetes is more likely to have another given disease such as cholecystitis, since they must have had some reason to enter the hospital in the first place.
  • Overmatching, matching for an apparent confounder that actually is a result of the exposure. The control group becomes more similar to the cases in regard to exposure than does the general population.
  • Survivorship bias, in which only "surviving" subjects are selected, ignoring those that fell out of view. For example, using the record of current companies as an indicator of business climate or economy ignores the businesses that failed and no longer exist.
  • Malmquist bias, an effect in observational astronomy which leads to the preferential detection of intrinsically bright objects.

Symptom-based sampling

The study of medical conditions begins with anecdotal reports. By their nature, such reports only include those referred for diagnosis and treatment. A child who can't function in school is more likely to be diagnosed with dyslexia than a child who struggles but passes. A child examined for one condition is more likely to be tested for and diagnosed with other conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability, parents try to prevent their children from being stigmatized with those diagnoses, introducing further bias. Studies carefully selected from whole populations are showing that many conditions are much more common and usually much milder than formerly believed.

Truncate selection in pedigree studies

Simple pedigree example of sampling bias
 
Geneticists are limited in how they can obtain data from human populations. As an example, consider a human characteristic. We are interested in deciding if the characteristic is inherited as a simple Mendelian trait. Following the laws of Mendelian inheritance, if the parents in a family do not have the characteristic, but carry the allele for it, they are carriers (e.g. a non-expressive heterozygote). In this case their children will each have a 25% chance of showing the characteristic. The problem arises because we can't tell which families have both parents as carriers (heterozygous) unless they have a child who exhibits the characteristic. The description follows the textbook by Sutton.

The figure shows the pedigrees of all the possible families with two children when the parents are carriers (Aa).
  • Nontruncate selection. In a perfect world we should be able to discover all such families with a gene, including those who are simply carriers. In this situation the analysis would be free from ascertainment bias, and the pedigrees would be under "nontruncate selection". In practice, most studies identify, and include, families in a study based upon them having affected individuals.
  • Truncate selection. When afflicted individuals have an equal chance of being included in a study this is called truncate selection, signifying the inadvertent exclusion (truncation) of families who are carriers for a gene. Because selection is performed on the individual level, families with two or more affected children would have a higher probability of becoming included in the study.
  • Complete truncate selection is a special case where each family with an affected child has an equal chance of being selected for the study.
The probabilities of each of the families being selected are given in the figure, with the sample frequency of affected children also given. In this simple case, the researcher will look for a frequency of 4/7 or 5/8 for the characteristic, depending on the type of truncate selection used.
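A quick enumeration over the four possible two-child families reproduces the two figures (an illustrative calculation; the labels for the two ascertainment schemes follow the descriptions above and should be read as this sketch's interpretation):

```python
from itertools import product

# Enumeration over all two-child families of Aa x Aa parents (each child affected
# with probability 1/4), reproducing the 4/7 and 5/8 sample frequencies quoted above.
p_affected = 0.25
families = []                                          # (probability, number affected)
for child1, child2 in product([0, 1], repeat=2):       # 1 = affected child
    p1 = p_affected if child1 else 1 - p_affected
    p2 = p_affected if child2 else 1 - p_affected
    families.append((p1 * p2, child1 + child2))

# Complete truncate selection: every family with at least one affected child is
# equally likely to be ascertained.
num = sum(p * k for p, k in families if k > 0)
den = sum(p * 2 for p, k in families if k > 0)
print(num / den)                                       # 0.571... = 4/7

# Truncate (proband-based) selection: a family is ascertained once per affected child,
# so two-affected families are twice as likely to enter the study.
num = sum(p * k * k for p, k in families)
den = sum(p * k * 2 for p, k in families)
print(num / den)                                       # 0.625 = 5/8
```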

The caveman effect

An example of selection bias is called the "caveman effect". Much of our understanding of prehistoric peoples comes from caves, such as cave paintings made nearly 40,000 years ago. If there had been contemporary paintings on trees, animal skins or hillsides, they would have been washed away long ago. Similarly, evidence of fire pits, middens, burial sites, etc. are most likely to remain intact to the modern era in caves. Prehistoric people are associated with caves because that is where the data still exists, not necessarily because most of them lived in caves for most of their lives.

Problems due to sampling bias

Sampling bias is problematic because it is possible that a statistic computed from the sample is systematically erroneous. Sampling bias can lead to a systematic over- or under-estimation of the corresponding parameter in the population. Sampling bias occurs in practice as it is practically impossible to ensure perfect randomness in sampling. If the degree of misrepresentation is small, then the sample can be treated as a reasonable approximation to a random sample. Also, if the sample does not differ markedly in the quantity being measured, then a biased sample can still be a reasonable estimate. 

The word bias has a strong negative connotation. Indeed, biases sometimes come from deliberate intent to mislead or other scientific fraud. In statistical usage, bias merely represents a mathematical property, no matter if it is deliberate or unconscious or due to imperfections in the instruments used for observation. While some individuals might deliberately use a biased sample to produce misleading results, more often, a biased sample is just a reflection of the difficulty in obtaining a truly representative sample, or ignorance of the bias in their process of measurement or analysis. An example of how ignorance of a bias can exist is in the widespread use of a ratio (a.k.a. fold change) as a measure of difference in biology. Because it is easier to achieve a large ratio with two small numbers with a given difference, and relatively more difficult to achieve a large ratio with two large numbers with a larger difference, large significant differences may be missed when comparing relatively large numeric measurements. Some have called this a 'demarcation bias' because the use of a ratio (division) instead of a difference (subtraction) removes the results of the analysis from science into pseudoscience.

Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics, for example, deliberately oversamples from minority populations in many of its nationwide surveys in order to gain sufficient precision for estimates within these groups. These surveys require the use of sample weights (see later on) to produce proper estimates across all ethnic groups. Provided that certain conditions are met (chiefly that the weights are calculated and used correctly) these samples permit accurate estimation of population parameters.

Historical examples

Example of biased sample: as of June 2008 55% of web browsers (Internet Explorer) in use did not pass the Acid2 test. Due to the nature of the test, the sample consisted mostly of web developers.
 
A classic example of a biased sample and the misleading results it produced occurred in 1936. In the early days of opinion polling, the American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample included an over-representation of individuals who were rich, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50 thousand citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning president-elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The reason the Tribune was mistaken is that their editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register). In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing.

Statistical corrections for a biased sample

If entire segments of the population are excluded from a sample, then there are no adjustments that can produce estimates that are representative of the entire population. But if some groups are underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias. However, the success of the correction is limited to the selection model chosen; if certain variables are missing, the methods used to correct the bias could be inaccurate.

For example, a hypothetical population might include 10 million men and 10 million women. Suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 for each male and 0.625 for each female. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey.
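A worked version of this example (with hypothetical measured values of 1.0 for each man and 0.0 for each woman, chosen only to make the arithmetic visible) shows how the weights restore the 50/50 expectation:

```python
import statistics

# Worked version of the example above, with hypothetical measurements: 1.0 for each
# man and 0.0 for each woman, chosen only so the effect of the weights is visible.
men = [1.0] * 20
women = [0.0] * 80
w_men, w_women = 2.5, 0.625                        # weights restoring a 50/50 population

unweighted = statistics.mean(men + women)          # 0.2 -- dominated by the 80 women
weighted = (w_men * sum(men) + w_women * sum(women)) / (
    w_men * len(men) + w_women * len(women)
)                                                  # 0.5 -- as if 50 men and 50 women

print(unweighted, weighted)
```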

Selection bias

From Wikipedia, the free encyclopedia
 
Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may be false.

Types

There are many types of possible selection bias, including:

Sampling bias

Sampling bias is systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample, defined as a statistical sample of a population (or non-human factors) in which all participants are not equally balanced or objectively represented. It is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias.

A distinction of sampling bias (albeit not a universally accepted one) is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

Examples of sampling bias include self-selection, pre-screening of trial participants, discounting trial subjects/tests that did not run to completion and migration bias by excluding subjects who have recently moved into or out of the study area.

Time interval

  • Early termination of a trial at a time when its results support the desired conclusion.
  • A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.

Exposure

  • Susceptibility bias
    • Clinical susceptibility bias, when one disease predisposes for a second disease, and the treatment for the first disease erroneously appears to predispose to the second disease. For example, postmenopausal syndrome gives a higher likelihood of also developing endometrial cancer, so estrogens given for the postmenopausal syndrome may receive a higher than actual blame for causing endometrial cancer.
    • Protopathic bias, when a treatment for the first symptoms of a disease or other outcome appear to cause the outcome. It is a potential bias when there is a lag time from the first symptoms and start of treatment before actual diagnosis. It can be mitigated by lagging, that is, exclusion of exposures that occurred in a certain time period before diagnosis.
    • Indication bias, a potential mixup between cause and effect when exposure is dependent on indication, e.g. a treatment is given to people in high risk of acquiring a disease, potentially causing a preponderance of treated people among those acquiring the disease. This may cause an erroneous appearance of the treatment being a cause of the disease.

Data

  • Partitioning (dividing) data with knowledge of the contents of the partitions, and then analyzing them with tests designed for blindly chosen partitions.
  • Post hoc alteration of data inclusion based on arbitrary or subjective reasons, including:
    • Cherry picking, which actually is not selection bias, but confirmation bias, when specific subsets of data are chosen to support a conclusion (e.g. citing examples of plane crashes as evidence of airline flight being unsafe, while ignoring the far more common example of flights that complete safely.)
    • Rejection of bad data on (1) arbitrary grounds, instead of according to previously stated or generally agreed criteria or (2) discarding "outliers" on statistical grounds that fail to take into account important information that could be derived from "wild" observations.

Studies

  • Selection of which studies to include in a meta-analysis (see also combinatorial meta-analysis).
  • Performing repeated experiments and reporting only the most favorable results, perhaps relabelling lab records of other experiments as "calibration tests", "instrumentation errors" or "preliminary surveys".
  • Presenting the most significant result of a data dredge as if it were a single experiment (which is logically the same as the previous item, but is seen as much less dishonest).

Attrition

Attrition bias is a kind of selection bias caused by attrition (loss of participants), discounting trial subjects/tests that did not run to completion. It is closely related to the survivorship bias, where only the subjects that "survived" a process are included in the analysis or the failure bias, where only the subjects that "failed" a process are included. It includes dropout, nonresponse (lower response rate), withdrawal and protocol deviators. It gives biased results where it is unequal in regard to exposure and/or outcome. For example, in a test of a dieting program, the researcher may simply reject everyone who drops out of the trial, but most of those who drop out are those for whom it was not working. Different loss of subjects in intervention and comparison group may change the characteristics of these groups and outcomes irrespective of the studied intervention.

Observer selection

Philosopher Nick Bostrom has argued that data are filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study. In situations where the existence of the observer or the study is correlated with the data, observation selection effects occur, and anthropic reasoning is required.

An example is the past impact event record of Earth: if large impacts cause mass extinctions and ecological disruptions precluding the evolution of intelligent observers for long periods, no one will observe any evidence of large impacts in the recent past (since they would have prevented intelligent observers from evolving). Hence there is a potential bias in the impact record of Earth. Astronomical existential risks might similarly be underestimated due to selection bias, and an anthropic correction has to be introduced.

Mitigation

In the general case, selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases. An assessment of the degree of selection bias can be made by examining correlations between exogenous (background) variables and a treatment indicator. However, in regression models, it is correlation between unobserved determinants of the outcome and unobserved determinants of selection into the sample which bias estimates, and this correlation between unobservables cannot be directly assessed by the observed determinants of treatment.

Related issues

Selection bias is closely related to:
  • publication bias or reporting bias, the distortion produced in community perception or meta-analyses by not publishing uninteresting (usually negative) results, or results which go against the experimenter's prejudices, a sponsor's interests, or community expectations.
  • confirmation bias, the general tendency of humans to give more attention to whatever confirms our pre-existing perspective; or specifically in experimental science, the distortion produced by experiments that are designed to seek confirmatory evidence instead of trying to disprove the hypothesis.
  • exclusion bias, results from applying different criteria to cases and controls in regards to participation eligibility for a study/different variables serving as basis for exclusion.
