Friday, July 7, 2023

Linear regression

From Wikipedia, the free encyclopedia

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:

  • If the goal is error reduction in prediction or forecasting, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
  • If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.
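
As a rough illustration of the difference between plain least squares and a penalized fit, the sketch below compares ordinary least squares with ridge regression on simulated data. The data, the penalty value lam=10.0, and the helper names fit_ols and fit_ridge are assumptions made for this example; no intercept is included, so the question of whether to penalize it does not arise.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares via a numerically stable solver."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_ridge(X, y, lam):
    """Ridge regression: minimise ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

beta_ols = fit_ols(X, y)
beta_ridge = fit_ridge(X, y, lam=10.0)   # coefficients shrunk toward zero
```

With lam set to 0 the ridge solution coincides with the least squares one; larger values trade some bias for smaller coefficients.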

Formulation

[Figure: In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).]

Given a data set of n statistical units, a linear regression model assumes that the relationship between the dependent variable y and the vector of regressors x is linear. This relationship is modeled through a disturbance term or error variable ε — an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form

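\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^{\mathsf T}\boldsymbol\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \]

where \(^{\mathsf T}\) denotes the transpose, so that \(\mathbf{x}_i^{\mathsf T}\boldsymbol\beta\) is the inner product between the vectors \(\mathbf{x}_i\) and \(\boldsymbol\beta\).

Often these n equations are stacked together and written in matrix notation as

\[ \mathbf{y} = X\boldsymbol\beta + \boldsymbol\varepsilon, \]

where

\[ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} \mathbf{x}_1^{\mathsf T} \\ \mathbf{x}_2^{\mathsf T} \\ \vdots \\ \mathbf{x}_n^{\mathsf T} \end{pmatrix}
  = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\boldsymbol\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\boldsymbol\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \]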

Notation and terminology

  • \(\mathbf{y}\) is a vector of observed values \(y_i\) (\(i = 1, \ldots, n\)) of the variable called the regressand, endogenous variable, response variable, measured variable, criterion variable, or dependent variable. This variable is also sometimes known as the predicted variable, but this should not be confused with predicted values, which are denoted \(\hat{y}\). The decision as to which variable in a data set is modeled as the dependent variable and which are modeled as the independent variables may be based on a presumption that the value of one of the variables is caused by, or directly influenced by, the other variables. Alternatively, there may be an operational reason to model one of the variables in terms of the others, in which case there need be no presumption of causality.
  • \(X\) may be seen as a matrix of row-vectors \(\mathbf{x}_{i\cdot}\) or of n-dimensional column-vectors \(\mathbf{x}_{\cdot j}\), which are known as regressors, exogenous variables, explanatory variables, covariates, input variables, predictor variables, or independent variables (not to be confused with the concept of independent random variables). The matrix \(X\) is sometimes called the design matrix.
    • Usually a constant is included as one of the regressors. In particular, \(x_{i0} = 1\) for \(i = 1, \ldots, n\). The corresponding element of \(\boldsymbol\beta\) is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero.
    • Sometimes one of the regressors can be a non-linear function of another regressor or of the data, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector \(\boldsymbol\beta\).
    • The values \(x_{ij}\) may be viewed as either observed values of random variables \(X_j\) or as fixed values chosen prior to observing the dependent variable. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however, different approaches to asymptotic analysis are used in these two situations.
  • \(\boldsymbol\beta\) is a \((p+1)\)-dimensional parameter vector, where \(\beta_0\) is the intercept term (if one is included in the model; otherwise \(\boldsymbol\beta\) is p-dimensional). Its elements are known as effects or regression coefficients (although the latter term is sometimes reserved for the estimated effects). In simple linear regression, p = 1, and the coefficient is known as the regression slope. Statistical estimation and inference in linear regression focuses on \(\boldsymbol\beta\). The elements of this parameter vector are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables.
  • \(\boldsymbol\varepsilon\) is a vector of values \(\varepsilon_i\). This part of the model is called the error term, disturbance term, or sometimes noise (in contrast with the "signal" provided by the rest of the model). This variable captures all other factors which influence the dependent variable y other than the regressors x. The relationship between the error term and the regressors, for example their correlation, is a crucial consideration in formulating a linear regression model, as it will determine the appropriate estimation method.

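Fitting a linear model to a given data set usually requires estimating the regression coefficients \(\boldsymbol\beta\) such that the error term \(\boldsymbol\varepsilon = \mathbf{y} - X\boldsymbol\beta\) is minimized. For example, it is common to use the sum of squared errors \(\|\boldsymbol\varepsilon\|_2^2\) as a measure of \(\boldsymbol\varepsilon\) for minimization.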
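
The matrix formulation can be made concrete with a small sketch. The data values and the coefficient vector beta_true below are invented for illustration; the response is generated without noise so that the least squares fit should recover the coefficients up to rounding.

```python
import numpy as np

# Hypothetical data: n = 5 observations of p = 2 regressors.
x = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
beta_true = np.array([1.0, 2.0, -3.0])   # (beta_0, beta_1, beta_2), assumed

# Design matrix X: a leading column of ones carries the intercept beta_0.
X = np.column_stack([np.ones(len(x)), x])

# Noise-free response, so the error term is zero by construction.
y = X @ beta_true

# Least squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sse = float(residuals @ residuals)       # sum of squared errors
```

With real data the residuals would not vanish, and sse would be the quantity that the least squares fit minimizes.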

Example

Consider a situation where a small ball is tossed up in the air and we then measure its heights of ascent \(h_i\) at various moments in time \(t_i\). Physics tells us that, ignoring drag, the relationship can be modeled as

\[ h_i = \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i, \]

where \(\beta_1\) determines the initial velocity of the ball, \(\beta_2\) is proportional to the standard gravity, and \(\varepsilon_i\) is due to measurement errors. Linear regression can be used to estimate the values of \(\beta_1\) and \(\beta_2\) from the measured data. This model is non-linear in the time variable, but it is linear in the parameters \(\beta_1\) and \(\beta_2\); if we take regressors \(\mathbf{x}_i = (x_{i1}, x_{i2}) = (t_i, t_i^2)\), the model takes on the standard form

\[ h_i = \mathbf{x}_i^{\mathsf T}\boldsymbol\beta + \varepsilon_i. \]
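
The ball example can be tried numerically. The sketch below simulates measurements under assumed values (initial velocity 10 m/s, g = 9.81 m/s², measurement noise with standard deviation 0.05 m) and assumes the drag-free relation h = v0·t − g·t²/2, so that β2 = −g/2. None of these numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustration values: initial velocity 10 m/s, g = 9.81 m/s^2.
v0, g = 10.0, 9.81
t = np.linspace(0.1, 2.0, 40)
h = v0 * t - 0.5 * g * t**2 + rng.normal(scale=0.05, size=t.size)

# Regressors x_i = (t_i, t_i^2); the model has no intercept term.
X = np.column_stack([t, t**2])
(beta1, beta2), *_ = np.linalg.lstsq(X, h, rcond=None)

# Under the assumed drag-free relation, beta2 = -g/2.
g_hat = -2.0 * beta2
```

The fit is an ordinary linear least squares problem even though the height is a quadratic function of time.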

Assumptions

Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables, the response variables and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.

[Figure: Example of a cubic polynomial regression, which is a type of linear regression. Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.]
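
A cubic polynomial fit can be written as a linear least squares problem by using powers of x as columns of the design matrix. The coefficients in coef_true are invented for this sketch, and the data are noise-free so that the recovered coefficients can be compared directly.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
coef_true = np.array([1.0, -2.0, 0.5, 3.0])   # for 1, x, x^2, x^3 (assumed)

# Columns 1, x, x^2, x^3: nonlinear in x, but linear in the coefficients.
X = np.vander(x, N=4, increasing=True)
y = X @ coef_true                             # noise-free for clarity

coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```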

The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares):

  • Weak exogeneity. This essentially means that the predictor variables x can be treated as fixed values, rather than random variables. This means, for example, that the predictor variables are assumed to be error-free—that is, not contaminated with measurement errors. Although this assumption is not realistic in many settings, dropping it leads to significantly more difficult errors-in-variables models.
  • Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables. Note that this assumption is much less restrictive than it may at first seem. Because the predictor variables are treated as fixed values (see above), linearity is really only a restriction on the parameters. The predictor variables themselves can be arbitrarily transformed, and in fact multiple copies of the same underlying predictor variable can be added, each one transformed differently. This technique is used, for example, in polynomial regression, which uses linear regression to fit the response variable as an arbitrary polynomial function (up to a given degree) of a predictor variable. With this much flexibility, models such as polynomial regression often have "too much power", in that they tend to overfit the data. As a result, some kind of regularization must typically be used to prevent unreasonable solutions coming out of the estimation process. Common examples are ridge regression and lasso regression. Bayesian linear regression can also be used, which by its nature is more or less immune to the problem of overfitting. (In fact, ridge regression and lasso regression can both be viewed as special cases of Bayesian linear regression, with particular types of prior distributions placed on the regression coefficients.)
  • Constant variance (a.k.a. homoscedasticity). This means that the variance of the errors does not depend on the values of the predictor variables. Thus the variability of the responses for given fixed values of the predictors is the same regardless of how large or small the responses are. This is often not the case, as a variable whose mean is large will typically have a greater variance than one whose mean is small. For example, a person whose income is predicted to be $100,000 may easily have an actual income of $80,000 or $120,000—i.e., a standard deviation of around $20,000—while another person with a predicted income of $10,000 is unlikely to have the same $20,000 standard deviation, since that would imply their actual income could vary anywhere between −$10,000 and $30,000. (In fact, as this shows, in many cases—often the same cases where the assumption of normally distributed errors fails—the variance or standard deviation should be predicted to be proportional to the mean, rather than constant.) The absence of homoscedasticity is called heteroscedasticity. In order to check this assumption, a plot of residuals versus predicted values (or the values of each individual predictor) can be examined for a "fanning effect" (i.e., increasing or decreasing vertical spread as one moves left to right on the plot). A plot of the absolute or squared residuals versus the predicted values (or each predictor) can also be examined for a trend or curvature. Formal tests can also be used; see Heteroscedasticity. The presence of heteroscedasticity will result in an overall "average" estimate of variance being used instead of one that takes into account the true variance structure. This leads to less precise (but in the case of ordinary least squares, not biased) parameter estimates and biased standard errors, resulting in misleading tests and interval estimates. The mean squared error for the model will also be wrong. 
Various estimation techniques including weighted least squares and the use of heteroscedasticity-consistent standard errors can handle heteroscedasticity in a quite general way. Bayesian linear regression techniques can also be used when the variance is assumed to be a function of the mean. It is also possible in some cases to fix the problem by applying a transformation to the response variable (e.g., fitting the logarithm of the response variable using a linear regression model, which implies that the response variable itself has a log-normal distribution rather than a normal distribution).
To check for violations of the assumptions of linearity, constant variance, and independence of errors within a linear regression model, the residuals are typically plotted against the predicted values (or each of the individual predictors). An apparently random scatter of points about the horizontal midline at 0 is ideal, but cannot rule out certain kinds of violations such as autocorrelation in the errors or their correlation with one or more covariates.
  • Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other. (Actual statistical independence is a stronger condition than mere lack of correlation and is often not needed, although it can be exploited if it is known to hold.) Some methods such as generalized least squares are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors. Bayesian linear regression is a general way of handling this issue.
  • Lack of perfect multicollinearity in the predictors. For standard least squares estimation methods, the design matrix X must have full column rank p; otherwise perfect multicollinearity exists in the predictor variables, meaning a linear relationship exists between two or more predictor variables. This can be caused by accidentally duplicating a variable in the data, using a linear transformation of a variable along with the original (e.g., the same temperature measurements expressed in Fahrenheit and Celsius), or including a linear combination of multiple variables in the model, such as their mean. It can also happen if there is too little data available compared to the number of parameters to be estimated (e.g., fewer data points than regression coefficients). Near violations of this assumption, where predictors are highly but not perfectly correlated, can reduce the precision of parameter estimates (see Variance inflation factor). In the case of perfect multicollinearity, the parameter vector β will be non-identifiable: it has no unique solution. In such a case, only some of the parameters can be identified (i.e., their values can only be estimated within some linear subspace of the full parameter space \(\mathbb{R}^p\)). See partial least squares regression. Methods for fitting linear models with multicollinearity have been developed, some of which require additional assumptions such as "effect sparsity", meaning that a large fraction of the effects are exactly zero. Note that the more computationally expensive iterated algorithms for parameter estimation, such as those used in generalized linear models, do not suffer from this problem.
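
The Fahrenheit and Celsius example of perfect multicollinearity can be checked directly by computing the rank of the design matrix. The temperature readings below are made up for the sketch.

```python
import numpy as np

celsius = np.array([10.0, 15.0, 20.0, 25.0, 30.0])   # hypothetical readings
fahrenheit = celsius * 9.0 / 5.0 + 32.0               # exact linear transform

# Intercept plus both temperature scales: the third column equals
# 32 * (column 0) + 1.8 * (column 1), so the columns are linearly dependent.
X_bad = np.column_stack([np.ones_like(celsius), celsius, fahrenheit])
X_ok = X_bad[:, :2]                                   # drop the redundant column

rank_bad = np.linalg.matrix_rank(X_bad)   # smaller than the number of columns
rank_ok = np.linalg.matrix_rank(X_ok)     # equal to the number of columns
```

When the rank is smaller than the number of columns, the least squares coefficients are not uniquely determined.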

Beyond these assumptions, several other statistical properties of the data strongly influence the performance of different estimation methods:

  • The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties such as being unbiased and consistent.
  • The arrangement, or probability distribution of the predictor variables x has a major influence on the precision of estimates of β. Sampling and design of experiments are highly developed subfields of statistics that provide guidance for collecting data in such a way to achieve a precise estimate of β.

Interpretation

[Figure: The data sets in Anscombe's quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but are graphically very different. This illustrates the pitfalls of relying solely on a fitted model to understand the relationship between variables.]

A fitted linear regression model can be used to identify the relationship between a single predictor variable \(x_j\) and the response variable y when all the other predictor variables in the model are "held fixed". Specifically, the interpretation of \(\beta_j\) is the expected change in y for a one-unit change in \(x_j\) when the other covariates are held fixed, that is, the expected value of the partial derivative of y with respect to \(x_j\). This is sometimes called the unique effect of \(x_j\) on y. In contrast, the marginal effect of \(x_j\) on y can be assessed using a correlation coefficient or simple linear regression model relating only \(x_j\) to y; this effect is the total derivative of y with respect to \(x_j\).

Care must be taken when interpreting regression results, as some of the regressors may not allow for marginal changes (such as dummy variables, or the intercept term), while others cannot be held fixed (recall the ball example above: it would be impossible to "hold \(t_i\) fixed" and at the same time change the value of \(t_i^2\)).

It is possible that the unique effect can be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all the information in \(x_j\), so that once that variable is in the model, there is no contribution of \(x_j\) to the variation in y. Conversely, the unique effect of \(x_j\) can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of y, but they mainly explain variation in a way that is complementary to what is captured by \(x_j\). In this case, including the other variables in the model reduces the part of the variability of y that is unrelated to \(x_j\), thereby strengthening the apparent relationship with \(x_j\).
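
The contrast between a large marginal effect and a near-zero unique effect can be reproduced with simulated data. In this sketch the variable x1 is constructed as a noisy copy of x2, and y depends on x2 only; the sample size, noise levels, and coefficient 3.0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)     # x1 is a noisy copy of x2
y = 3.0 * x2 + rng.normal(size=n)      # y depends on x2 only

ones = np.ones(n)

# Marginal effect of x1: simple regression of y on x1 alone.
b_marg, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)

# Unique effect of x1: multiple regression that also includes x2.
b_uniq, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)

marginal_effect = b_marg[1]   # clearly non-zero
unique_effect = b_uniq[1]     # close to zero once x2 is in the model
```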

The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have a common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study.

The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of a predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design.

Group effects

In a multiple linear regression model

\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \]

the parameter \(\beta_j\) of predictor variable \(x_j\) represents the individual effect of \(x_j\). It has an interpretation as the expected change in the response variable \(y\) when \(x_j\) increases by one unit with other predictor variables held constant. When \(x_j\) is strongly correlated with other predictor variables, it is improbable that \(x_j\) can increase by one unit with other variables held constant. In this case, the interpretation of \(\beta_j\) becomes problematic as it is based on an improbable condition, and the effect of \(x_j\) cannot be evaluated in isolation.

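For a group of predictor variables, say, \(\{x_1, x_2, \ldots, x_q\}\), a group effect \(\xi(\mathbf{w})\) is defined as a linear combination of their parameters

\[ \xi(\mathbf{w}) = w_1\beta_1 + w_2\beta_2 + \cdots + w_q\beta_q, \]

where \(\mathbf{w} = (w_1, w_2, \ldots, w_q)^{\mathsf T}\) is a weight vector satisfying \(\sum_{j=1}^{q} |w_j| = 1\). Because of the constraint on \(w_j\), \(\xi(\mathbf{w})\) is also referred to as a normalized group effect. A group effect \(\xi(\mathbf{w})\) has an interpretation as the expected change in \(y\) when variables in the group \(x_1, x_2, \ldots, x_q\) change by the amount \(w_1, w_2, \ldots, w_q\), respectively, at the same time with variables not in the group held constant. It generalizes the individual effect of a variable to a group of variables in that (i) if \(q = 1\), then the group effect reduces to an individual effect, and (ii) if \(w_i = 1\) and \(w_j = 0\) for \(j \neq i\), then the group effect also reduces to an individual effect. A group effect \(\xi(\mathbf{w})\) is said to be meaningful if the underlying simultaneous changes of the \(q\) variables are probable.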

Group effects provide a means to study the collective impact of strongly correlated predictor variables in linear regression models. Individual effects of such variables are not well-defined as their parameters do not have good interpretations. Furthermore, when the sample size is not large, none of their parameters can be accurately estimated by the least squares regression due to the multicollinearity problem. Nevertheless, there are meaningful group effects that have good interpretations and can be accurately estimated by the least squares regression. A simple way to identify these meaningful group effects is to use an all positive correlations (APC) arrangement of the strongly correlated variables under which pairwise correlations among these variables are all positive, and standardize all predictor variables in the model so that they all have mean zero and length one. To illustrate this, suppose that \(\{x_1, x_2, \ldots, x_q\}\) is a group of strongly correlated variables in an APC arrangement and that they are not strongly correlated with predictor variables outside the group. Let \(y'\) be the centred \(y\) and \(x_j'\) be the standardized \(x_j\). Then, the standardized linear regression model is

\[ y' = \beta_1' x_1' + \cdots + \beta_p' x_p' + \varepsilon. \]

Parameters \(\beta_j\) in the original model, including \(\beta_0\), are simple functions of \(\beta_j'\) in the standardized model. The standardization of variables does not change their correlations, so \(\{x_1', x_2', \ldots, x_q'\}\) is a group of strongly correlated variables in an APC arrangement and they are not strongly correlated with other predictor variables in the standardized model. A group effect of \(\{x_1', x_2', \ldots, x_q'\}\) is

\[ \xi'(\mathbf{w}) = w_1\beta_1' + w_2\beta_2' + \cdots + w_q\beta_q', \]

and its minimum-variance unbiased linear estimator is

\[ \hat\xi'(\mathbf{w}) = w_1\hat\beta_1' + w_2\hat\beta_2' + \cdots + w_q\hat\beta_q', \]

where \(\hat\beta_j'\) is the least squares estimator of \(\beta_j'\). In particular, the average group effect of the \(q\) standardized variables is

\[ \xi_A = \frac{1}{q}\left(\beta_1' + \beta_2' + \cdots + \beta_q'\right), \]

which has an interpretation as the expected change in \(y'\) when all \(x_j'\) in the strongly correlated group increase by \((1/q)\)th of a unit at the same time with variables outside the group held constant. With strong positive correlations and in standardized units, variables in the group are approximately equal, so they are likely to increase at the same time and in similar amount. Thus, the average group effect \(\xi_A\) is a meaningful effect. It can be accurately estimated by its minimum-variance unbiased linear estimator \(\hat\xi_A = \frac{1}{q}(\hat\beta_1' + \hat\beta_2' + \cdots + \hat\beta_q')\), even when individually none of the \(\beta_j'\) can be accurately estimated by \(\hat\beta_j'\).

Not all group effects are meaningful or can be accurately estimated. For example, \(\beta_1'\) is a special group effect with weights \(w_1 = 1\) and \(w_j = 0\) for \(j \neq 1\), but it cannot be accurately estimated by \(\hat\beta_1'\). It is also not a meaningful effect. In general, for a group of \(q\) strongly correlated predictor variables in an APC arrangement in the standardized model, group effects whose weight vectors \(\mathbf{w}\) are at or near the centre of the simplex \(\sum_{j=1}^{q} w_j = 1\) (\(w_j \geq 0\)) are meaningful and can be accurately estimated by their minimum-variance unbiased linear estimators. Effects with weight vectors far away from the centre are not meaningful as such weight vectors represent simultaneous changes of the variables that violate the strong positive correlations of the standardized variables in an APC arrangement. As such, they are not probable. These effects also cannot be accurately estimated.

Applications of the group effects include (1) estimation and inference for meaningful group effects on the response variable, (2) testing for "group significance" of the \(q\) variables via testing \(H_0: \xi_A = 0\) versus \(H_1: \xi_A \neq 0\), and (3) characterizing the region of the predictor variable space over which predictions by the least squares estimated model are accurate.

A group effect of the original variables \(\{x_1, x_2, \ldots, x_q\}\) can be expressed as a constant times a group effect of the standardized variables \(\{x_1', x_2', \ldots, x_q'\}\). The former is meaningful when the latter is. Thus meaningful group effects of the original variables can be found through meaningful group effects of the standardized variables.
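
The claim that the average group effect can be estimated far more precisely than an individual effect can be illustrated through the variance of a linear combination of least squares estimates, which is σ² wᵀ(XᵀX)⁻¹w for a weight vector w. The sketch below compares this variance factor for the average weights and for a single-variable weight vector. The simulated predictors, the sample size, and the assumption that the model contains only the q group variables are simplifications made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 30, 3
z = rng.normal(size=n)
# q strongly, positively correlated predictors (an APC arrangement).
X = z[:, None] + 0.05 * rng.normal(size=(n, q))

# Standardise: mean zero and unit length for each column.
Xc = X - X.mean(axis=0)
Xs = Xc / np.linalg.norm(Xc, axis=0)

# Var(w' beta_hat') = sigma^2 * w' (Xs' Xs)^{-1} w; sigma^2 is left out.
G_inv = np.linalg.inv(Xs.T @ Xs)

def variance_factor(w):
    """Variance of the estimated group effect, up to the factor sigma^2."""
    return float(w @ G_inv @ w)

w_avg = np.full(q, 1.0 / q)            # centre of the simplex
w_one = np.array([1.0, 0.0, 0.0])      # individual effect beta_1'

vf_avg = variance_factor(w_avg)        # small
vf_one = variance_factor(w_one)        # inflated by multicollinearity
```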

Extensions

Numerous extensions of linear regression have been developed, which allow some or all of the assumptions underlying the basic model to be relaxed.

Simple and multiple linear regression

[Figure: Example of simple linear regression, which has one independent variable.]

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression (not to be confused with multivariate linear regression).

Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \]

for each observation i = 1, ..., n.

In the formula above we consider n observations of one dependent variable and p independent variables. Thus, \(Y_i\) is the ith observation of the dependent variable, and \(X_{ij}\) is the ith observation of the jth independent variable, j = 1, 2, ..., p. The values \(\beta_j\) represent parameters to be estimated, and \(\varepsilon_i\) is the ith independent identically distributed normal error.

In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share the same set of explanatory variables and hence are estimated simultaneously with each other:

\[ Y_{ij} = \beta_{0j} + \beta_{1j} X_{i1} + \beta_{2j} X_{i2} + \cdots + \beta_{pj} X_{ip} + \varepsilon_{ij} \]

for all observations indexed as i = 1, ..., n and for all dependent variables indexed as j = 1, ..., m.

Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression.

General linear models

The general linear model considers the situation when the response variable is not a scalar (for each observation) but a vector, \(\mathbf{y}_i\). Conditional linearity of \(E(\mathbf{y} \mid \mathbf{x}_i) = \mathbf{x}_i^{\mathsf T} B\) is still assumed, with a matrix B replacing the vector β of the classical linear regression model. Multivariate analogues of ordinary least squares (OLS) and generalized least squares (GLS) have been developed. "General linear models" are also called "multivariate linear models". These are not the same as multivariable linear models (also called "multiple linear models").
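
A vector-valued response can be fitted with the same least squares routine by passing a matrix of responses. In this sketch the coefficient matrix B_true, the dimensions, and the noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m = 200, 2, 3                      # observations, regressors, responses
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B_true = np.array([[1.0, 0.0, -1.0],     # intercepts, one per response
                   [2.0, 0.5,  0.0],
                   [0.0, 1.5,  3.0]])
Y = X @ B_true + 0.1 * rng.normal(size=(n, m))

# lstsq accepts a matrix right-hand side and returns one column per response.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The same numbers result from fitting a single response on its own.
b_first, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
```

For plain least squares the joint fit therefore coincides with m separate regressions that share one design matrix.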

Heteroscedastic models

Various models have been created that allow for heteroscedasticity, i.e. the errors for different response variables may have different variances. For example, weighted least squares is a method for estimating linear regression models when the response variables may have different error variances; generalized least squares extends this to correlated errors. (See also Weighted linear least squares, and Generalized least squares.) Heteroscedasticity-consistent standard errors are an improved method for use with uncorrelated but potentially heteroscedastic errors.
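
A minimal weighted least squares sketch follows, assuming the error standard deviation of each observation is known (here it grows in proportion to x, an arbitrary choice). Scaling each row by the square root of its weight turns the weighted problem into an ordinary one.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
sigma = 0.2 * x                          # error sd grows with x (assumed known)
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma**2                       # weights = inverse error variances

# WLS solves (X' W X) b = X' W y; scaling rows by sqrt(w) reduces it to OLS.
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Unweighted fit for comparison: still unbiased, but less precise.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```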

Generalized linear models

Generalized linear models (GLMs) are a framework for modeling response variables that are bounded or discrete. They are used, for example:

  • when modeling positive quantities (e.g. prices or populations) that vary over a large scale—which are better described using a skewed distribution such as the log-normal distribution or Poisson distribution (although GLMs are not used for log-normal data, instead the response variable is simply transformed using the logarithm function);
  • when modeling categorical data, such as the choice of a given candidate in an election (which is better described using a Bernoulli distribution/binomial distribution for binary choices, or a categorical distribution/multinomial distribution for multi-way choices), where there are a fixed number of choices that cannot be meaningfully ordered;
  • when modeling ordinal data, e.g. ratings on a scale from 0 to 5, where the different outcomes can be ordered but where the quantity itself may not have any absolute meaning (e.g. a rating of 4 may not be "twice as good" in any objective sense as a rating of 2, but simply indicates that it is better than 2 or 3 but not as good as 5).

Generalized linear models allow for an arbitrary link function, g, that relates the mean of the response variable(s) to the predictors: \(E(Y) = g^{-1}(XB)\). The link function is often related to the distribution of the response, and in particular it typically has the effect of transforming between the range of the linear predictor and the range of the response variable.
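
To make the role of the link function concrete, here is a sketch of one GLM, Poisson regression with a log link, fitted by iteratively reweighted least squares. The function name fit_poisson_irls, the fixed iteration count, and the simulated data are assumptions of this example, and a production implementation would add convergence checks.

```python
import numpy as np

def fit_poisson_irls(X, y, n_iter=25):
    """Poisson regression with log link via iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                 # linear predictor
        mu = np.exp(eta)               # inverse link: E(Y) = g^{-1}(eta)
        z = eta + (y - mu) / mu        # working response
        W = mu                         # working weights for the log link
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# Simulated count data (illustrative values only).
rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))

beta_hat = fit_poisson_irls(X, y)
```

Each iteration is itself a weighted linear least squares fit, which is why linear regression machinery carries over to GLMs.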

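Some common examples of GLMs are:

  • Poisson regression for count data.
  • Logistic regression and probit regression for binary data.
  • Multinomial logistic regression and multinomial probit regression for categorical data.
  • Ordered logit and ordered probit regression for ordinal data.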

Single index models allow some degree of nonlinearity in the relationship between x and y, while preserving the central role of the linear predictor \(\boldsymbol\beta^{\mathsf T}\mathbf{x}\) as in the classical linear regression model. Under certain conditions, simply applying OLS to data from a single-index model will consistently estimate β up to a proportionality constant.

Hierarchical linear models

Hierarchical linear models (or multilevel regression) organize the data into a hierarchy of regressions, for example where A is regressed on B, and B is regressed on C. They are often used where the variables of interest have a natural hierarchical structure such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at the classroom, school, and school district levels.

Errors-in-variables

Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero.
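
The attenuation effect can be seen in a small simulation. The sketch assumes classical additive measurement error that is independent of the true predictor and of the regression error; under that assumption the fitted slope shrinks by the factor var(x) / (var(x) + var(error)), which is 1/2 for the variances chosen here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
x_true = rng.normal(size=n)                    # latent predictor, variance 1
y = 2.0 * x_true + 0.1 * rng.normal(size=n)    # true slope is 2
x_obs = x_true + rng.normal(size=n)            # measurement error, variance 1

def ols_slope(x, y):
    """Slope of a simple linear regression of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

slope_clean = ols_slope(x_true, y)   # close to the true slope
slope_noisy = ols_slope(x_obs, y)    # attenuated toward 2 * 1/(1+1) = 1
```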

Others

  • In Dempster–Shafer theory, or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models.

Estimation methods

A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.

Some of the more common estimation techniques for linear regression are summarized below.

Least-squares estimation and related techniques

Francis Galton's 1886 illustration of the correlation between the heights of adults and their parents. The observation that adult children's heights tended to deviate less from the mean height than their parents' heights suggested the concept of "regression toward the mean", giving regression its name. The "locus of horizontal tangential points" passing through the leftmost and rightmost points on the ellipse (which is a level curve of the bivariate normal distribution estimated from the data) is the OLS estimate of the regression of parents' heights on children's heights, while the "locus of vertical tangential points" is the OLS estimate of the regression of children's heights on parents' heights. The major axis of the ellipse is the TLS estimate.

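Assuming that the independent variable is $\vec{x}_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$ and the model's parameters are $\vec{\beta} = [\beta_0, \beta_1, \ldots, \beta_m]$, then the model's prediction would be

$$y_i \approx \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij}.$$

If $\vec{x}_i$ is extended to $\vec{x}_i = [1, x_{i1}, x_{i2}, \ldots, x_{im}]$, so that $x_{i0} = 1$, then $y_i$ would become a dot product of the parameter and the independent variable, i.e.

$$y_i \approx \sum_{j=0}^{m} \beta_j x_{ij} = \vec{\beta} \cdot \vec{x}_i.$$

In the least-squares setting, the optimum parameter is defined as the one that minimizes the sum of squared losses:

$$\hat{\vec{\beta}} = \arg\min_{\vec{\beta}} L(D, \vec{\beta}) = \arg\min_{\vec{\beta}} \sum_{i=1}^{n} \left(\vec{\beta} \cdot \vec{x}_i - y_i\right)^2.$$

Now putting the independent and dependent variables in matrices $X$ and $Y$ respectively, the loss function can be rewritten as:

$$L(D, \vec{\beta}) = \|X\vec{\beta} - Y\|^2 = (X\vec{\beta} - Y)^{\mathsf T}(X\vec{\beta} - Y) = Y^{\mathsf T}Y - Y^{\mathsf T}X\vec{\beta} - \vec{\beta}^{\mathsf T}X^{\mathsf T}Y + \vec{\beta}^{\mathsf T}X^{\mathsf T}X\vec{\beta}.$$

As the loss is convex, the optimum solution lies at gradient zero. The gradient of the loss function is (using the denominator layout convention):

$$\frac{\partial L(D, \vec{\beta})}{\partial \vec{\beta}} = -2X^{\mathsf T}Y + 2X^{\mathsf T}X\vec{\beta}.$$

Setting the gradient to zero produces the optimum parameter:

$$-2X^{\mathsf T}Y + 2X^{\mathsf T}X\hat{\vec{\beta}} = 0 \;\Rightarrow\; X^{\mathsf T}X\hat{\vec{\beta}} = X^{\mathsf T}Y \;\Rightarrow\; \hat{\vec{\beta}} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}Y.$$

Note: To prove that the obtained $\hat{\vec{\beta}}$ is indeed a minimum, one needs to differentiate once more to obtain the Hessian matrix, $2X^{\mathsf T}X$, and show that it is positive definite, which holds whenever $X$ has full column rank. Under the assumptions of the Gauss–Markov theorem, this estimator is also the best linear unbiased estimator.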
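
The closed-form solution can be checked numerically. The following is a minimal sketch on hypothetical, noise-free data generated from y = 1 + 2·x1 − 3·x2; the data and variable names are illustrative assumptions. It solves the normal equations (XᵀX)β = XᵀY directly, compares the result with NumPy's general least-squares routine, and inspects the eigenvalues of the Hessian 2XᵀX.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
features = np.array([[0.0, 1.0],
                     [1.0, 0.0],
                     [2.0, 1.0],
                     [3.0, 5.0],
                     [4.0, 2.0]])
y = 1.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1]

# Prepend a column of ones so that beta_hat[0] acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Solve the normal equations (X^T X) beta = X^T y without forming an inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Eigenvalues of the Hessian 2 X^T X; all positive means positive definite.
hessian_eigenvalues = np.linalg.eigvalsh(2.0 * X.T @ X)

print(beta_hat, beta_lstsq, hessian_eigenvalues)
```

Solving the linear system with `np.linalg.solve` rather than inverting XᵀX explicitly is a common numerical choice; for ill-conditioned problems, `np.linalg.lstsq` is generally the more robust option.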

Linear least squares methods mainly include:

  • Ordinary least squares
  • Weighted least squares
  • Generalized least squares

Maximum-likelihood estimation and related techniques

  • Maximum likelihood estimation can be performed when the distribution of the error terms is known to belong to a certain parametric family fθ of probability distributions. When fθ is a normal distribution with zero mean and variance θ, the resulting estimate is identical to the OLS estimate. GLS estimates are maximum likelihood estimates when ε follows a multivariate normal distribution with a known covariance matrix.
  • Ridge regression and other forms of penalized estimation, such as Lasso regression, deliberately introduce bias into the estimation of β in order to reduce the variability of the estimate. The resulting estimates generally have lower mean squared error than the OLS estimates, particularly when multicollinearity is present or when overfitting is a problem. They are generally used when the goal is to predict the value of the response variable y for values of the predictors x that have not yet been observed. These methods are not as commonly used when the goal is inference, since it is difficult to account for the bias.
  • Least absolute deviation (LAD) regression is a robust estimation technique in that it is less sensitive to the presence of outliers than OLS (but is less efficient than OLS when no outliers are present). It is equivalent to maximum likelihood estimation under a Laplace distribution model for ε.
  • Adaptive estimation. If we assume that error terms are independent of the regressors, $\varepsilon_i \perp \mathbf{x}_i$, then the optimal estimator is the 2-step MLE, where the first step is used to non-parametrically estimate the distribution of the error term.
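
The shrinkage produced by ridge regression can be sketched with its closed form, β = (XᵀX + λI)⁻¹XᵀY. The example below is illustrative only: the nearly collinear predictors, the penalty λ = 1, the absence of an intercept and the helper `ridge` are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the sketch is repeatable
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)       # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)    # hypothetical true coefficients (1, 1)


def ridge(X, y, lam):
    """Closed-form ridge estimate; lam = 0 reduces to OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


beta_ols = ridge(X, y, 0.0)     # unstable when predictors are nearly collinear
beta_ridge = ridge(X, y, 1.0)   # penalized estimate, pulled toward zero

norm_ols = float(np.linalg.norm(beta_ols))
norm_ridge = float(np.linalg.norm(beta_ridge))
print(beta_ols, beta_ridge)
```

For any λ > 0 the ridge coefficient vector has a smaller Euclidean norm than the OLS vector; that is the deliberate bias traded for lower variance. In practice λ is usually chosen by cross-validation, which this sketch omits.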

Other estimation techniques

Comparison of the Theil–Sen estimator (black) and simple linear regression (blue) for a set of points with outliers
  • Bayesian linear regression applies the framework of Bayesian statistics to linear regression. (See also Bayesian multivariate linear regression.) In particular, the regression coefficients β are assumed to be random variables with a specified prior distribution. The prior distribution can bias the solutions for the regression coefficients, in a way similar to (but more general than) ridge regression or lasso regression. In addition, the Bayesian estimation process produces not a single point estimate for the "best" values of the regression coefficients but an entire posterior distribution, completely describing the uncertainty surrounding the quantity. This can be used to estimate the "best" coefficients using the mean, mode, median, any quantile (see quantile regression), or any other function of the posterior distribution.
  • Quantile regression focuses on the conditional quantiles of y given X rather than the conditional mean of y given X. Linear quantile regression models a particular conditional quantile, for example the conditional median, as a linear function $\beta^{\mathsf T}x$ of the predictors.
  • Mixed models are widely used to analyze linear regression relationships involving dependent data when the dependencies have a known structure. Common applications of mixed models include analysis of data involving repeated measurements, such as longitudinal data, or data obtained from cluster sampling. They are generally fit as parametric models, using maximum likelihood or Bayesian estimation. In the case where the errors are modeled as normal random variables, there is a close connection between mixed models and generalized least squares. Fixed effects estimation is an alternative approach to analyzing this type of data.
  • Principal component regression (PCR) is used when the number of predictor variables is large, or when strong correlations exist among the predictor variables. This two-stage procedure first reduces the predictor variables using principal component analysis, and then uses the reduced variables in an OLS regression fit. While it often works well in practice, there is no general theoretical reason that the most informative linear function of the predictor variables should lie among the dominant principal components of the multivariate distribution of the predictor variables. The partial least squares regression is the extension of the PCR method which does not suffer from the mentioned deficiency.
  • Least-angle regression is an estimation procedure for linear regression models that was developed to handle high-dimensional covariate vectors, potentially with more covariates than observations.
  • The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the fit line to be the median of the slopes of the lines through pairs of sample points. It has similar statistical efficiency properties to simple linear regression but is much less sensitive to outliers.
  • Other robust estimation techniques, including the α-trimmed mean approach, and L-, M-, S-, and R-estimators have been introduced.
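
The Theil–Sen procedure described in the list above can be sketched in a few lines. This is a minimal illustration on hypothetical data (points on y = 2x + 1 with one gross outlier). Taking the intercept as the median of y − slope·x is one common convention and is an assumption here, as are the helper names.

```python
from itertools import combinations
from statistics import mean, median


def theil_sen(xs, ys):
    """Slope = median of pairwise slopes; intercept = median of y - slope * x."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]   # skip vertical pairs
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def ols_line(xs, ys):
    """Simple linear regression by the usual closed form, for comparison."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Hypothetical data: exact line y = 2x + 1, with the last point replaced by an outlier.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100

ts_slope, ts_intercept = theil_sen(xs, ys)
ols_slope, ols_intercept = ols_line(xs, ys)
print(ts_slope, ts_intercept, ols_slope, ols_intercept)
```

On this data the Theil–Sen fit recovers the underlying line, while the single outlier pulls the OLS slope well above 2. The pairwise computation is quadratic in the number of points, so larger data sets typically need a subsampled or specialized implementation.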

Applications

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.

Trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.
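
As a minimal sketch of computing a trend line by regression, the example below fits a straight line to a short, made-up series with `numpy.polyfit`; the series values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical series observed at eight equally spaced time points.
t = np.arange(8)
series = np.array([100.0, 103.0, 101.0, 106.0, 108.0, 107.0, 111.0, 114.0])

# Degree-1 polynomial fit; coefficients are returned highest degree first.
slope, intercept = np.polyfit(t, series, deg=1)

trend = intercept + slope * t    # fitted trend line
detrended = series - trend       # what remains after removing the trend

print(slope, intercept)
```

A positive slope indicates an increase over the period. Passing a larger `deg` to `numpy.polyfit` gives the higher-degree polynomial variants mentioned above.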

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

Early evidence relating tobacco smoking to mortality and morbidity came from observational studies employing regression analysis. In order to reduce spurious correlations when analyzing observational data, researchers usually include several variables in their regression models in addition to the variable of primary interest. For example, in a regression model in which cigarette smoking is the independent variable of primary interest and the dependent variable is lifespan measured in years, researchers might include education and income as additional independent variables, to ensure that any observed effect of smoking on lifespan is not due to those other socio-economic factors. However, it is never possible to include all possible confounding variables in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than can be obtained using regression analyses of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.
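
The role of additional covariates can be illustrated with a simulation. The sketch below uses entirely synthetic, hypothetical variables (it does not model smoking or lifespan data): a confounder drives both the exposure and the outcome, while the exposure itself has no direct effect. Regressing the outcome on the exposure alone yields a spurious slope, and including the confounder removes it.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed so the sketch is repeatable
n = 50_000

confounder = rng.normal(size=n)
exposure = confounder + rng.normal(size=n)        # exposure depends on the confounder
outcome = 2.0 * confounder + rng.normal(size=n)   # exposure has no direct effect


def ols(columns, y):
    """OLS coefficients (intercept first) for the given predictor columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


naive_effect = float(ols([exposure], outcome)[1])                 # confounded estimate
adjusted_effect = float(ols([exposure, confounder], outcome)[1])  # confounder included

print(naive_effect, adjusted_effect)
```

Under these assumptions the unadjusted slope is close to 1 even though the true effect is 0, and the adjusted slope is close to 0. The adjustment works only because the confounder is observed; an unmeasured confounder, such as the hypothetical gene mentioned above, would leave the bias in place.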

Finance

The capital asset pricing model uses linear regression as well as the concept of beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.
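
As a minimal sketch, beta can be computed as the slope of a simple regression of an investment's returns on market returns, that is, the covariance of the two return series divided by the variance of the market returns. The return figures and the function name below are hypothetical, and the asset series is constructed so that its beta is exactly 1.5.

```python
from statistics import mean

# Hypothetical per-period returns; the asset is built as 0.001 + 1.5 * market.
market_returns = [0.010, -0.020, 0.015, 0.030, -0.005, 0.020]
asset_returns = [0.001 + 1.5 * r for r in market_returns]


def regression_beta(asset, market):
    """Slope and intercept of the simple regression of asset on market returns."""
    mean_asset, mean_market = mean(asset), mean(market)
    covariance = sum((a - mean_asset) * (m - mean_market)
                     for a, m in zip(asset, market))
    variance = sum((m - mean_market) ** 2 for m in market)
    slope = covariance / variance
    return slope, mean_asset - slope * mean_market


beta, alpha = regression_beta(asset_returns, market_returns)
print(beta, alpha)
```

With real data the points do not fall exactly on a line, and the estimated beta depends on the sample period and on the index used as a proxy for the market.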

Economics

Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending, fixed investment spending, inventory investment, purchases of a country's exports, spending on imports, the demand to hold liquid assets, labor demand, and labor supply.

Environmental science

Linear regression finds application in a wide range of environmental science applications. In Canada, the Environmental Effects Monitoring Program uses statistical analyses on fish and benthic surveys to measure the effects of pulp mill or metal mine effluent on the aquatic ecosystem.

Machine learning

Linear regression plays an important role in the subfield of artificial intelligence known as machine learning. The linear regression algorithm is one of the fundamental supervised machine-learning algorithms due to its relative simplicity and well-known properties.

History

Least squares linear regression, as a means of finding a good rough linear fit to a set of points, was performed by Legendre (1805) and Gauss (1809) for the prediction of planetary movement. Quetelet was responsible for making the procedure well-known and for using it extensively in the social sciences.

Biology and sexual orientation

From Wikipedia, the free encyclopedia

The relationship between biology and sexual orientation is a subject of research. While scientists do not know the exact cause of sexual orientation, they theorize that it is caused by a complex interplay of genetic, hormonal, and environmental influences. Hypotheses for the impact of the post-natal social environment on sexual orientation, however, are weak, especially for males.

Biological theories for explaining the causes of sexual orientation are favored by scientists. These factors, which may be related to the development of a sexual orientation, include genes, the early uterine environment (such as prenatal hormones), and brain structure.

Scientific research and studies

Fetal development and hormones

The influence of hormones on the developing fetus has been the most influential causal hypothesis of the development of sexual orientation. In simple terms, the developing fetal brain begins in a "female" state. Both the INAH3 (third interstitial nucleus of the anterior hypothalamus) area on the left side of the hypothalamus, which stores gender preference, and the center area of the bed nucleus of the stria terminalis (BSTc) on the right side of the hypothalamus, which stores gender identity, are small and function as female. The action of the SRY gene on the Y chromosome in the fetus prompts the development of testes, which release testosterone, the primary androgen receptor-activating hormone, allowing testosterone to enter the cells and masculinize the fetus and fetal brain. If the proper amount of testosterone is received by the INAH3 while it is constructed at 12 weeks after conception, the testosterone overrides the estrogen that is also present and enlarges the INAH3, which is known to be involved in directing typical male sex behavior, such as attraction to females. If the INAH3 does not receive an overriding amount of testosterone, it may not grow to the normal size for males, and may function as female or partially female, causing same-sex attraction to males.

It has been shown that the INAH3 in gay men has likely been exposed to lower levels of testosterone in the brain than in straight men, or had different levels of receptivity to its masculinizing effects, or experienced hormone fluctuations at critical times during fetal development. In women, if the INAH3 receives more testosterone than is normal for females, the INAH3 may enlarge somewhat or even to the size that is normal for males, increasing the likelihood of same-sex attraction. Supporting this are studies of the finger digit ratio of the right hand, which is a robust marker of prenatal testosterone exposure. Lesbians, on average, have significantly more masculine digit ratios, a finding which has been replicated numerous times in cross-cultural studies. While direct effects are hard to measure for ethical reasons, animal experiments in which scientists manipulate exposure to sex hormones during gestation can also induce lifelong male-typical behavior and mounting in female animals, and female-typical behavior in male animals.

Maternal immune responses during fetal development are strongly demonstrated as causing male homosexuality and bisexuality. Research since the 1990s has demonstrated that the more sons a woman has, the higher the chance that later-born sons are gay. During pregnancy, male cells enter a mother's bloodstream, which are foreign to her immune system. In response, she develops antibodies to neutralize them. These antibodies are then released on future male foetuses and may neutralize Y-linked antigens, which play a role in brain masculinization, leaving areas of the brain responsible for sexual attraction in the female-typical position, or attracted to men. Each additional son a mother has increases the levels of these antibodies, thus creating the observed fraternal birth order effect. Biochemical evidence to support this effect was confirmed in a lab study in 2017, finding that mothers with a gay son, particularly those with older brothers, had higher levels of antibodies to the NLGN4Y Y-protein than mothers with heterosexual sons. J. Michael Bailey has described maternal immune responses as "causal" of male homosexuality. This effect is estimated to account for between 15 and 29% of gay men, while other gay and bisexual men are thought to owe sexual orientation to genetic and hormonal interactions.

Socialization theories, which were dominant in the 1900s, favored the idea that children were born "undifferentiated" and were socialized into gender roles and sexual orientation. This led to medical experiments in which newborn and infant boys were surgically reassigned into girls after accidents such as botched circumcisions. These males were then reared and raised as females without being told, which, contrary to expectations, did not make them feminine nor attracted to men. In all published cases that reported sexual orientation, the individuals grew up to be strongly attracted to women. The failure of these experiments demonstrates that socialization effects do not induce feminine-type behavior in males, nor make them attracted to men, and that the organizational effects of hormones on the fetal brain prior to birth have permanent effects. These are indicative of 'nature', not nurture, at least with regard to male sexual orientation.

The sexually dimorphic nucleus of the preoptic area (SDN-POA) is a key region of the brain which differs between males and females in humans and a number of mammals (e.g., sheep/rams, mice, rats), and is caused by sex differences in hormone exposure. The INAH-3 region is bigger in males than in females, and is known to be a critical region in sexual behavior. Dissection studies found that gay men had significantly smaller sized INAH-3 than heterosexual males, which is shifted in the female typical direction, a finding first demonstrated by neuroscientist Simon LeVay, which has been replicated. Dissection studies are rare, however, due to lack of funding and brain samples.

Long-term studies of domesticated sheep led by Charles Roselli have found that 6-8% of rams have a homosexual preference through their life. Dissection of ram brains also found a similar smaller (feminized) structure in homosexually oriented rams compared to heterosexually oriented rams in the equivalent brain region to the human SDN, the ovine sexually dimorphic nucleus (oSDN). The size of the sheep oSDN has also been demonstrated to be formed in utero, rather than postnatally, underscoring the role of prenatal hormones in masculinization of the brain for sexual attraction.

Other studies in humans have relied on brain imaging technology, such as research led by Ivanka Savic which compared hemispheres of the brain. This research found that straight men had right hemispheres 2% larger than the left, described as modest but "highly significant difference" by LeVay. In heterosexual women, the two hemispheres were the same size. In gay men, the two hemispheres were also the same size, or sex atypical, while in lesbians, the right hemispheres were slightly larger than the left, indicating a small shift in the male direction.

A model proposed by evolutionary geneticist William R. Rice argues that a misexpressed epigenetic modifier of testosterone sensitivity or insensitivity that affected development of the brain can explain homosexuality, and can best explain twin discordance. Rice et al. propose that these epimarks normally canalize sexual development, preventing intersex conditions in most of the population, but sometimes failing to erase across generations and causing reversed sexual preference. On grounds of evolutionary plausibility, Gavrilets, Friberg and Rice argue that all mechanisms for exclusive homosexual orientations likely trace back to their epigenetic model. Testing this hypothesis is possible with current stem cell technology.

Prenatal thyroid theory

The prenatal thyroid theory of same-sex attraction/gender dysphoria is based on clinical and developmental observations of youngsters presenting to child psychiatry clinics in Istanbul, Turkey. The report of 12 cases with same-sex attraction/gender dysphoria born to mothers with thyroid diseases was first presented at the EPA Congress, Vienna (2015) and published as an article in the same year. The extremely significant relationship between the two conditions suggested an independent model, named the Prenatal Thyroid Model of Homosexuality. According to Turkish child and adolescent psychiatrist Osman Sabuncuoglu, who generated the theory, maternal thyroid dysfunction may lead to abnormal deviations from gender-specific development in the offspring. An autoimmune destructive process as seen in Hashimoto thyroiditis, a diminished supply of thyroid hormones, and impacts on the prenatal androgen system were all considered as contributing mechanisms. In a follow-up theoretical paper, previous research findings indicating higher rates of polycystic ovary syndrome (PCOS) in female-to-male transsexuals and lesbian women were interpreted as an indication of the Prenatal Thyroid Model, since PCOS and autoimmune thyroiditis are frequently comorbid diseases. Likewise, increased rates of autism spectrum disorder (ASD) in children born to mothers with thyroid dysfunction and the overrepresentation of individuals with ASD in gender dysphoria populations suggest such an association. A second group of young children with this pattern was presented at the IACAPAP Congress, Prague (2018).

The findings from previous research in LGBT populations had called for attention to be paid to the thyroid system. A commentary by Jeffrey Mullen, published shortly after the 2015 article, underlined the importance of the Prenatal Thyroid Model and supported developments in this field. Afterwards, several authors emphasized the role of the thyroid system in sexuality while citing the Prenatal Thyroid Model. Among them, Carosa et al. concluded that, because thyroid hormones strongly affect human sexual function, the thyroid gland must be considered, along with the genitals and the brain, a sexual organ. As a tertiary source, an authoritative book on the interplay between endocrinology, brain and behavior has also cited the thyroid-homosexuality proposal article in its latest edition. Most importantly, a genome-wide genetic association study on male homosexuals identified a significant region on chromosome 14 which is related to autoimmune thyroid dysfunction in human beings. This finding appears to lend substantial support to the Prenatal Thyroid Model.

Genetic influences

Multiple genes have been found to play a role in sexual orientation. Scientists caution that many people misconstrue the meanings of genetic and environmental. Environmental influence does not automatically imply that the social environment influences or contributes to the development of sexual orientation. Hypotheses for the impact of the post-natal social environment on sexual orientation are weak, especially for males. There is, however, a vast non-social environment that is non-genetic yet still biological, such as prenatal development, that likely helps shape sexual orientation.

Twin studies

Identical twins are more likely to have the same sexual orientation than non-identical twins. This indicates that genes have some influence on sexual orientation; however, scientists have found evidence that other events in the womb play a role. Twins may have separate amniotic sacs and placentas, resulting in different exposure and timing of hormones.

A number of twin studies have attempted to compare the relative importance of genetics and environment in the determination of sexual orientation. In a 1991 study of male twins recruited from "homophile publications", Bailey and Pillard found that 52% of monozygotic (MZ) brothers (of whom 59 were questioned) and 22% of the dizygotic (DZ) twins were concordant for homosexuality. 'MZ' indicates identical twins with the same sets of genes and 'DZ' indicates fraternal twins where genes are mixed to an extent similar to that of non-twin siblings. In a study of 61 pairs of twins, researchers found among their mostly male subjects a concordance rate for homosexuality of 66% among monozygotic twins and 30% among dizygotic twins. In 2000, Bailey, Dunne and Martin studied a larger sample of 4,901 Australian twins but reported less than half the level of concordance. They found 20% concordance in the male identical or MZ twins and 24% concordance for the female identical or MZ twins. Self-reported zygosity, sexual attraction, fantasy and behaviours were assessed by questionnaire, and zygosity was serologically checked when in doubt. Other researchers support biological causes for both men's and women's sexual orientation.

A 2008 study of all adult twins in Sweden (more than 7,600 twins) found that same-sex behaviour was explained by both heritable genetic factors and unique environmental factors (which can include the prenatal environment during gestation, exposure to illness in early life, peer groups not shared with a twin, etc.), although a twin study cannot identify which factor is at play. Influences of the shared environment (influences including the family environment, rearing, shared peer groups, culture and societal views, and sharing the same school and community) had no effect for men, and a weak effect for women. This is consistent with the common finding that parenting and culture appear to play no role in male sexual orientation, but may play some small role in women. The study concludes that genetic influences on any lifetime same-sex partner were stronger for men than women, and that "it has been suggested individual differences in heterosexual and homosexual behavior result from unique environmental factors such as prenatal exposure to sex hormones, progressive maternal immunization to sex-specific proteins, or neurodevelopmental factors", although it does not rule out other variables. The use of all adult twins in Sweden was designed to address the criticism of volunteer studies, in which a potential bias towards participation by gay twins may influence the results:

Biometric modeling revealed that, in men, genetic effects explained .34–.39 of the variance [of sexual orientation], the shared environment .00, and the individual-specific environment .61–.66 of the variance. Corresponding estimates among women were .18–.19 for genetic factors, .16–.17 for shared environmental, and .64–.66 for unique environmental factors. Although wide confidence intervals suggest cautious interpretation, the results are consistent with moderate, primarily genetic, familial effects, and moderate to large effects of the nonshared environment (social and biological) on same-sex sexual behavior.

Chromosome linkage studies

Chromosome linkage studies of sexual orientation have indicated the presence of multiple contributing genetic factors throughout the genome. In 1993, Dean Hamer and colleagues published findings from a linkage analysis of a sample of 76 gay brothers and their families. Hamer et al. found that the gay men had more gay male uncles and cousins on the maternal side of the family than on the paternal side. Gay brothers who showed this maternal pedigree were then tested for X chromosome linkage, using twenty-two markers on the X chromosome to test for similar alleles. In another finding, thirty-three of the forty sibling pairs tested were found to have similar alleles in the distal region of Xq28, which was significantly higher than the expected rates of 50% for fraternal brothers. This was popularly dubbed the "gay gene" in the media, causing significant controversy. In 1998, Sanders et al. reported on their similar study, in which they found that 13% of uncles of gay brothers on the maternal side were homosexual, compared with 6% on the paternal side.
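
The comparison of thirty-three concordant pairs out of forty against an expected rate of 50% can be illustrated with a simple one-sided binomial calculation. This is a sketch for intuition only: it is not the linkage statistic used in the original study, it treats the forty pairs as independent trials, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


# Probability of seeing 33 or more concordant pairs out of 40 if the true rate were 50%.
p_value = binomial_upper_tail(33, 40)
print(p_value)
```

A very small tail probability indicates that a result this extreme would be unlikely under the 50% expectation, although it says nothing about how the pairs were selected for testing.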

A later analysis by Hu et al. replicated and refined the earlier findings. This study revealed that 67% of gay brothers in a new saturated sample shared a marker on the X chromosome at Xq28. Two other studies (Bailey et al., 1999; McKnight and Malcolm, 2000) failed to find a preponderance of gay relatives in the maternal line of homosexual men. One study by Rice et al. in 1999 failed to replicate the Xq28 linkage results. Meta-analysis of all available linkage data indicates a significant link to Xq28, but also indicates that additional genes must be present to account for the full heritability of sexual orientation.

Mustanski et al. (2005) performed a full-genome scan (instead of just an X chromosome scan) on individuals and families previously reported on in Hamer et al. (1993) and Hu et al. (1995), as well as additional new subjects. In the full sample they did not find linkage to Xq28.

Results from the first large, comprehensive multi-center genetic linkage study of male sexual orientation were reported by an independent group of researchers at the American Society of Human Genetics in 2012. The study population included 409 independent pairs of gay brothers, who were analyzed with over 300,000 single-nucleotide polymorphism markers. The data strongly replicated Hamer's Xq28 findings as determined by both two-point and multipoint (MERLIN) LOD score mapping. Significant linkage was also detected in the pericentromeric region of chromosome 8, overlapping with one of the regions detected in the Hamer lab's previous genomewide study. The authors concluded that "our findings, taken in context with previous work, suggest that genetic variation in each of these regions contributes to development of the important psychological trait of male sexual orientation". Female sexual orientation does not seem to be linked to Xq28, though it does appear moderately heritable.

In addition to sex chromosomal contribution, a potential autosomal genetic contribution to the development of homosexual orientation has also been suggested. In a study population composed of more than 7000 participants, Ellis et al. (2008) found a statistically significant difference in the frequency of blood type A between homosexuals and heterosexuals. They also found that "unusually high" proportions of homosexual males and homosexual females were Rh negative in comparison to heterosexuals. As both blood type and Rh factor are genetically inherited traits controlled by alleles located on chromosome 9 and chromosome 1 respectively, the study indicates a potential link between genes on autosomes and homosexuality.

The biology of sexual orientation has been studied in detail in several animal model systems. In the common fruit fly Drosophila melanogaster, the complete pathway of sexual differentiation of the brain and the behaviors it controls is well established in both males and females, providing a concise model of biologically controlled courtship. In mammals, a group of geneticists at the Korea Advanced Institute of Science and Technology bred female mice specifically lacking a particular gene related to sexual behavior. Without the gene, the mice exhibited masculine sexual behavior and attraction toward the urine of other female mice. Those mice that retained the gene, fucose mutarotase (FucM), were attracted to male mice.

In interviews with the press, researchers have pointed out that the evidence of genetic influences should not be equated with genetic determinism. According to Dean Hamer and Michael Bailey, genetic aspects are only one of the multiple causes of homosexuality.

In 2017, Scientific Reports published an article with a genome wide association study on male sexual orientation. The research consisted of 1,077 homosexual men and 1,231 heterosexual men. A gene named SLITRK6 on chromosome 13 was identified. The research supports another study which had been done by the neuroscientist Simon LeVay. LeVay's research suggested that the hypothalamus of gay men is different from straight men. The SLITRK6 is active in the mid-brain where the hypothalamus is. The researchers found that the thyroid stimulating hormone receptor (TSHR) on chromosome 14 shows sequence differences between gay and straight men. Graves' disease is associated with TSHR abnormalities, with previous research indicating that Graves' disease is more common in gay men than in straight men. Research indicated that gay people have lower body weight than straight people. It had been suggested that the overactive TSHR hormone lowered body weight in gay people, though this remains unproven.

In 2018, Ganna et al. performed another genome-wide association study on sexual orientation of men and women with data from 26,890 people who had at least one same-sex partner and 450,939 controls. The data in the study was meta-analyzed and obtained from the UK Biobank study and 23andMe. The researchers identified four variants more common in people who reported at least one same-sex experience on chromosomes 7, 11, 12, and 15. The variants on chromosomes 11 and 15 were specific to men, with the variant on chromosome 11 located in an olfactory gene and the variant on chromosome 15 having previously been linked to male-pattern baldness. The four variants were also correlated with mood and mental health disorders; major depressive disorder and schizophrenia in men and women, and bipolar disorder in women. However, none of the four variants could reliably predict sexual orientation.

In August 2019, a genome-wide association study of 493,001 individuals concluded that hundreds or thousands of genetic variants underlie homosexual behavior in both sexes, with 5 variants in particular being significantly associated. Some of these variants had sex-specific effects, and two of these variants suggested links to biological pathways that involve sex hormone regulation and olfaction. All the variants together captured between 8 and 25% of the variation in individual differences in homosexual behavior. These genes partly overlap with those for several other traits, including openness to experience and risk-taking behavior. Additional analyses suggested that sexual behavior, attraction, identity, and fantasies are influenced by a similar set of genetic variants. They also found that the genetic effects that differentiate heterosexual from homosexual behavior are not the same as those that differ among nonheterosexuals with lower versus higher proportions of same-sex partners, which suggests that there is no single continuum from heterosexual to homosexual preference, as suggested by the Kinsey scale.

In October 2021, another research paper reported that genetic factors influence the development of same-sex sexual behavior. A two-stage genome-wide association study (GWAS) with a total sample of 1478 homosexual males and 3313 heterosexual males in Han Chinese populations identified two genetic loci (FMR1NB and ZNF536) showing consistent association with male sexual orientation.

Epigenetics studies

A study suggests linkage between a mother's genetic make-up and homosexuality of her sons. Women have two X chromosomes, one of which is "switched off". The inactivation of the X chromosome occurs randomly throughout the embryo, resulting in cells that are mosaic with respect to which chromosome is active. In some cases though, it appears that this switching off can occur in a non-random fashion. Bocklandt et al. (2006) reported that, in mothers of homosexual men, the number of women with extreme skewing of X chromosome inactivation is significantly higher than in mothers without gay sons. 13% of mothers with one gay son, and 23% of mothers with two gay sons, showed extreme skewing, compared to 4% of mothers without gay sons.

Birth order

Blanchard and Klassen (1997) reported that each additional older brother increases the odds of a man being gay by 33%. This is now "one of the most reliable epidemiological variables ever identified in the study of sexual orientation". To explain this finding, it has been proposed that male fetuses provoke a maternal immune reaction that becomes stronger with each successive male fetus. Under this maternal immunization hypothesis (MIH), the process begins when cells from a male fetus enter the mother's circulation during pregnancy or while giving birth. Male fetuses produce H-Y antigens which are "almost certainly involved in the sexual differentiation of vertebrates". These Y-linked proteins would not be recognized by the mother's immune system because she is female, causing her to develop antibodies which would travel through the placental barrier into the fetal compartment. From here, the anti-male antibodies would then cross the blood/brain barrier (BBB) of the developing fetal brain, altering sex-dimorphic brain structures relative to sexual orientation, increasing the likelihood that the exposed son will be more attracted to men than women. It is this antigen which maternal H-Y antibodies are proposed to both react to and 'remember'. Successive male fetuses are then attacked by H-Y antibodies which somehow decrease the ability of H-Y antigens to perform their usual function in brain masculinization.
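An increase in odds is not the same as an increase in probability. As a minimal sketch, assuming the 33% figure applies multiplicatively to the odds for each older brother, the following converts a baseline probability to odds, applies the multiplier, and converts back. The 2% baseline is a hypothetical value chosen only for illustration, and the function name is likewise illustrative.

```python
def probability_with_older_brothers(baseline_probability, older_brothers,
                                    odds_multiplier=1.33):
    """Convert a baseline probability to odds, scale the odds once per
    older brother, and convert the result back to a probability."""
    if not 0.0 < baseline_probability < 1.0:
        raise ValueError("baseline_probability must be strictly between 0 and 1")
    if older_brothers < 0:
        raise ValueError("older_brothers must be non-negative")
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    odds = baseline_odds * odds_multiplier ** older_brothers
    return odds / (1.0 + odds)


# Hypothetical baseline of 2% (an assumption, not a figure from the study).
for k in range(4):
    p = probability_with_older_brothers(0.02, k)
    print(f"{k} older brothers: probability = {p:.4f}")
```

Because the multiplier acts on odds, the resulting probability rises by somewhat less than 33% per step, and the gap widens as the baseline grows.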

In 2017, researchers reported a possible biological mechanism for the tendency of gay men to have older brothers. They proposed that the neuroligin 4 Y-linked protein (NLGN4Y) is responsible for a later son being gay. They found that women had significantly higher anti-NLGN4Y levels than men. In addition, mothers of gay sons, particularly those with older brothers, had significantly higher anti-NLGN4Y levels than did the control samples of women, including mothers of heterosexual sons. The results suggest an association between a maternal immune response to NLGN4Y and subsequent sexual orientation in male offspring.

The fraternal birth order effect, however, does not apply to instances where a firstborn is homosexual.

Female fertility

In 2004, Italian researchers conducted a study of about 4,600 people who were the relatives of 98 homosexual and 100 heterosexual men. Female relatives of the homosexual men tended to have more offspring than those of the heterosexual men. Female relatives of the homosexual men on their mother's side tended to have more offspring than those on the father's side. The researchers concluded that there was genetic material being passed down on the X chromosome which promotes both fertility in the mother and homosexuality in her male offspring. The connections discovered would explain about 20% of the cases studied, indicating that this is a highly significant factor, but not the sole genetic factor determining sexual orientation.

Pheromone studies

Research conducted in Sweden has suggested that gay and straight men respond differently to two odors that are believed to be involved in sexual arousal. The research showed that when both heterosexual women and gay men are exposed to a testosterone derivative found in men's sweat, a region in the hypothalamus is activated. Heterosexual men, on the other hand, have a similar response to an estrogen-like compound found in women's urine. The conclusion is that sexual attraction, whether same-sex or opposite-sex oriented, operates similarly on a biological level. Researchers have suggested that this possibility could be further explored by studying young subjects to see if similar responses in the hypothalamus are found and then correlating these data with adult sexual orientation.

Studies of brain structure

A number of sections of the brain have been reported to be sexually dimorphic; that is, they vary between men and women. There have also been reports of variations in brain structure corresponding to sexual orientation. In 1990, Dick Swaab and Michel A. Hofman reported a difference in the size of the suprachiasmatic nucleus between homosexual and heterosexual men. In 1992, Allen and Gorski reported a difference related to sexual orientation in the size of the anterior commissure, but this research was refuted by numerous studies, one of which found that the entirety of the variation was caused by a single outlier.

Research on the physiologic differences between male and female brains is based on the idea that people have a male or a female brain, and that this mirrors the behavioral differences between the two sexes. Some researchers state that solid scientific support for this is lacking. Although consistent differences have been identified, including the size of the brain and of specific brain regions, male and female brains are very similar.

Sexually dimorphic nuclei in the anterior hypothalamus

LeVay also conducted some of this early research. He studied four groups of neurons in the hypothalamus called INAH1, INAH2, INAH3 and INAH4. This was a relevant area of the brain to study, because of evidence that it played a role in the regulation of sexual behaviour in animals, and because INAH2 and INAH3 had previously been reported to differ in size between men and women.

He obtained brains from 41 deceased hospital patients. The subjects were classified into three groups. The first group comprised 19 gay men who had died of AIDS-related illnesses. The second group comprised 16 men whose sexual orientation was unknown, but whom the researchers presumed to be heterosexual. Six of these men had died of AIDS-related illnesses. The third group was of six women whom the researchers presumed to be heterosexual. One of the women had died of an AIDS-related illness.

The HIV-positive people in the presumably heterosexual patient groups were all identified from medical records as either intravenous drug abusers or recipients of blood transfusions. Two of the men who identified as heterosexual specifically denied ever engaging in a homosexual sex act. The records of the remaining heterosexual subjects contained no information about their sexual orientation; they were assumed to have been primarily or exclusively heterosexual "on the basis of the numerical preponderance of heterosexual men in the population".

LeVay found no evidence for a difference between the groups in the size of INAH1, INAH2 or INAH4. However, the INAH3 group appeared to be twice as big in the heterosexual male group as in the gay male group; the difference was highly significant, and remained significant when only the six AIDS patients were included in the heterosexual group. The size of INAH3 in the homosexual men's brains was comparable to the size of INAH3 in the heterosexual women's brains.

William Byne and colleagues attempted to identify the size differences reported in INAH 1–4 by replicating the experiment using brain samples from other subjects: 14 HIV-positive homosexual males, 34 presumed heterosexual males (10 HIV-positive), and 34 presumed heterosexual females (9 HIV-positive). The researchers found a significant difference in INAH3 size between heterosexual men and heterosexual women. The INAH3 size of the homosexual men was apparently smaller than that of the heterosexual men, and larger than that of the heterosexual women, though neither difference quite reached statistical significance.

Byne and colleagues also weighed INAH3 and counted the number of neurons in it, tests not carried out by LeVay. The results for INAH3 weight were similar to those for INAH3 size; that is, the INAH3 weight for the heterosexual male brains was significantly larger than for the heterosexual female brains, while the results for the gay male group were between those of the other two groups but not quite significantly different from either. The neuron count also found a male-female difference in INAH3, but found no trend related to sexual orientation.

LeVay has said that Byne replicated his work, but that Byne employed a two-tailed statistical analysis, which is typically reserved for cases where no previous findings have reported the difference. LeVay has said that "given that my study had already reported a INAH3 to be smaller in gay men, a one tailed approach would have been more appropriate, and it would have yielded a significant difference [between heterosexual and homosexual men]".
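The distinction LeVay draws can be illustrated numerically. The following is a minimal sketch, assuming a test statistic that is approximately standard normal under the null hypothesis; the value z = 1.8 is hypothetical and is not taken from either study's data.

```python
import math


def two_tailed_p(z):
    """P(|Z| >= |z|) for a standard normal Z: a difference in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def one_tailed_p(z):
    """P(Z >= z) for a standard normal Z: a difference in the predicted direction only."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical z statistic (an assumption for illustration, not study data).
z = 1.8
alpha = 0.05

print(f"two-tailed p = {two_tailed_p(z):.4f}, significant: {two_tailed_p(z) < alpha}")
print(f"one-tailed p = {one_tailed_p(z):.4f}, significant: {one_tailed_p(z) < alpha}")
```

For a statistic in the predicted direction, the one-tailed p-value is half the two-tailed one, so a result that narrowly misses the 0.05 threshold under a two-tailed test can pass it under a one-tailed test. Whether the one-tailed test is justified depends on whether the direction of the difference was specified in advance.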

J. Michael Bailey has criticized LeVay's critics—describing the claim that the INAH-3 difference could be attributable to AIDS as "aggravating", since the "INAH-3 did not differ between the brains of straight men who died of AIDS and those who did not have the disease". Bailey has further criticized the second objection that was raised, that being gay might have somehow caused the difference in INAH-3, and not vice-versa, saying "the problem with this idea is that the hypothalamus appears to develop early. Not a single expert I have ever asked about LeVay's study thought it was plausible that sexual behavior caused the INAH-3 differences."

The SCN of homosexual males has been demonstrated to be larger (both the volume and the number of neurons are twice those in heterosexual males). These areas of the hypothalamus have not yet been explored in homosexual females or in bisexual males or females. Although the functional implications of such findings still have not been examined in detail, they cast serious doubt over the widely accepted Dörner hypothesis that homosexual males have a "female hypothalamus" and that the key mechanism of differentiating the "male brain from originally female brain" is the epigenetic influence of testosterone during prenatal development.

A 2010 study by Garcia-Falgueras and Swaab stated that "the fetal brain develops during the intrauterine period in the male direction through a direct action of testosterone on the developing nerve cells, or in the female direction through the absence of this hormone surge. In this way, our gender identity (the conviction of belonging to the male or female gender) and sexual orientation are programmed or organized into our brain structures when we are still in the womb. There is no indication that social environment after birth has an effect on gender identity or sexual orientation."

Ovine model

The domestic ram is used as an experimental model to study early programming of the neural mechanisms which underlie homosexuality. This approach stems from the observation that approximately 8% of domestic rams are sexually attracted to other rams (male-oriented), whereas the majority of rams are female-oriented. In many species, a prominent feature of sexual differentiation is the presence of a sexually dimorphic nucleus (SDN) in the preoptic hypothalamus, which is larger in males than in females.

Roselli et al. discovered an ovine SDN (oSDN) in the preoptic hypothalamus that is smaller in male-oriented rams than in female-oriented rams, but similar in size to the oSDN of females. Neurons of the oSDN show aromatase expression, which is also lower in male-oriented rams than in female-oriented rams, suggesting that sexual orientation is neurologically hard-wired and may be influenced by hormones. However, results failed to establish a role for neural aromatase in the sexual differentiation of brain and behavior in the sheep, due to the lack of defeminization of adult sexual partner preference or oSDN volume as a result of aromatase activity in the brain of the fetuses during the critical period. It is therefore more likely that oSDN morphology and homosexuality are programmed through an androgen receptor mechanism that does not involve aromatisation. Most of the data suggest that homosexual rams, like female-oriented rams, are masculinized and defeminized with respect to mounting, receptivity, and gonadotrophin secretion, but are not defeminized for sexual partner preferences, also suggesting that such behaviors may be programmed differently. Although the exact function of the oSDN is not fully known, its volume, length, and cell number seem to correlate with sexual orientation, and a dimorphism in its volume and cell number could bias the processing of cues involved in partner selection. More research is needed in order to understand the requirements and timing of the development of the oSDN and how prenatal programming affects the expression of mate choice in adulthood.

Childhood gender nonconformity

Childhood gender nonconformity is a strong predictor of adult sexual orientation that has been consistently replicated in research, and is thought to be strong evidence of a biological difference between heterosexuals and non-heterosexuals. A review authored by J. Michael Bailey states: "childhood gender nonconformity comprises the following phenomena among boys: cross-dressing, desiring to have long hair, playing with dolls, disliking competitive sports and rough play, preferring girls as playmates, exhibiting elevated separation anxiety, and desiring to be, or believing that one is, a girl. In girls, gender nonconformity comprises dressing like and playing with boys, showing interest in competitive sports and rough play, lacking interest in conventionally female toys such as dolls and makeup, and desiring to be a boy". This gender nonconformist behavior typically emerges at preschool age, although it is often evident as early as age 2. Children are only considered gender nonconforming if they persistently engage in a variety of these behaviors, as opposed to engaging in a behavior a few times or on occasion. It is also not a one-dimensional trait, but rather has varying degrees.

Children who grow up to be non-heterosexual were, on average, substantially more gender nonconforming in childhood. This is confirmed both in retrospective studies, where homosexuals, bisexuals and heterosexuals are asked about their gender-typical behavior in childhood, and in prospective studies, where highly gender nonconforming children are followed from childhood into adulthood to find out their sexual orientation. A review of retrospective studies that measured gender nonconforming traits estimated that 89% of homosexual men exceeded the heterosexual male median level of gender nonconformity, whereas just 2% of heterosexual men exceeded the homosexual median. For female sexual orientation, the figures were 81% and 12% respectively. A variety of other assessments such as childhood home videos, photos and reports of parents also confirm this finding. Critics of this research see this as confirming stereotypes; however, no study has ever demonstrated that this research has exaggerated childhood gender nonconformity. J. Michael Bailey argues that gay men often deny that they were gender nonconforming in childhood because they may have been bullied or maltreated by peers and parents for it, and because they often do not find femininity attractive in other gay males and thus would not want to acknowledge it in themselves. Additional research in Western cultures and non-Western cultures including Latin America, Asia, Polynesia, and the Middle East supports the validity of childhood gender nonconformity as a predictor of adult non-heterosexuality.

This research does not mean that all non-heterosexuals were gender nonconforming, but rather indicates that long before sexual attraction is known, non-heterosexuals, on average, are noticeably different from other children. There is little evidence that gender nonconforming children have been encouraged or taught to behave that way; rather, childhood gender nonconformity typically emerges despite conventional socialization. Medical experiments in which infant boys were sex reassigned and reared as girls did not make them feminine or attracted to males.

Boys who were surgically reassigned female

Between the 1960s and 2000, many newborn and infant boys were surgically reassigned as females if they were born with malformed penises, or if they lost their penises in accidents. Many surgeons believed such males would be happier being socially and surgically reassigned female. In all seven published cases that have provided sexual orientation information, the subjects grew up to be attracted to females. Six cases were exclusively attracted to females, with one case 'predominantly' attracted to females. In a review article in the journal Psychological Science in the Public Interest, six researchers including J. Michael Bailey state this establishes a strong case that male sexual orientation is partly established before birth:

This is the result we would expect if male sexual orientation were entirely due to nature, and it is opposite of the result expected if it were due to nurture, in which case we would expect that none of these individuals would be predominantly attracted to women. They show how difficult it is to derail the development of male sexual orientation by psychosocial means.

They further argue that this raises questions about the significance of the social environment on sexual orientation, stating, "If one cannot reliably make a male human become attracted to other males by cutting off his penis in infancy and rearing him as a girl, then what other psychosocial intervention could plausibly have that effect?" It is further stated that neither cloacal exstrophy (resulting in a malformed penis) nor surgical accidents are associated with abnormalities of prenatal androgens; thus, the brains of these individuals were male-organized at birth. Six of the seven identified as heterosexual males at follow-up, despite being surgically altered and reared as females, with researchers adding: "available evidence indicates that in such instances, parents are deeply committed to raising these children as girls and in as gender-typical a manner as possible." Bailey et al. describe these sex reassignments as 'the near-perfect quasi-experiment' in measuring the impact of 'nature' versus 'nurture' with regard to male homosexuality.

'Exotic becomes erotic' theory

Daryl Bem, a social psychologist at Cornell University, has theorized that the influence of biological factors on sexual orientation may be mediated by experiences in childhood. A child's temperament predisposes the child to prefer certain activities over others. Because of their temperament, which is influenced by biological variables such as genetic factors, some children will be attracted to activities that are commonly enjoyed by other children of the same gender. Others will prefer activities that are typical of another gender. This will make a gender-conforming child feel different from opposite-gender children, while gender-nonconforming children will feel different from children of their own gender. According to Bem, this feeling of difference will evoke psychological arousal when the child is near members of the gender which it considers as being 'different'. Bem theorizes that this psychological arousal will later be transformed into sexual arousal: children will become sexually attracted to the gender which they see as different ("exotic"). This proposal is known as the "exotic becomes erotic" theory. Wetherell et al. state that Bem "does not intend his model as an absolute prescription for all individuals, but rather as a modal or average explanation."

Two critiques of Bem's theory in the journal Psychological Review concluded that "studies cited by Bem and additional research show that [the] Exotic Becomes Erotic theory is not supported by scientific evidence." Bem was criticized for relying on a non-random sample of gay men from the 1970s (rather than collecting new data) and for drawing conclusions that appear to contradict the original data. An "examination of the original data showed virtually all respondents were familiar with children of both sexes", and that only 9% of gay men said that "none or only a few" of their friends were male, and most gay men (74%) reported having "an especially close friend of the same sex" during grade school. Further, "71% of gay men reported feeling different from other boys, but so did 38% of heterosexual men. The difference for gay men is larger, but still indicates that feeling different from same-sex peers was common for heterosexual men." Bem also acknowledged that gay men were more likely to have older brothers (the fraternal birth order effect), which appeared to contradict an unfamiliarity with males. Bem cited cross-cultural studies which also "appear to contradict the EBE theory assertion", such as the Sambia tribe in Papua New Guinea, which ritually enforced homosexual acts among teenagers; yet once these boys reached adulthood, only a small proportion of men continued to engage in homosexual behaviour - similar to levels observed in the United States. Additionally, Bem's model could be interpreted as implying that if one could change a child's behavior, one could change their sexual orientation, but most psychologists doubt this would be possible.

Neuroscientist Simon LeVay said that while Bem's theory was arranged in a "believable temporal order", it ultimately "lacks empirical support". Social psychologist Justin Lehmiller stated that Bem's theory has received praise "for the way it seamlessly links biological and environmental influences" and that there "is also some support for the model in the sense that childhood gender nonconformity is indeed one of the strongest predictors of adult homosexuality", but that the validity of the model "has been questioned on numerous grounds and scientists have largely rejected it."

Sexual orientation and evolution

General

Sexual practices that significantly reduce the frequency of heterosexual intercourse also significantly decrease the chances of successful reproduction, and for this reason, they would appear to be maladaptive in an evolutionary context following a simple Darwinian model (competition amongst individuals) of natural selection—on the assumption that homosexuality would reduce this frequency. Several theories have been advanced to explain this contradiction, and new experimental evidence has demonstrated their feasibility.

Some scholars have suggested that homosexuality is indirectly adaptive, by conferring a reproductive advantage in a non-obvious way on heterosexual siblings or their children, a hypothesised instance of kin selection. By way of analogy, the allele (a particular version of a gene) which causes sickle-cell anemia when two copies are present, also confers resistance to malaria with a lesser form of anemia when one copy is present (this is called heterozygous advantage).

Brendan Zietsch of the Queensland Institute of Medical Research proposes the alternative theory that men exhibiting female traits become more attractive to females and are thus more likely to mate, provided the genes involved do not drive them to complete rejection of heterosexuality.

The authors of a 2008 study stated that "There is considerable evidence that human sexual orientation is genetically influenced, so it is not known how homosexuality, which tends to lower reproductive success, is maintained in the population at a relatively high frequency." They hypothesized that "while genes predisposing to homosexuality reduce homosexuals' reproductive success, they may confer some advantage in heterosexuals who carry them". Their results suggested that "genes predisposing to homosexuality may confer a mating advantage in heterosexuals, which could help explain the evolution and maintenance of homosexuality in the population". However, in the same study, the authors noted that "nongenetic alternative explanations cannot be ruled out" as a reason for the heterosexual in the homosexual-heterosexual twin pair having more partners, specifically citing "social pressure on the other twin to act in a more heterosexual way" (and thus seek out a greater number of sexual partners) as an example of one alternative explanation. The study acknowledges that a large number of sexual partners may not lead to greater reproductive success, specifically noting there is an "absence of evidence relating the number of sexual partners and actual reproductive success, either in the present or in our evolutionary past".

The heterosexual advantage hypothesis was given strong support by the 2004 Italian study demonstrating increased fecundity in the female matrilineal relatives of gay men. As originally pointed out by Hamer, even a modest increase in reproductive capacity in females carrying a "gay gene" could easily account for its maintenance at high levels in the population.

Gay uncle hypothesis

The "gay uncle hypothesis" posits that people who themselves do not have children may nonetheless increase the prevalence of their family's genes in future generations by providing resources (e.g., food, supervision, defense, shelter) to the offspring of their closest relatives.

This hypothesis is an extension of the theory of kin selection, which was originally developed to explain apparent altruistic acts which seemed to be maladaptive. The initial concept was suggested by J. B. S. Haldane in 1932 and later elaborated by many others including John Maynard Smith, W. D. Hamilton and Mary Jane West-Eberhard. This concept was also used to explain the patterns of certain social insects where most of the members are non-reproductive.

Vasey and VanderLaan (2010) tested the theory on the Pacific island of Samoa, where they studied women, straight men, and the fa'afafine, men who prefer other men as sexual partners and are accepted within the culture as a distinct third gender category. Vasey and VanderLaan found that the fa'afafine said they were significantly more willing to help kin, yet much less interested in helping children who are not family, providing the first evidence to support the kin selection hypothesis.

The hypothesis is consistent with other studies on homosexuality, which show that it is more prevalent amongst both siblings and twins.

Vasey and VanderLaan (2011) provide evidence that if an adaptively designed avuncular male androphilic phenotype exists and its development is contingent on a particular social environment, then a collectivistic cultural context is insufficient, in and of itself, for the expression of such a phenotype.

Biological differences in gay men and lesbian women

Some studies have found correlations between physiology of people and their sexuality; these studies provide evidence which suggests that:

  • Gay men and straight women have, on average, equally proportioned brain hemispheres. Lesbian women and straight men have, on average, slightly larger right brain hemispheres.
  • The suprachiasmatic nucleus of the hypothalamus was found by Swaab and Hofman to be larger in gay men than in non-gay men; the suprachiasmatic nucleus is also known to be larger in men than in women.
  • Gay men report, on average, slightly longer and thicker penises than non-gay men.
  • The average size of the INAH 3 in the brains of gay men is approximately the same size as INAH 3 in women, which is significantly smaller, and the cells more densely packed, than in heterosexual men's brains.
  • The anterior commissure was found to be larger in gay men than in women and heterosexual men, but a subsequent study found no such difference.
  • The functioning of the inner ear and the central auditory system in lesbians and bisexual women is more like the functional properties found in men than in non-gay women (the researchers argued this finding was consistent with the prenatal hormonal theory of sexual orientation).
  • The startle response (eyeblink following a loud sound) is similarly masculinized in lesbians and bisexual women.
  • Gay and non-gay people's brains respond differently to two putative sex pheromones (AND, found in male armpit secretions, and EST, found in female urine).
  • The amygdala, a region of the brain, is more active in gay men than non-gay men when exposed to sexually arousing material.
  • Finger length ratios between the index and ring fingers have been reported to differ, on average, between non-gay and lesbian women.
  • Gay men and lesbians are significantly more likely to be left-handed or ambidextrous than non-gay men and women; Simon LeVay argues that because "[h]and preference is observable before birth... [t]he observation of increased non-right-handness in gay people is therefore consistent with the idea that sexual orientation is influenced by prenatal processes," perhaps heredity.
  • A study of over 50 gay men found that about 23% had counterclockwise hair whorl, as opposed to 8% in the general population. This may correlate with left-handedness.
  • Gay men have increased ridge density in the fingerprints on their left thumbs and little fingers.
  • The limbs and hands of gay men are shorter relative to height than those of the general population, but only among white men.

J. Michael Bailey has argued that the early childhood gender nonconforming behavior of homosexuals, as opposed to biological markers, is better evidence of homosexuality being an inborn trait. He argues that gay men are "punished much more than rewarded" for their childhood gender nonconformity, and that such behavior "emerges with no encouragement, and despite opposition", making it "the sine qua non of innateness".

Political aspects

Whether genetic or other physiological or psychological determinants form the basis of sexual orientation is a highly politicized issue. The Advocate, a U.S. gay and lesbian newsmagazine, reported in 1996 that 61% of its readers believed that "it would mostly help gay and lesbian rights if homosexuality were found to be biologically determined". A cross-national study in the United States, the Philippines, and Sweden found that those who believed that "homosexuals are born that way" held significantly more positive attitudes toward homosexuality than those who believed that "homosexuals choose to be that way" or "learn to be that way".

Equal protection analysis in U.S. law determines when government requirements create a "suspect classification" of groups that is therefore eligible for heightened scrutiny. The determination rests on several factors, one of which is immutability.

Evidence that sexual orientation is biologically determined (and therefore perhaps immutable in the legal sense) would strengthen the legal case for heightened scrutiny of laws discriminating on that basis.

The perceived causes of sexual orientation have a significant bearing on the status of sexual minorities in the eyes of social conservatives. The Family Research Council, a conservative Christian think tank in Washington, D.C., argues in the book Getting It Straight that finding people are born gay "would advance the idea that sexual orientation is an innate characteristic, like race; that homosexuals, like African-Americans, should be legally protected against 'discrimination;' and that disapproval of homosexuality should be as socially stigmatized as racism. However, it is not true." On the other hand, some social conservatives such as Reverend Robert Schenck have argued that people can accept any scientific evidence while still morally opposing homosexuality. National Organization for Marriage board member and fiction writer Orson Scott Card has supported biological research on homosexuality, writing that "our scientific efforts in regard to homosexuality should be to identify genetic and uterine causes... so that the incidence of this dysfunction can be minimized.... [However, this should not be seen] as an attack on homosexuals, a desire to 'commit genocide' against the homosexual community... There is no 'cure' for homosexuality because it is not a disease. There are, however, different ways of living with homosexual desires."

Some advocates for the rights of sexual minorities resist what they perceive as attempts to pathologise or medicalise 'deviant' sexuality, and choose to fight for acceptance in a moral or social realm. The journalist Chandler Burr has stated that "[s]ome, recalling earlier psychiatric 'treatments' for homosexuality, discern in the biological quest the seeds of genocide. They conjure up the specter of the surgical or chemical 'rewiring' of gay people, or of abortions of fetal homosexuals who have been hunted down in the womb." LeVay has said in response to letters from gays and lesbians making such criticisms that the research "has contributed to the status of gay people in society".

Equality (mathematics)

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Equality_...