Wednesday, December 6, 2023

Recent African origin of modern humans

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Recent_African_origin_of_modern_humans
Successive dispersals of
  Homo erectus greatest extent (yellow),
  Homo neanderthalensis greatest extent (ochre) and
  Homo sapiens (red).
Expansion of early modern humans from Africa through the Near East

In paleoanthropology, the recent African origin of modern humans or the "Out of Africa" theory (OOA) is the most widely accepted model of the geographic origin and early migration of anatomically modern humans (Homo sapiens). It follows the early expansions of hominins out of Africa, accomplished by Homo erectus and then Homo neanderthalensis.

The model proposes a "single origin" of Homo sapiens in the taxonomic sense, precluding parallel evolution in other regions of traits considered anatomically modern, but not precluding multiple admixture between H. sapiens and archaic humans in Europe and Asia. H. sapiens most likely developed in the Horn of Africa between 300,000 and 200,000 years ago, although an alternative hypothesis argues that diverse morphological features of H. sapiens appeared locally in different parts of Africa and converged due to gene flow between different populations within the same period. The "recent African origin" model proposes that all modern non-African populations are substantially descended from populations of H. sapiens that left Africa after that time.

There were at least several "out-of-Africa" dispersals of modern humans, possibly beginning as early as 270,000 years ago, including 215,000 years ago to at least Greece, and certainly via northern Africa and the Arabian Peninsula about 130,000 to 115,000 years ago. There is evidence that modern humans had reached China around 80,000 years ago. Practically all of these early waves seem to have gone extinct or retreated, and present-day humans outside Africa descend mainly from a single expansion about 70,000–50,000 years ago.

The most significant "recent" wave out of Africa took place about 70,000–50,000 years ago, via the so-called "Southern Route", spreading rapidly along the coast of Asia and reaching Australia by around 65,000–50,000 years ago, while Europe was populated by an early offshoot which settled the Near East and Europe less than 55,000 years ago. Some researchers question the earlier Australian dates and place the arrival of humans there at 50,000 years ago at the earliest, while others have suggested that these first settlers of Australia may represent an older wave, predating the more significant out-of-Africa migration, and thus may not be ancestral to the region's later inhabitants.

In the 2010s, studies in population genetics uncovered evidence of interbreeding that occurred between H. sapiens and archaic humans in Eurasia, Oceania and Africa, indicating that modern population groups, while mostly derived from early H. sapiens, are to a lesser extent also descended from regional variants of archaic humans.

Proposed waves

Layer sequence at Ksar Akil in the Levantine corridor, and discovery of two fossils of Homo sapiens, dated to 40,800 to 39,200 years BP for "Egbert", and 42,400–41,700 BP for "Ethelruda".

"Recent African origin", or Out of Africa II, refers to the migration of anatomically modern humans (Homo sapiens) out of Africa after their emergence at c. 300,000 to 200,000 years ago, in contrast to "Out of Africa I", which refers to the migration of archaic humans from Africa to Eurasia from before 1.8 and up to 0.5 million years ago. Omo-Kibish I (Omo I) from southern Ethiopia is the oldest anatomically modern Homo sapiens skeleton currently known (around 233,000 years old). There are even older Homo sapiens fossils from Jebel Irhoud in Morocco which exhibit a mixture of modern and archaic features at around 315,000 years old.

Since the beginning of the 21st century, the picture of "recent single-origin" migrations has become significantly more complex, due to the discovery of modern-archaic admixture and the increasing evidence that the "recent out-of-Africa" migration took place in waves over a long time. As of 2010, there were two main accepted dispersal routes for the out-of-Africa migration of early anatomically modern humans, the "Northern Route" (via Nile Valley and Sinai) and the "Southern Route" via the Bab-el-Mandeb strait.

  • Posth et al. (2017) suggest that early Homo sapiens, or "another species in Africa closely related to us," might have first migrated out of Africa around 270,000 years ago based on the closer affinity within Neanderthals' mitochondrial genomes to Homo sapiens than Denisovans.
  • Fossil evidence also points to an early Homo sapiens migration with the oldest known fossil coming from Apidima Cave in Greece and dated at 210,000 years ago. Finds at Misliya cave, which include a partial jawbone with eight teeth, have been dated to around 185,000 years ago. Layers dating from between 250,000 and 140,000 years ago in the same cave contained tools of the Levallois type which could put the date of the first migration even earlier if the tools can be associated with the modern human jawbone finds.
  • An eastward dispersal from Northeast Africa to Arabia 150,000–130,000 years ago is based on the stone tool finds at Jebel Faya dated to 127,000 years ago (discovered in 2011), although fossil evidence in the area is significantly later at 85,000 years ago. Possibly related to this wave are the finds from Zhirendong cave, Southern China, dated to more than 100,000 years ago. Other evidence of modern human presence in China has been dated to 80,000 years ago.
  • The most significant out of Africa dispersal took place around 50,000–70,000 years ago via the so-called Southern Route, either before or after the Toba event, which happened between 69,000 and 77,000 years ago. This dispersal followed the southern coastline of Asia and reached Australia around 65,000–50,000 years ago or according to some research, by 50,000 years ago at earliest. Western Asia was "re-occupied" by a different derivation from this wave around 50,000 years ago and Europe was populated from Western Asia beginning around 43,000 years ago.
  • Wells (2003) describes an additional wave of migration after the southern coastal route, a northern migration into Europe about 45,000 years ago. This possibility is ruled out by Macaulay et al. (2005) and Posth et al. (2016), who argue for a single coastal dispersal, with an early offshoot into Europe.

Northern Route dispersal

Known archaeological remains of anatomically modern humans in Europe and Africa, directly dated (calibrated carbon dates as of 2013).

Beginning 135,000 years ago, tropical Africa experienced megadroughts which drove humans from the land and towards the sea shores, and forced them to cross over to other continents.

Fossils of early Homo sapiens were found in Qafzeh and Es-Skhul Caves in Israel and have been dated to 80,000 to 120,000 years ago. These humans seem to have either become extinct or retreated to Africa 70,000 to 80,000 years ago, possibly replaced by southbound Neanderthals escaping the colder regions of ice-age Europe. Hua Liu et al. analyzed autosomal microsatellite markers dating to about 56,000 years ago. They interpret the fossils as representing an isolated early offshoot that retreated to Africa.

The discovery of stone tools in the United Arab Emirates in 2011 at the Faya-1 site in Mleiha, Sharjah, indicated the presence of modern humans at least 125,000 years ago, leading to a resurgence of the "long-neglected" North African route. Understanding of the role of the Arabian dispersal began to change following results from archaeological and genetic studies stressing the importance of southern Arabia as a corridor for human expansions out of Africa.

In Oman, a site was discovered by Bien Joven in 2011 containing more than 100 surface scatters of stone tools belonging to the late Nubian Complex, known previously only from archaeological excavations in the Sudan. Two optically stimulated luminescence age estimates placed the Arabian Nubian Complex at approximately 106,000 years old. This provides evidence for a distinct Stone Age technocomplex in southern Arabia, around the earlier part of the Marine Isotope Stage 5.

According to Kuhlwilm and his co-authors, Neanderthals contributed genetically to modern humans then living outside of Africa around 100,000 years ago: humans which had already split off from other modern humans around 200,000 years ago, and this early wave of modern humans outside Africa also contributed genetically to the Altai Neanderthals. They found that "the ancestors of Neanderthals from the Altai Mountains and early modern humans met and interbred, possibly in the Near East, many thousands of years earlier than previously thought". According to co-author Ilan Gronau, "This actually complements archaeological evidence of the presence of early modern humans out of Africa around and before 100,000 years ago by providing the first genetic evidence of such populations." Similar genetic admixture events have been noted in other regions as well.

Southern Route dispersal

Coastal route

Red Sea crossing

Some 70,000–50,000 years ago, a subset of the bearers of mitochondrial haplogroup L3 migrated from East Africa into the Near East. It has been estimated that from a population of 2,000 to 5,000 individuals in Africa, only a small group, possibly as few as 150 to 1,000 people, crossed the Red Sea. The group that crossed the Red Sea travelled along the coastal route around Arabia and the Persian Plateau to India, which appears to have been the first major settling point. Wells (2003) argued for the route along the southern coastline of Asia, across about 250 kilometres (155 mi), reaching Australia by around 50,000 years ago.

Migration routes of modern humans, showing the northern route populating Western Eurasia, and the southern/coastal route populating Eastern Eurasia.

Today at the Bab-el-Mandeb straits, the Red Sea is about 20 kilometres (12 mi) wide, but 50,000 years ago sea levels were 70 m (230 ft) lower (owing to glaciation) and the water channel was much narrower. Though the straits were never completely closed, they were narrow enough to have enabled crossing using simple rafts, and there may have been islands in between. Shell middens 125,000 years old have been found in Eritrea, indicating that the diet of early humans included seafood obtained by beachcombing.

The dating of the Southern Dispersal is a matter of dispute. It may have happened either pre- or post-Toba, a catastrophic volcanic eruption that took place between 69,000 and 77,000 years ago at the site of present-day Lake Toba. Stone tools discovered below the layers of ash deposited in India may point to a pre-Toba dispersal, but the source of the tools is disputed. An indication for a post-Toba dispersal is haplogroup L3, which originated before the dispersal of humans out of Africa and can be dated to 60,000–70,000 years ago, "suggesting that humanity left Africa a few thousand years after Toba". Research published in 2012, showing slower than expected mutation rates in human DNA, indicated a revised dating for the migration of between 90,000 and 130,000 years ago. Some more recent research suggests that the ancestors of modern non-African populations migrated out of Africa around 50,000–65,000 years ago, similar to most previous estimates.
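The sensitivity of such genetic dates to the assumed mutation rate can be illustrated with a strict molecular clock, under which two lineages that split T years ago differ at d = 2μT sites per position, so T = d / (2μ). The following is a minimal sketch only: the function name and the divergence and rate values are hypothetical placeholders chosen for round numbers, not figures taken from the studies discussed above.

```python
def divergence_time_years(differences_per_site: float,
                          rate_per_site_per_year: float) -> float:
    """Strict molecular clock: T = d / (2 * mu).

    The factor of 2 reflects that mutations accumulate independently
    along both lineages after the split.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("mutation rate must be positive")
    return differences_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical inputs (assumptions for illustration, not measured values).
observed_divergence = 0.0012   # differences per site between two lineages
faster_rate = 1.0e-8           # substitutions per site per year
slower_rate = 0.5e-8           # half the faster rate

t_fast = divergence_time_years(observed_divergence, faster_rate)
t_slow = divergence_time_years(observed_divergence, slower_rate)
print(round(t_fast), round(t_slow))
```

With these placeholder inputs the two estimates come out at about 60,000 and 120,000 years: halving the assumed rate doubles the inferred date, which is why a revised mutation rate can shift a migration estimate so substantially.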

Western Asia

After the fossils dated to 80,000 to 120,000 years ago from Qafzeh and Es-Skhul Caves in Israel, there are no H. sapiens fossils in the Levant until the Manot 1 fossil from Manot Cave in Israel, dated to 54,700 years ago, though the dating was questioned by Groucutt et al. (2015). The lack of fossils and stone tool industries that can be safely associated with modern humans in the Levant has been taken to suggest that, until around 55,000 years ago, modern humans were outcompeted by Neanderthals, who would have placed a barrier on modern human dispersal out of Africa through the Northern Route. Climate reconstructions also support a Southern Route dispersal of modern humans, as the Bab-el-Mandeb strait experienced a climate more conducive to human migration than the northern landbridge to the Levant during the major human dispersal out of Africa.

Oceania

It is thought that Australia was inhabited around 65,000–50,000 years ago. As of 2017, the earliest evidence of humans in Australia is at least 65,000 years old, while McChesney stated that

...genetic evidence suggests that a small band with the marker M168 migrated out of Africa along the coasts of the Arabian Peninsula and India, through Indonesia, and reached Australia very early, between 60,000 and 50,000 years ago. This very early migration into Australia is also supported by Rasmussen et al. (2011).

Fossils from Lake Mungo, Australia, have been dated to about 42,000 years ago. Other fossils from a site called Madjedbebe have been dated to at least 65,000 years ago, though some researchers doubt this early estimate and date the Madjedbebe fossils at about 50,000 years ago at the oldest.

Phylogenetic data suggests that an early Eastern Eurasian (Eastern non-African) meta-population trifurcated somewhere in eastern South Asia, and gave rise to the Australo-Papuans, the Ancient Ancestral South Indians (AASI), as well as East/Southeast Asians, although Papuans may have also received some gene flow from an earlier group (xOoA), around 2%, next to additional archaic admixture in the Sahul region.

According to one study, Papuans could have either formed from a mixture between an East Eurasian lineage and a lineage basal to West and East Asians, or as a sister lineage of East Asians with or without a minor basal OoA or xOoA contribution.

A Holocene hunter-gatherer sample (Leang_Panninge) from South Sulawesi was found to be genetically in between East-Eurasians and Australo-Papuans. The sample could be modeled as ~50% Papuan-related and ~50% Basal-East Asian-related (Andamanese Onge or Tianyuan). The authors concluded that Basal-East Asian ancestry was far more widespread and the peopling of Insular Southeast Asia and Oceania was more complex than previously anticipated.

PCA calculated on present-day and ancient individuals from eastern Eurasia and Oceania. PC1 (23.8%) distinguishes East-Eurasians and Australo-Melanesians, while PC2 (6.3%) differentiates East-Eurasians along a North to South cline.
Principal component analysis (PCA) of ancient and modern day individuals from worldwide populations. Oceanians (Aboriginal Australians and Papuans) are most differentiated from both East-Eurasians and West-Eurasians.

East and Southeast Asia

In China, the Liujiang man (Chinese: 柳江人) is among the earliest modern humans found in East Asia. The date most commonly attributed to the remains is 67,000 years ago. Because different dating techniques applied by different researchers have yielded highly variable results, the most widely accepted range of dates takes 67,000 BP as a minimum but does not rule out dates as old as 159,000 BP. Liu, Martinón-Torres et al. (2015) claim that modern human teeth have been found in China dating to at least 80,000 years ago.

Tianyuan man from China has a probable date range between 38,000 and 42,000 years ago, while Liujiang man from the same region has a probable date range between 67,000 and 159,000 years ago. According to 2013 DNA tests, Tianyuan man is related "to many present-day Asians and Native Americans". Tianyuan is similar in morphology to Liujiang man, and some Jōmon period modern humans found in Japan, as well as modern East and Southeast Asians.

A 2021 study of the population history of Eastern Eurasia concluded that distinctive Basal-East Asian (East-Eurasian) ancestry originated in Mainland Southeast Asia at ~50,000 BC from a distinct southern Himalayan route, and expanded through multiple migration waves southwards and northwards respectively.

Genetic studies concluded that Native Americans descended from a single founding population that initially split from a Basal-East Asian source population in Mainland Southeast Asia around 36,000 years ago, at the same time at which the proper Jōmon people split from Basal-East Asians, either together with Ancestral Native Americans or during a separate expansion wave. They also show that the basal northern and southern Native American branches, to which all other Indigenous peoples belong, diverged around 16,000 years ago. An indigenous American sample from 16,000 BC in Idaho, which is craniometrically similar to modern Native Americans as well as Paleosiberians, was found to have largely East-Eurasian ancestry and showed high affinity with contemporary East Asians, as well as Jōmon period samples of Japan, confirming that Ancestral Native Americans split from an East-Eurasian source population in Eastern Siberia.

Europe

According to Macaulay et al. (2005), an early offshoot from the southern dispersal with haplogroup N followed the Nile from East Africa, heading northwards and crossing into Asia through the Sinai. This group then branched, some moving into Europe and others heading east into Asia. This hypothesis is supported by the relatively late date of the arrival of modern humans in Europe as well as by archaeological and DNA evidence. Based on an analysis of 55 human mitochondrial genomes (mtDNAs) of hunter-gatherers, Posth et al. (2016) argue for a "rapid single dispersal of all non-Africans less than 55,000 years ago."

Genetic reconstruction

Mitochondrial haplogroups

Within Africa

Map of early diversification of modern humans according to mitochondrial population genetics (see: Haplogroup L).

The first lineage to branch off from Mitochondrial Eve was L0. This haplogroup is found in high proportions among the San of Southern Africa and the Sandawe of East Africa. It is also found among the Mbuti people. These groups branched off early in human history and have remained relatively genetically isolated since then. Haplogroups L1, L2, and L3 are descendants of L1–L6, and are largely confined to Africa. The macro haplogroups M and N, which are the lineages of the rest of the world outside Africa, descend from L3. L3 is about 70,000 years old, while haplogroups M and N are about 65,000–55,000 years old. The relationship between such gene trees and demographic history is still debated when applied to dispersals.

Of all the lineages present in Africa, the female descendants of only one lineage, mtDNA haplogroup L3, are found outside Africa. If there had been several migrations, one would expect descendants of more than one lineage to be found. L3's female descendants, the M and N haplogroup lineages, are found in very low frequencies in Africa (although haplogroup M1 populations are very ancient and diversified in North and North-east Africa) and appear to be more recent arrivals. A possible explanation is that these mutations occurred in East Africa shortly before the exodus and became the dominant haplogroups thereafter by means of the founder effect. Alternatively, the mutations may have arisen shortly afterwards.

Southern Route and haplogroups M and N

Overview map of the peopling of the world by anatomically modern humans

Results from mtDNA collected from aboriginal Malaysians called Orang Asli indicate that the haplogroups M and N share characteristics with original African groups from approximately 85,000 years ago, and share characteristics with sub-haplogroups found in coastal south-east Asian regions, such as Australasia, the Indian subcontinent and throughout continental Asia, which had dispersed and separated from their African progenitor approximately 65,000 years ago. This southern coastal dispersal would have occurred before the dispersal through the Levant approximately 45,000 years ago. This hypothesis attempts to explain why haplogroup N is predominant in Europe and why haplogroup M is absent in Europe. Evidence of the coastal migration is thought to have been destroyed by the rise in sea levels during the Holocene epoch. Alternatively, a small European founder population that had expressed haplogroup M and N at first, could have lost haplogroup M through random genetic drift resulting from a bottleneck (i.e. a founder effect).
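The drift explanation in the last sentence above can be made concrete with a small simulation. The sketch below uses a Wright-Fisher model of a maternally inherited marker with two types (M and N) in a founder group of constant size; the model choice, the function names, and all parameter values are assumptions made for illustration and are not drawn from the studies cited.

```python
import random


def simulate_haplogroup_drift(n_females, initial_m_freq, generations, seed):
    """Return the frequency of haplogroup M per generation under neutral drift.

    Wright-Fisher assumption: each generation has n_females daughters, and
    each daughter inherits her mtDNA from a mother drawn at random from the
    previous generation. The run stops early once M is lost or fixed.
    """
    rng = random.Random(seed)
    m_count = round(n_females * initial_m_freq)
    trajectory = [m_count / n_females]
    for _generation in range(generations):
        freq = m_count / n_females
        m_count = sum(1 for _daughter in range(n_females) if rng.random() < freq)
        trajectory.append(m_count / n_females)
        if m_count == 0 or m_count == n_females:
            break
    return trajectory


def fraction_losing_m(n_females, initial_m_freq, generations, replicates):
    """Fraction of independent replicate populations in which M is lost."""
    lost = 0
    for seed in range(replicates):
        final = simulate_haplogroup_drift(n_females, initial_m_freq, generations, seed)[-1]
        if final == 0.0:
            lost += 1
    return lost / replicates


# Hypothetical founder group: 50 women, half carrying M and half carrying N.
print(fraction_losing_m(n_females=50, initial_m_freq=0.5, generations=2000, replicates=200))
```

Under neutral drift the probability that a type is eventually lost equals one minus its starting frequency, so in this toy setting roughly half of the replicate populations should end up without M. The smaller the founder group, the faster that loss or fixation occurs.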

The group that crossed the Red Sea travelled along the coastal route around Arabia and Persia until reaching India. Haplogroup M is found in high frequencies along the southern coastal regions of Pakistan and India and it has the greatest diversity in India, indicating that it is here where the mutation may have occurred. Sixty percent of the Indian population belong to Haplogroup M. The indigenous people of the Andaman Islands also belong to the M lineage. The Andamanese are thought to be offshoots of some of the earliest inhabitants in Asia because of their long isolation from the mainland. They are evidence of the coastal route of early settlers that extends from India to Thailand and Indonesia all the way to eastern New Guinea. Since M is found in high frequencies in highlanders from New Guinea and the Andamanese and New Guineans have dark skin and Afro-textured hair, some scientists think they are all part of the same wave of migrants who departed across the Red Sea ~60,000 years ago in the Great Coastal Migration. The proportion of haplogroup M increases eastwards from Arabia to India; in eastern India, M outnumbers N by a ratio of 3:1. Crossing into Southeast Asia, haplogroup N (mostly in the form of derivatives of its R subclade) reappears as the predominant lineage. M is predominant in East Asia, but amongst Indigenous Australians, N is the more common lineage. This haphazard distribution of Haplogroup N from Europe to Australia can be explained by founder effects and population bottlenecks.

The earliest-branching non-African paternal lineages (C, D, F) after the Out-of-Africa event (a), and their deepest divergence among modern day East or Southeast Asia (b), suggesting rapid coastal expansions. Simplified Y-chromosome tree is shown as reference for colours.

Autosomal DNA

A 2002 study of African, European, and Asian populations found greater genetic diversity among Africans than among Eurasians, and that genetic diversity among Eurasians is largely a subset of that among Africans, supporting the out of Africa model. A large study by Coop et al. (2009) found evidence for natural selection in autosomal DNA outside of Africa. The study distinguishes non-African sweeps (notably KITLG variants associated with skin color), West-Eurasian sweeps (SLC24A5) and East-Asian sweeps (MC1R, relevant to skin color). Based on this evidence, the study concluded that human populations encountered novel selective pressures as they expanded out of Africa. MC1R and its relation to skin color had already been discussed by Harding et al. (2000), p. 1355. According to this study, Papua New Guineans continued to be exposed to selection for dark skin color so that, although these groups are distinct from Africans in other places, the allele for dark skin color shared by contemporary Africans, Andamanese and New Guineans is an archaism. Endicott et al. (2003) suggest convergent evolution. A 2014 study by Gurdasani et al. indicates that the higher genetic diversity in Africa was further increased in some regions by relatively recent Eurasian migrations affecting parts of Africa.
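The "subset" pattern described at the start of this section can be illustrated with expected heterozygosity, H = 1 − Σ pᵢ², a common summary of genetic diversity at a single locus. The sketch below is a toy example under stated assumptions: the allele labels and counts are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """H = 1 - sum(p_i ** 2), where p_i is the frequency of allele i."""
    counts = Counter(alleles)
    total = len(alleles)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Invented allele samples at one locus (assumptions, not real data).
source_population = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1
founder_group = ["A"] * 3 + ["B"] * 1   # small group drawn from the source

h_source = expected_heterozygosity(source_population)
h_founders = expected_heterozygosity(founder_group)
founders_are_subset = set(founder_group) <= set(source_population)

print(round(h_source, 3), round(h_founders, 3), founders_are_subset)
```

In this toy case the founder group carries only alleles already present in the source population and has lower heterozygosity (0.375 against 0.7), which is the qualitative pattern the out of Africa model predicts for populations descended from a small emigrating group.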

Pathogen DNA

Another promising route towards reconstructing human genetic genealogy is via the JC virus (JCV), a type of human polyomavirus which is carried by 70–90 percent of humans and which is usually transmitted vertically, from parents to offspring, suggesting codivergence with human populations. For this reason, JCV has been used as a genetic marker for human evolution and migration. This method does not appear to be reliable for the migration out of Africa; in contrast to human genetics, JCV strains associated with African populations are not basal. From this Shackelton et al. (2006) conclude that either a basal African strain of JCV has become extinct or that the original infection with JCV post-dates the migration from Africa.

Admixture of archaic and modern humans

Evidence that archaic human species (descended from Homo heidelbergensis) interbred with modern humans outside of Africa was discovered in the 2010s. This concerns primarily Neanderthal admixture in all modern populations except for Sub-Saharan Africans, but evidence has also been presented for Denisova hominin admixture in Australasia (i.e. in Melanesians, Aboriginal Australians and some Negritos). The rate of Neanderthal admixture to European and Asian populations as of 2017 has been estimated at about 2–3%.

Archaic admixture in some Sub-Saharan African hunter-gatherer groups (Biaka Pygmies and San), derived from archaic hominins that broke away from the modern human lineage around 700,000 years ago, was discovered in 2011. The rate of admixture was estimated at 2%. Admixture from archaic hominins of still earlier divergence times, estimated at 1.2 to 1.3 million years ago, was found in Pygmies, Hadza and five Sandawe in 2012.

An analysis of Mucin 7 identified a highly divergent haplotype that is specific to African populations and has an estimated coalescence time with other variants of around 4.5 million years BP; it is inferred to derive from interbreeding between African modern and archaic humans.

A study published in 2020 found that the Yoruba and Mende populations of West Africa derive between 2% and 19% of their genome from an as-yet unidentified archaic hominin population that likely diverged before the split of modern humans and the ancestors of Neanderthals and Denisovans.

Stone tools

In addition to genetic analysis, Petraglia et al. also examine the small stone tools (microlithic materials) from the Indian subcontinent and explain the expansion of population based on the reconstruction of the paleoenvironment. They proposed that the stone tools could be dated to 35 ka in South Asia, and that the new technology might have been influenced by environmental change and population pressure.

History of the theory

Classical paleoanthropology

The frontispiece to Huxley's Evidence as to Man's Place in Nature (1863): the image compares the skeleton of a human to other apes.

The cladistic relationship of humans with the African apes was suggested by Charles Darwin after studying the behaviour of African apes, one of which was displayed at the London Zoo. The anatomist Thomas Huxley had also supported the hypothesis and suggested that African apes have a close evolutionary relationship with humans. These views were opposed by the German biologist Ernst Haeckel, who was a proponent of the Out of Asia theory. Haeckel argued that humans were more closely related to the primates of South-east Asia and rejected Darwin's African hypothesis.

In The Descent of Man, Darwin speculated that humans had descended from apes, which still had small brains but walked upright, freeing their hands for uses which favoured intelligence; he thought such apes were African:

In each great region of the world the living mammals are closely related to the extinct species of the same region. It is, therefore, probable that Africa was formerly inhabited by extinct apes closely allied to the gorilla and chimpanzee; and as these two species are now man's nearest allies, it is somewhat more probable that our early progenitors lived on the African continent than elsewhere. But it is useless to speculate on this subject, for an ape nearly as large as a man, namely the Dryopithecus of Lartet, which was closely allied to the anthropomorphous Hylobates, existed in Europe during the Upper Miocene period; and since so remote a period the earth has certainly undergone many great revolutions, and there has been ample time for migration on the largest scale.

— Charles Darwin, Descent of Man

In 1871, there were hardly any human fossils of ancient hominins available. Almost fifty years later, Darwin's speculation was supported when anthropologists began finding fossils of ancient small-brained hominins in several areas of Africa. The hypothesis of recent (as opposed to archaic) African origin developed in the 20th century. The "Recent African origin" of modern humans means "single origin" (monogenism) and has been used in various contexts as an antonym to polygenism. The debate in anthropology had swung in favour of monogenism by the mid-20th century. Isolated proponents of polygenism held forth in the mid-20th century, such as Carleton Coon, who thought as late as 1962 that H. sapiens arose five times from H. erectus in five places.

The possibility of an origin of L3 in Asia was proposed by Cabrera et al. (2018).
a: Exit of the L3 precursor to Eurasia. b: Return to Africa and expansion to Asia of basal L3 lineages with subsequent differentiation in both continents.

Multiregional origin hypothesis

The historical alternative to the recent origin model is the multiregional origin of modern humans, initially proposed by Milford Wolpoff in the 1980s. This view proposes that the derivation of anatomically modern human populations from H. erectus at the beginning of the Pleistocene, 1.8 million years BP, took place within a continuous world population. The hypothesis necessarily rejects the assumption of an infertility barrier between ancient Eurasian and African populations of Homo. The hypothesis was the subject of controversy during the late 1980s and the 1990s. The now-standard terminology of "recent-origin" and "Out of Africa" came into use in the context of this debate in the 1990s. Originally seen as an antithetical alternative to the recent origin model, the multiregional hypothesis in its original "strong" form is obsolete, while its various modified weaker variants have become variants of a view of "recent origin" combined with archaic admixture. Stringer (2014) distinguishes the original or "classic" Multiregional model, which existed from 1984 (its formulation) until 2003, from a "weak" post-2003 variant that has "shifted close to that of the Assimilation Model".

Mitochondrial analyses

In the 1980s, Allan Wilson together with Rebecca L. Cann and Mark Stoneking worked on genetic dating of the matrilineal most recent common ancestor of modern human populations (dubbed "Mitochondrial Eve"). To identify informative genetic markers for tracking human evolutionary history, Wilson concentrated on mitochondrial DNA (mtDNA), passed from mother to child. This DNA material mutates quickly, making it easy to plot changes over relatively short times. With his discovery that human mtDNA is genetically much less diverse than chimpanzee mtDNA, Wilson concluded that modern human populations had diverged recently from a single population while older human species such as Neanderthals and Homo erectus had become extinct. With the advent of archaeogenetics in the 1990s, the dating of mitochondrial and Y-chromosomal haplogroups became possible with some confidence. By 1999, estimates ranged around 150,000 years for the mt-MRCA and 60,000 to 70,000 years for the migration out of Africa.

From 2000 to 2003, there was controversy about the mitochondrial DNA of "Mungo Man 3" (LM3) and its possible bearing on the multiregional hypothesis. LM3 was found to have more than the expected number of sequence differences when compared to modern human DNA (CRS). Comparison of the mitochondrial DNA with that of ancient and modern Aboriginal Australians led to the conclusion that Mungo Man fell outside the range of genetic variation seen in Aboriginal Australians, and this was used to support the multiregional origin hypothesis. A reanalysis of LM3 and other ancient specimens from the area, published in 2016, showed it to be akin to modern Aboriginal Australian sequences, inconsistent with the results of the earlier study.

Y-chromosome analyses

Map of Y-Chromosome Haplogroups – Dominant haplogroups in pre-colonial populations with proposed migration routes

As current estimates on the male most recent common ancestor ("Y-chromosomal Adam" or Y-MRCA) converge with estimates for the age of anatomically modern humans, and well predate the Out of Africa migration, geographical origin hypotheses continue to be limited to the African continent.

The most basal lineages have been detected in West, Northwest and Central Africa, suggesting plausibility for the Y-MRCA living in the general region of "Central-Northwest Africa".

Another study finds a plausible placement in "the north-western quadrant of the African continent" for the emergence of the A1b haplogroup. The 2013 report of haplogroup A00 found among the Mbo people of western present-day Cameroon is also compatible with this picture.

The revision of Y-chromosomal phylogeny since 2011 has affected estimates for the likely geographical origin of Y-MRCA as well as estimates on time depth. By the same reasoning, future discovery of presently-unknown archaic haplogroups in living people would again lead to such revisions. In particular, the possible presence of between 1% and 4% Neanderthal-derived DNA in Eurasian genomes implies that the (unlikely) event of a discovery of a single living Eurasian male exhibiting a Neanderthal patrilineal line would immediately push back T-MRCA ("time to MRCA") to at least twice its current estimate. However, the discovery of a Neanderthal Y-chromosome by Mendez et al. was tempered by a 2016 study that suggests the extinction of Neanderthal patrilineages, as the lineage inferred from the Neanderthal sequence is outside of the range of contemporary human genetic variation. Questions of geographical origin would become part of the debate on Neanderthal evolution from Homo erectus.

De novo gene birth

From Wikipedia, the free encyclopedia
Novel genes can emerge from ancestrally non-genic regions through poorly understood mechanisms. (A) A non-genic region first gains transcription and an open reading frame (ORF), in either order, facilitating the birth of a de novo gene. The ORF is for illustrative purposes only, as de novo genes may also be multi-exonic, or lack an ORF, as with RNA genes. (B) Overprinting. A novel ORF is created that overlaps with an existing ORF, but in a different frame. (C) Exonization. A formerly intronic region becomes alternatively spliced as an exon, such as when repetitive sequences are acquired through retroposition and new splice sites are created through mutational processes. Overprinting and exonization may be considered as special cases of de novo gene birth.
Novel genes can be formed from ancestral genes through a variety of mechanisms. (A) Duplication and divergence. Following duplication, one copy experiences relaxed selection and gradually acquires novel function(s). (B) Gene fusion. A hybrid gene formed from some or all of two previously separate genes. Gene fusions can occur by different mechanisms; shown here is an interstitial deletion. (C) Gene fission. A single gene separates to form two distinct genes, such as by duplication and differential degeneration of the two copies. (D) Horizontal gene transfer. Genes acquired from other species by horizontal transfer undergo divergence and neofunctionalization. (E) Retroposition. Transcripts may be reverse transcribed and integrated as an intronless gene elsewhere in the genome. This new gene may then undergo divergence.

De novo gene birth is the process by which new genes evolve from non-coding DNA. De novo genes represent a subset of novel genes, and may be protein-coding or instead act as RNA genes. The processes that govern de novo gene birth are not well understood, although several models exist that describe possible mechanisms by which de novo gene birth may occur.

Although de novo gene birth may have occurred at any point in an organism's evolutionary history, ancient de novo gene birth events are difficult to detect. Most studies of de novo genes to date have thus focused on young genes, typically taxonomically restricted genes (TRGs) that are present in a single species or lineage, including so-called orphan genes, defined as genes that lack any identifiable homolog. It is important to note, however, that not all orphan genes arise de novo, and instead may emerge through fairly well characterized mechanisms such as gene duplication (including retroposition) or horizontal gene transfer followed by sequence divergence, or by gene fission/fusion.

Although de novo gene birth was once viewed as a highly unlikely occurrence, several unequivocal examples have now been described, and some researchers speculate that de novo gene birth could play a major role in evolutionary innovation. The 'pleiotropy-barrier' model suggests that newly evolved genes could facilitate evolutionary innovation due to their low (or no) pleiotropic effect.

History

As early as the 1930s, J. B. S. Haldane and others suggested that copies of existing genes may lead to new genes with novel functions. In 1970, Susumu Ohno published the seminal text Evolution by Gene Duplication. For some time subsequently, the consensus view was that virtually all genes were derived from ancestral genes, with François Jacob famously remarking in a 1977 essay that "the probability that a functional protein would appear de novo by random association of amino acids is practically zero."

In the same year, however, Pierre-Paul Grassé coined the term "overprinting" to describe the emergence of genes through the expression of alternative open reading frames (ORFs) that overlap preexisting genes. These new ORFs may be out of frame with or antisense to the preexisting gene. They may also be in frame with the existing ORF, creating a truncated version of the original gene, or represent 3’ extensions of an existing ORF into a nearby ORF. The first two types of overprinting may be thought of as a particular subtype of de novo gene birth; although overlapping with a previously coding region of the genome, the primary amino-acid sequence of the new protein is entirely novel and derived from a frame that did not previously contain a gene. The first examples of this phenomenon in bacteriophages were reported in a series of studies from 1976 to 1978, and since then numerous other examples have been identified in viruses, bacteria, and several eukaryotic species.

The phenomenon of exonization also represents a special case of de novo gene birth, in which, for example, often-repetitive intronic sequences acquire splice sites through mutation, leading to de novo exons. This was first described in 1994 in the context of Alu sequences found in the coding regions of primate mRNAs. Interestingly, such de novo exons are frequently found in minor splice variants, which may allow the evolutionary “testing” of novel sequences while retaining the functionality of the major splice variant(s).

Still, it was thought by some that most or all eukaryotic proteins were constructed from a constrained pool of “starter type” exons. Using the sequence data available at the time, a 1991 review estimated the number of unique, ancestral eukaryotic exons to be < 60,000, while in 1992 a piece was published estimating that the vast majority of proteins belonged to no more than 1,000 families. Around the same time, however, the sequence of chromosome III of the budding yeast Saccharomyces cerevisiae was released, representing the first time an entire chromosome from any eukaryotic organism had been sequenced. Sequencing of the entire yeast nuclear genome was then completed by early 1996 through a massive, collaborative international effort. In his review of the yeast genome project, Bernard Dujon noted that the unexpected abundance of genes lacking any known homologs was perhaps the most striking finding of the entire project.

In 2006 and 2007, a series of studies provided arguably the first documented examples of de novo gene birth that did not involve overprinting. These studies were conducted using the accessory gland transcriptomes of Drosophila yakuba and Drosophila erecta and they identified 20 putative lineage-restricted genes that appeared unlikely to have resulted from gene duplication. Levine and colleagues identified and confirmed five de novo candidate genes specific to Drosophila melanogaster and/or the closely related Drosophila simulans through a rigorous approach that combined bioinformatic and experimental techniques.

Since these initial studies, many groups have identified specific cases of de novo gene birth events in diverse organisms. The first de novo gene identified in yeast, BSC4, was described in S. cerevisiae in 2008. This gene shows evidence of purifying selection, is expressed at both the mRNA and protein levels, and when deleted is synthetically lethal with two other yeast genes, all of which indicate a functional role for the BSC4 gene product. Historically, one argument against the notion of widespread de novo gene birth is the evolved complexity of protein folding. Interestingly, Bsc4 was later shown to adopt a partially folded state that combines properties of native and non-native protein folding. In plants, the first de novo gene to be functionally characterized was QQS, an Arabidopsis thaliana gene identified in 2009 that regulates carbon and nitrogen metabolism. The first functionally characterized de novo gene identified in mice, a noncoding RNA gene, was also described in 2009. In primates, a 2008 informatic analysis estimated that 15 of 270 primate orphan genes had been formed de novo. A 2009 report identified the first three de novo human genes, one of which is a therapeutic target in chronic lymphocytic leukemia. Since then, a plethora of genome-level studies have identified large numbers of orphan genes in many organisms, although the extent to which they arose de novo, and the degree to which they can be deemed functional, remain debated.

Identification

Identification of de novo emerging sequences

There are two major approaches to the systematic identification of novel genes: genomic phylostratigraphy and synteny-based methods. Both approaches are widely used, individually or in a complementary fashion.

Genomic phylostratigraphy

Genomic phylostratigraphy involves examining each gene in a focal, or reference, species and inferring the presence or absence of ancestral homologs through the use of the BLAST sequence alignment algorithms or related tools. Each gene in the focal species can be assigned an age (also known as a “conservation level” or “genomic phylostratum”) that is based on a predetermined phylogeny, with the age corresponding to the most distantly related species in which a homolog is detected. When a gene lacks any detectable homolog outside of its own genome, or outside of its close relatives, it is said to be a novel, taxonomically restricted or orphan gene.
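The age-assignment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BLAST-based pipeline itself: the strata in LINEAGE, the gene names, and the sets of homolog hits are all hypothetical, and the similarity search that would produce such hits is assumed to have been run beforehand.

```python
from typing import Dict, List, Set


def assign_phylostratum(hits: Set[str], lineage: List[str]) -> int:
    """Return the index of the most distant stratum containing a homolog.

    `lineage` is ordered from the focal species (index 0) outward to the
    most distantly related group. `hits` holds the strata in which a
    homolog was detected. A gene with no hit beyond the focal species
    gets age 0.
    """
    oldest = 0
    for index, stratum in enumerate(lineage):
        if stratum in hits:
            oldest = index  # keep the most distant stratum seen so far
    return oldest


# Hypothetical predetermined phylogeny, focal species first.
LINEAGE = ["Drosophila melanogaster", "Drosophila (genus)", "Diptera",
           "Insecta", "Metazoa"]

# Hypothetical search results: strata with a detectable homolog per gene.
HOMOLOG_HITS: Dict[str, Set[str]] = {
    "gene_a": {"Drosophila melanogaster", "Diptera", "Metazoa"},
    "gene_b": {"Drosophila melanogaster"},
    "gene_c": {"Drosophila melanogaster", "Drosophila (genus)"},
}

ages = {gene: assign_phylostratum(hits, LINEAGE)
        for gene, hits in HOMOLOG_HITS.items()}

# Genes with no homolog outside the focal genome are orphan candidates.
orphans = [gene for gene, age in ages.items() if age == 0]
```

In this sketch gene_a is assigned to the oldest stratum even though no hit was found in Insecta, because the age follows the most distant homolog detected. An age of 0 marks a candidate only; it does not by itself show that the gene arose de novo.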

Phylostratigraphy is limited by the set of closely related genomes that are available, and results are dependent on BLAST search criteria. In addition, it is often difficult to determine based on lack of observed sequence similarity whether a novel gene has emerged de novo or has diverged from an ancestral gene beyond recognition, for instance following a duplication event. This was pointed out by a study that simulated the evolution of genes of equal age and found that distant orthologs can be undetectable for rapidly evolving genes. On the other hand, when accounting for changes in the rate of evolution in young regions of genes, a phylostratigraphic approach was more accurate at assigning gene ages in simulated data. Subsequent studies using simulated evolution found that phylostratigraphy failed to detect an ortholog in the most distantly related species for 13.9% of D. melanogaster genes and 11.4% of S. cerevisiae genes. However, a reanalysis of studies that used phylostratigraphy in yeast, fruit flies and humans found that even when accounting for such error rates and excluding difficult-to-stratify genes from the analyses, the qualitative conclusions were unaffected. The impact of phylostratigraphic bias on studies examining various features of de novo genes remains debated.

Synteny-based approaches

Synteny-based approaches use order and relative positioning of genes (or other features) to identify the potential ancestors of candidate de novo genes. Syntenic alignments are anchored by conserved “markers.” Genes are the most common marker in defining syntenic blocks, although k-mers and exons are also used. Confirmation that the syntenic region lacks coding potential in outgroup species allows a de novo origin to be asserted with higher confidence. The strongest possible evidence for de novo emergence is the inference of the specific "enabling" mutation(s) that created coding potential, typically through the analysis of smaller sequence regions, termed microsyntenic regions, of closely related species.
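The anchoring idea can be illustrated with a small Python sketch. It assumes that genes are the markers, that each gene name occurs once per genome, and that gene order is available as a simple list; the function name and the gene names are hypothetical, and real tools work on genomic coordinates and alignments rather than name lists.

```python
from typing import List, Optional


def syntenic_interval(focal_order: List[str],
                      outgroup_order: List[str],
                      candidate: str) -> Optional[List[str]]:
    """Find the outgroup region syntenic to a candidate gene.

    The nearest genes flanking `candidate` in the focal genome that are
    also present in the outgroup serve as anchors. The function returns
    the outgroup genes lying between those anchors, or None if a
    flanking anchor is missing on either side.
    """
    position = focal_order.index(candidate)
    shared = set(outgroup_order)

    # Walk outward from the candidate to the closest conserved markers.
    left = next((g for g in reversed(focal_order[:position]) if g in shared), None)
    right = next((g for g in focal_order[position + 1:] if g in shared), None)
    if left is None or right is None:
        return None

    i, j = outgroup_order.index(left), outgroup_order.index(right)
    if i > j:  # tolerate an inverted block in the outgroup
        i, j = j, i
    return outgroup_order[i + 1:j]


# Hypothetical gene orders along one chromosome.
focal = ["g1", "g2", "novel", "g3", "g4"]
outgroup_without_gene = ["g1", "g2", "g3", "g4"]
outgroup_with_gene = ["g1", "g2", "gX", "g3", "g4"]

empty_region = syntenic_interval(focal, outgroup_without_gene, "novel")
occupied_region = syntenic_interval(focal, outgroup_with_gene, "novel")
```

An empty result means that the anchors are adjacent in the outgroup, so no annotated gene occupies the syntenic region. Confirming that the underlying sequence lacks coding potential would still be a separate step.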

One challenge in applying synteny-based methods is that synteny can be difficult to detect across longer timescales. To address this, various optimization techniques have been created, such as using exons clustered irrespective of their specific order to define syntenic blocks or algorithms that use well-conserved genomic regions to expand microsyntenic blocks. There are also difficulties associated with applying synteny-based approaches to genome assemblies that are fragmented or in lineages with high rates of chromosomal rearrangements, as is common in insects. Synteny-based approaches can be applied to genome-wide surveys of de novo genes and represent a promising area of algorithmic development for gene birth dating. Some have combined synteny-based approaches with similarity searches to develop standardized, stringent pipelines that can be applied to any group of genomes, with the aim of resolving discrepancies among the various lists of de novo genes that have been generated.

Determination of status

Even when the evolutionary origin of a particular coding sequence has been established, there is still a lack of consensus about what constitutes a genuine de novo gene birth event. One reason for this is a lack of agreement on whether or not the entirety of the sequence must be non-genic in origin. For protein-coding de novo genes, it has been proposed that de novo genes be divided into subtypes based on the proportion of the ORF in question that was derived from a previously noncoding sequence. Furthermore, for de novo gene birth to occur, the sequence in question must be a gene. This has led to a questioning of what constitutes a gene, with some models establishing a strict dichotomy between genic and non-genic sequences, and others proposing a more fluid continuum.

All definitions of genes are linked to the notion of function, as it is generally agreed that a genuine gene should encode a functional product, be it RNA or protein. There are, however, different views of what constitutes function, depending on whether a given sequence is assessed using genetic, biochemical, or evolutionary approaches. The ambiguity of the concept of ‘function’ is especially problematic for the de novo gene birth field, where the objects of study are often rapidly evolving. To address these challenges, the Pittsburgh Model of Function deconstructs ‘function’ into five meanings to describe the different properties that are acquired by a locus undergoing de novo gene birth: Expression, Capacities, Interactions, Physiological Implications, and Evolutionary Implications.

It is generally accepted that a genuine de novo gene is expressed in at least some context, allowing selection to operate, and many studies use evidence of expression as an inclusion criterion in defining de novo genes. The expression of sequences at the mRNA level may be confirmed individually through techniques such as quantitative PCR, or globally through RNA sequencing (RNA-seq). Similarly, expression at the protein level can be determined with high confidence for individual proteins using techniques such as mass spectrometry or western blotting, while ribosome profiling (Ribo-seq) provides a global survey of translation in a given sample. Ideally, to confirm a gene arose de novo, a lack of expression of the syntenic region of outgroup species would also be demonstrated.

Genetic approaches that detect a specific phenotype or change in fitness upon disruption of a particular sequence are useful to infer function. Other experimental approaches, including screens for protein-protein and/or genetic interactions, may also be employed to confirm a biological effect for a particular de novo ORF.

Evolutionary approaches may be employed to infer the existence of a molecular function from computationally derived signatures of selection. In the case of TRGs, one common signature of selection is the ratio of nonsynonymous to synonymous substitutions (dN/dS ratio), calculated from different species from the same taxon. Similarly, in the case of species-specific genes, polymorphism data may be used to calculate a pN/pS ratio from different strains or populations of the focal species. Given that young, species-specific de novo genes lack deep conservation by definition, detecting statistically significant deviations from 1 can be difficult without an unrealistically large number of sequenced strains/populations. An example of this can be seen in Mus musculus, where three very young de novo genes lack signatures of selection despite well-demonstrated physiological roles. For this reason, pN/pS approaches are often applied to groups of candidate genes, allowing researchers to infer that at least some of them are evolutionarily conserved, without being able to specify which. Other signatures of selection, such as the degree of nucleotide divergence within syntenic regions, conservation of ORF boundaries, or for protein-coding genes, a coding score based on nucleotide hexamer frequencies, have instead been employed.
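The pooling of candidate genes mentioned above can be made concrete with a short Python sketch. It assumes that the counts of nonsynonymous and synonymous polymorphisms and sites have already been obtained for each gene by some upstream method; the class and function names are hypothetical, and the numbers are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GeneCounts:
    nonsyn_changes: float  # observed nonsynonymous polymorphisms
    syn_changes: float     # observed synonymous polymorphisms
    nonsyn_sites: float    # number of nonsynonymous sites
    syn_sites: float       # number of synonymous sites


def pn_ps(counts: GeneCounts) -> Optional[float]:
    """Return pN/pS, or None when the ratio is undefined."""
    if counts.nonsyn_sites <= 0 or counts.syn_sites <= 0 or counts.syn_changes == 0:
        return None
    pn = counts.nonsyn_changes / counts.nonsyn_sites
    ps = counts.syn_changes / counts.syn_sites
    return pn / ps


def pooled_pn_ps(genes: Iterable[GeneCounts]) -> Optional[float]:
    """Sum counts over a group of genes, then compute one ratio."""
    total = GeneCounts(0.0, 0.0, 0.0, 0.0)
    for gene in genes:
        total.nonsyn_changes += gene.nonsyn_changes
        total.syn_changes += gene.syn_changes
        total.nonsyn_sites += gene.nonsyn_sites
        total.syn_sites += gene.syn_sites
    return pn_ps(total)


# Invented counts for three short, young candidate genes.
candidates = [
    GeneCounts(2, 1, 200, 70),
    GeneCounts(1, 0, 150, 50),  # no synonymous change: ratio undefined
    GeneCounts(3, 4, 300, 100),
]

per_gene = [pn_ps(gene) for gene in candidates]
pooled = pooled_pn_ps(candidates)
```

With so few polymorphisms per gene, individual ratios are noisy or undefined, which is why the pooled value is the more informative quantity here. Whether a pooled ratio differs significantly from 1 would still require a statistical test that this sketch does not include.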

Prevalence

Estimates of numbers

Frequency and number estimates of de novo genes in various lineages vary widely and are highly dependent on methodology. Studies may identify de novo genes by phylostratigraphy/BLAST-based methods alone, or may employ a combination of computational techniques, and may or may not assess experimental evidence for expression and/or biological role. Furthermore, genome-scale analyses may consider all or most ORFs in the genome, or may instead limit their analysis to previously annotated genes.

The D. melanogaster lineage is illustrative of these differing approaches. An early survey using a combination of BLAST searches performed on cDNA sequences along with manual searches and synteny information identified 72 new genes specific to D. melanogaster and 59 new genes specific to three of the four species in the D. melanogaster species complex. This report found that only 2/72 (~2.8%) of D. melanogaster-specific new genes and 7/59 (~11.9%) of new genes specific to the species complex were derived de novo, with the remainder arising via duplication/retroposition. Similarly, an analysis of 195 young (<35 million years old) D. melanogaster genes identified from syntenic alignments found that only 16 had arisen de novo. In contrast, an analysis focused on transcriptomic data from the testes of six D. melanogaster strains identified 106 fixed and 142 segregating de novo genes. For many of these, ancestral ORFs were identified but were not expressed. A newer study found that up to 39% of orphan genes in the Drosophila clade may have emerged de novo, as they overlap with non-coding regions of the genome. Highlighting the differences between inter- and intra-species comparisons, a study in natural Saccharomyces paradoxus populations found that the number of de novo polypeptides identified more than doubled when considering intra-species diversity. In primates, one early study identified 270 orphan genes (unique to humans, chimpanzees, and macaques), of which 15 were thought to have originated de novo. Later reports identified many more de novo genes in humans alone that are supported by transcriptional and proteomic evidence. Studies in other lineages/organisms have also reached different conclusions with respect to the number of de novo genes present in each organism, as well as the specific sets of genes identified. A sample of these large-scale studies is described in the table below.

Generally speaking, it remains debated whether duplication and divergence or de novo gene birth represents the dominant mechanism for the emergence of new genes, in part because de novo genes are likely to both emerge and be lost more frequently than other young genes. In a study on the origin of orphan genes in three different eukaryotic lineages, the authors found that on average only around 30% of orphan genes can be explained by sequence divergence.

Dynamics

It is important to distinguish between the frequency of de novo gene birth and the number of de novo genes in a given lineage. If de novo gene birth is frequent, it might be expected that genomes would tend to grow in their gene content over time; however, the gene content of genomes is usually relatively stable. This implies that a frequent gene death process must balance de novo gene birth, and indeed, de novo genes are distinguished by their rapid turnover relative to established genes. In support of this notion, recently emerged Drosophila genes are much more likely to be lost, primarily through pseudogenization, with the youngest orphans being lost at the highest rate; this is despite the fact that some Drosophila orphan genes have been shown to rapidly become essential. A similar trend of frequent loss among young gene families was observed in the nematode genus Pristionchus. Similarly, an analysis of five mammalian transcriptomes found that most ORFs in mice were either very old or species specific, implying frequent birth and death of de novo transcripts. A comparable trend could be shown by further analyses of six primate transcriptomes. In wild S. paradoxus populations, de novo ORFs emerge and are lost at similar rates. Nevertheless, there remains a positive correlation between the number of species-specific genes in a genome and the evolutionary distance from its most recent ancestor. A rapid gain and loss of de novo genes was also found on a population level by analyzing nine natural three-spined stickleback populations. In addition to the birth and death of de novo genes at the level of the ORF, mutational and other processes also subject genomes to constant “transcriptional turnover”. One study in murines found that while all regions of the ancestral genome were transcribed at some point in at least one descendant, the portion of the genome under active transcription in a given strain or subspecies is subject to rapid change. 
The transcriptional turnover of noncoding RNA genes is particularly fast compared to coding genes.
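The argument that frequent gene death must balance frequent gene birth can be illustrated with a toy model in Python. The model and its parameters are assumptions made for illustration and are not taken from any of the studies above: a constant number of genes is born per time step, and each existing de novo gene is lost with a fixed probability per step.

```python
from typing import List


def simulate_gene_count(births_per_step: float,
                        loss_rate: float,
                        steps: int,
                        initial_count: float = 0.0) -> List[float]:
    """Iterate N(t+1) = N(t) + births_per_step - loss_rate * N(t).

    Returns the expected number of de novo genes at every step,
    including the starting value.
    """
    count = initial_count
    history = [count]
    for _ in range(steps):
        count = count + births_per_step - loss_rate * count
        history.append(count)
    return history


# Illustrative parameters: 10 births per step, 10% loss per gene per step.
history = simulate_gene_count(births_per_step=10.0, loss_rate=0.1, steps=200)

# At equilibrium births equal losses, so N* = births_per_step / loss_rate.
equilibrium = 10.0 / 0.1
```

Under these assumptions the gene count stops growing once losses match births and settles at births_per_step / loss_rate, so a high birth rate is compatible with a stable gene content as long as turnover is rapid.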

Example de novo gene table

| Organism/Lineage | Gene | Evidence of de novo origin | Evidence of selection | Phenotypic evidence | Year discovered | Notes |
|---|---|---|---|---|---|---|
| Arabidopsis thaliana | QQS | | N/A | Excess leaf starch in RNAi knockdowns | 2009 | |
| Drosophila | CG9284 | Syntenic alignments of 12 Drosophila species | | RNAi knockdown is lethal | 2010 | |
| Drosophila | CG30395 | Syntenic alignments of 12 Drosophila species | | RNAi knockdown is lethal | 2010 | |
| Drosophila | CG31882 | Syntenic alignments of 12 Drosophila species | | RNAi knockdown is lethal | 2010 | |
| Drosophila | CG31406 | tBLASTn of protein-coding regions to all 12 Drosophila genomes and comparison of BLASTZ alignments | dN/dS <1 indicates purifying selection | RNAi knockdown inhibits fertility | 2013 | |
| Drosophila | CG32582 | tBLASTn of protein-coding regions to all 12 Drosophila genomes and comparison of BLASTZ alignments | Possible positive selection but not statistically significant | RNAi knockdown inhibits fertility | 2013 | |
| Drosophila | CG33235 | tBLASTn of protein-coding regions to all 12 Drosophila genomes and comparison of BLASTZ alignments | dN/dS <1 indicates purifying selection | RNAi knockdown inhibits fertility | 2013 | |
| Drosophila | CG34434 | tBLASTn of protein-coding regions to all 12 Drosophila genomes and comparison of BLASTZ alignments | dN/dS <1 indicates purifying selection | RNAi knockdown inhibits fertility | 2013 | |
| Drosophila melanogaster | goddard | Genome-wide tblastn searches and LASTZ- and Exonerate-based analyses of the syntenic regions | | Essential for individualization of elongated spermatids; RNAi knockdown experiments in male flies | 2017 | Structure prediction: half disordered, half alpha-helical |
| Drosophila simulans and Drosophila sechellia | Dsim_GD19764 and Dsec_GM10790 | Exonerate-based analyses of the syntenic regions | Conservation across two sister species | Testes expression | 2020 | Born inside intron of another gene, contains conserved intron present at time of birth (length not multiple of 3). Structure prediction: contains a transmembrane alpha helix |
| Gadidae | AFGP | Examination of Gadid phylogeny | Gene multiplied in Gadid species in colder habitats but decayed in species not under threat of freezing | Inhibits ice growth | | Function is similar to other antifreeze proteins that evolved independently |
| Mus | Gm13030 | Combined phylostratigraphy and synteny approach | ORF only retained in M. m. musculus and M. m. castaneus populations; no evidence of positive selection | Knockout mutant has irregular pregnancy cycles | 2019 | |
| Mus | Poldi | Homologous region not expressed in closely related and outgroup species | Evidence of recent selective sweep in M. m. musculus | Knockout mutant has reduced sperm motility and testis weight | 2009 | RNA gene |
| Placental mammals | ORF-Y | PhyloCSF of POLG gene in Homo sapiens, synonymous site conservation across mammals, and tBLASTN of mammals, sauropsids, amphibians, and teleost fish | Disappearance of enhanced synonymous site conservation within the POLG ORF after the ORF-Y’s stop codon and high conservation of the initiation context of the start codon indicate purifying selection | 41 ClinVar variants that affect the ORF-Y peptide but not the amino acid sequence of POLG | 2020 | |
| Saccharomyces cerevisiae | BSC4 | tBLASTN and syntenic alignments of closely related species | Under negative selection based on population data | Has two synthetic lethal partners | 2008 | Adopts a partially specific three-dimensional structure |
| Saccharomyces cerevisiae | MDF1 | Only identified putative homologs are truncated, non-expressed, non-functional ORFs | Fixed in 39 diverse strains, no frameshift or nonsense mutations | Decreases mating efficiency by binding MATα2; promotes growth through an interaction with Snf1 | 2010 | Expression is suppressed by its antisense gene |

Features

General features

Recently emerged de novo genes differ from established genes in a number of ways. Across a broad range of species, young and/or taxonomically restricted genes have been reported to be shorter in length than established genes, to evolve more rapidly, and to be less expressed. Although these trends could be a result of homology detection bias, a reanalysis of several studies that accounted for this bias found that the qualitative conclusions reached were unaffected. Another feature includes the tendency for young genes to have their hydrophobic amino acids more clustered near one another along the primary sequence.

The expression of young genes has also been found to be more tissue- or condition-specific than that of established genes. In particular, relatively high expression of de novo genes was observed in male reproductive tissues in Drosophila, stickleback, mice, and humans, and in the human brain. In animals with adaptive immune systems, higher expression in the brain and testes may be a function of the immune-privileged nature of these tissues. An analysis in mice found specific expression of intergenic transcripts in the thymus and spleen (in addition to the brain and testes). It has been proposed that in vertebrates de novo transcripts must first be expressed in tissues lacking immune cells before they can be expressed in tissues that have immune surveillance.

Features that promote de novo gene birth

It is also of interest to compare features of recently emerged de novo genes to the pool of non-genic ORFs from which they emerge. Theoretical modeling has shown that such differences are the product both of selection for features that increase the likelihood of functionalization, and of neutral evolutionary forces that influence allelic turnover. Experiments in S. cerevisiae showed that predicted transmembrane domains were strongly associated with beneficial fitness effects when young ORFs were overexpressed, but not when established (older) ORFs were overexpressed. Experiments in E. coli showed that random peptides tended to have more benign effects when they were enriched for amino acids that were small and that promoted intrinsic structural disorder.

Lineage-dependent features

Features of de novo genes can depend on the species or lineage being examined. This appears to be partly a result of varying GC content in genomes and of the fact that young genes bear more similarity to non-genic sequences from the genome in which they arose than do established genes. Features of the resulting protein, such as the percentage of transmembrane residues and the relative frequency of various predicted secondary structural features, show a strong GC dependency in orphan genes, whereas in more ancient genes these features are only weakly influenced by GC content.

The relationship between gene age and the amount of predicted intrinsic structural disorder (ISD) in the encoded proteins has been subject to considerable debate. It has been claimed that ISD is also a lineage-dependent feature, exemplified by the fact that in organisms with relatively high GC content, ranging from D. melanogaster to the parasite Leishmania major, young genes have high ISD, while in a low GC genome such as budding yeast, several studies have shown that young genes have low ISD. However, a study that excluded young genes with dubious evidence for functionality, defined in binary terms as being under selection for gene retention, found that the remaining young yeast genes have high ISD, suggesting that the yeast result may be due to contamination of the set of young genes with ORFs that do not meet this definition, and hence are more likely to have properties that reflect GC content and other non-genic features of the genome. Beyond the very youngest orphans, this study found that ISD tends to decrease with increasing gene age, and that this is primarily due to amino acid composition rather than GC content. Within shorter time scales, analyses restricted to the de novo genes with the most validation suggest that younger genes are more disordered in Lachancea, but less disordered in Saccharomyces. Intrinsic structural disorder and aggregation propensity did not show significant differences with age in some studies of mammals and primates, but did in other studies of mammals. One large study of the entire Pfam protein domain database showed enrichment of younger protein domains for disorder-promoting amino acids across animals, but enrichment on the basis of amino acid availability in plants.

Role of epigenetic modifications

An examination of de novo genes in A. thaliana found that they are both hypermethylated and generally depleted of histone modifications. In agreement with either the proto-gene model or contamination with non-genes, methylation levels of de novo genes were intermediate between established genes and intergenic regions. The methylation patterns of these de novo genes are stably inherited, and methylation levels were highest, and most similar to established genes, in de novo genes with verified protein-coding ability. In the pathogenic fungus Magnaporthe oryzae, less conserved genes tend to have methylation patterns associated with low levels of transcription. A study in yeasts also found that de novo genes are enriched at recombination hotspots, which tend to be nucleosome-free regions.

In Pristionchus pacificus, orphan genes with confirmed expression display chromatin states that differ from those of similarly expressed established genes. Orphan gene start sites have epigenetic signatures that are characteristic of enhancers, in contrast to conserved genes that exhibit classical promoters. Many unexpressed orphan genes are decorated with repressive histone modifications, while a lack of such modifications facilitates transcription of an expressed subset of orphans, supporting the notion that open chromatin promotes the formation of novel genes.

Structural features

As structure is usually more conserved than sequence, comparing structures between orthologs could provide deeper insights into de novo gene emergence and evolution and help to confirm these genes as true de novo genes. Nevertheless, so far only very few de novo proteins have been structurally and functionally characterized, largely because of problems with protein purification and subsequent stability. Progress has been made using different purification tags, cell types and chaperones.

The ‘antifreeze glycoprotein’ (AFGP) in Arctic codfishes prevents their blood from freezing in arctic waters. Bsc4, a short non-essential de novo protein in yeast, has been shown to be composed mainly of beta-sheets and to have a hydrophobic core. It is associated with DNA repair under nutrient-deficient conditions. The Drosophila de novo protein Goddard was first characterized in 2017. Male Drosophila melanogaster flies in which Goddard was knocked down were unable to produce sperm. More recently, this defect was shown to result from failure of individualization of elongated spermatids. Using computational phylogenomic and structure predictions, experimental structural analyses, and cell biological assays, it was proposed that half of Goddard's structure is disordered and the other half is alpha-helical. These analyses gave similar results for Goddard's orthologs. Goddard's structure therefore appears to have been mainly conserved since its emergence.

Mechanisms

Pervasive expression

With the development of technologies such as RNA-seq and Ribo-seq, eukaryotic genomes are now known to be pervasively transcribed and translated. Many ORFs that are either unannotated, or annotated as long non-coding RNAs (lncRNAs), are translated at some level, in a condition- or tissue-specific manner. Though infrequent, these translation events expose non-genic sequence to selection. This pervasive expression forms the basis for several models describing de novo gene birth.

It has been speculated that the epigenetic landscape of de novo genes in the early stages of formation may be particularly variable within and among populations, resulting in variable gene expression and thereby allowing young genes to explore the “expression landscape.” The QQS gene in A. thaliana is one example of this phenomenon; its expression is negatively regulated by DNA methylation that, while heritable for several generations, varies widely in its levels both among natural accessions and within wild populations. Epigenetic mechanisms are also largely responsible for the permissive transcriptional environment in the testes, particularly through the incorporation into nucleosomes of non-canonical histone variants that are replaced by histone-like protamines during spermatogenesis.

Intergenic ORFs as elementary structural modules

Analysis of the fold potential diversity shows that the majority of the amino acid sequences encoded by the intergenic ORFs of S. cerevisiae are predicted to be foldable. More importantly, these amino acid sequences with folding potential can serve as elementary building blocks for de novo genes or integrate into pre-existing genes.

Order of events

For birth of a de novo protein-coding gene to occur, a non-genic sequence must both be transcribed and acquire an ORF before becoming translated. These events could occur in either order, and there is evidence supporting both an “ORF first” and a “transcription first” model. An analysis of de novo genes that are segregating in D. melanogaster found that sequences that are transcribed had similar coding potential to the orthologous sequences from lines lacking evidence of transcription. This finding supports the notion that many ORFs can exist prior to being transcribed. The antifreeze glycoprotein gene AFGP, which emerged de novo in Arctic codfishes, provides a more definitive example in which the de novo emergence of the ORF was shown to precede that of the promoter region. Furthermore, putatively non-genic ORFs long enough to encode functional peptides are numerous in eukaryotic genomes, and are expected to occur at high frequency by chance. By tracing the evolutionary history of ORF sequences and the transcriptional activation of human de novo genes, one study showed that some ORFs were ready to confer biological significance upon their birth. At the same time, transcription of eukaryotic genomes is far more extensive than previously thought, and there are documented examples of genomic regions that were transcribed prior to the appearance of an ORF that became a de novo gene. The proportion of de novo genes that are protein-coding is unknown, but the appearance of “transcription first” has led some to posit that protein-coding de novo genes may first exist as RNA gene intermediates. The case of bifunctional RNAs, which are both translated and function as RNA genes, shows that such a mechanism is plausible.
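The claim that ORFs arise frequently by chance can be explored with a minimal sketch that scans a random DNA string for open reading frames. The function names, the 30-codon threshold, the uniform base composition and the restriction to the forward strand are all assumptions made for illustration; none of them comes from the studies summarized above.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    All three reading frames are scanned. An ORF is reported only if it
    contains at least `min_codons` codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the candidate and keep scanning
    return orfs


def random_dna(length, seed=0):
    """Generate a random sequence with equal base frequencies (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    seq = random_dna(100_000)
    print(len(find_orfs(seq)), "forward-strand ORFs of at least 30 codons")
```

Real intergenic sequence is not uniformly random, so counts from such a toy model only indicate that chance ORFs are expected, not how many a given genome contains.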

The two events may occur simultaneously when chromosomal rearrangement is the event that precipitates gene birth.

Models

Several theoretical models and possible mechanisms of de novo gene birth have been described. The models are generally not mutually exclusive, and it is possible that multiple mechanisms may give rise to de novo genes. An example is the type III antifreeze protein gene, which originates from an old sialic acid synthase (SAS) gene, in an Antarctic zoarcid fish.

“Out of Testis” hypothesis

An early case study of de novo gene birth, which identified five de novo genes in D. melanogaster, noted preferential expression of these genes in the testes, and several additional de novo genes were identified using transcriptomic data derived from the testes and male accessory glands of D. yakuba and D. erecta. This is in agreement with other studies that showed there is rapid evolution of genes related to reproduction across a range of lineages, suggesting that sexual selection may play a key role in adaptive evolution and de novo gene birth. A subsequent large-scale analysis of six D. melanogaster strains identified 248 testis-expressed de novo genes, of which ~57% were not fixed. A recent study on twelve Drosophila species additionally identified a higher proportion of de novo genes with testis-biased expression compared to the annotated proteome. It has been suggested that the large number of de novo genes with male-specific expression identified in Drosophila is likely due to the fact that such genes are preferentially retained relative to other de novo genes, for reasons that are not entirely clear. Interestingly, two putative de novo genes in Drosophila (Goddard and Saturn) were shown to be required for normal male fertility. A genetic screen of over 40 putative de novo genes with testis-enriched expression in Drosophila melanogaster revealed that one of the de novo genes, atlas, was required for proper chromatin condensation during the final stages of spermatogenesis in males. atlas evolved from the fusion of a protein-coding gene that arose at the base of the Drosophila genus and a conserved non-coding RNA. Comparative analysis of the transcriptomes of testis and accessory glands, a somatic tissue of males that is important for fertility, of D. melanogaster suggests that de novo genes make a greater contribution to the transcriptomic complexity of testis as compared to accessory glands. Single-cell RNA-seq of D. melanogaster testis revealed that the expression pattern of de novo genes was biased toward early spermatogenesis.

In humans, a study that identified 60 human-specific de novo genes found that their average expression, as measured by RNA-seq, was highest in the testes. Another study looking at mammalian-specific genes more generally also found enriched expression in the testes. Transcription in mammalian testes is thought to be particularly promiscuous, due in part to elevated expression of the transcription machinery and an open chromatin environment. Along with the immune-privileged nature of the testes, this promiscuous transcription is thought to create the ideal conditions for the expression of non-genic sequences required for de novo gene birth. Testes-specific expression seems to be a general feature of all novel genes, as an analysis of Drosophila and vertebrate species found that young genes showed testes-biased expression regardless of their mechanism of origination.

Preadaptation model

The preadaptation model of de novo gene birth uses mathematical modeling to show that when sequences that are normally hidden are exposed to weak or shielded selection, the resulting pool of “cryptic” sequences (i.e. proto-genes) can be purged of “self-evidently deleterious” variants, such as those prone to lead to protein aggregation, and thus enriched in potential adaptations relative to a completely non-expressed and unpurged set of sequences. This revealing and purging of cryptic deleterious non-genic sequences is a byproduct of pervasive transcription and translation of intergenic sequences, and is expected to facilitate the birth of functional de novo protein-coding genes. This is because by eliminating the most deleterious variants, what is left is, by a process of elimination, more likely to be adaptive than expected from random sequences. Using the evolutionary definition of function (i.e. that a gene is by definition under purifying selection against loss), the preadaptation model assumes that “gene birth is a sudden transition to functionality” that occurs as soon as an ORF acquires a net beneficial effect. In order to avoid being deleterious, newborn genes are expected to display exaggerated versions of genic features associated with the avoidance of harm. This is in contrast to the proto-gene model, which expects newborn genes to have features intermediate between old genes and non-genes.

The mathematics of the preadaptation model assume that the distribution of fitness effects is bimodal, with new sequences of mutations tending to break something or tinker, but rarely in between. Following this logic, populations may either evolve local solutions, in which selection operates on each individual locus and a relatively high error rate is maintained, or a global solution with a low error rate which permits the accumulation of deleterious cryptic sequences. De novo gene birth is thought to be favored in populations that evolve local solutions, as the relatively high error rate will result in a pool of cryptic variation that is “preadapted” through the purging of deleterious sequences. Local solutions are more likely in populations with a high effective population size.

In support of the preadaptation model, an analysis of ISD in mice and yeast found that young genes have higher ISD than old genes, while random non-genic sequences tend to show the lowest levels of ISD. Although the observed trend may have partly resulted from a subset of young genes derived by overprinting, higher ISD in young genes is also seen among overlapping viral gene pairs. With respect to other predicted structural features such as β-strand content and aggregation propensity, the peptides encoded by proto-genes are similar to non-genic sequences and categorically distinct from canonical genes.

Proto-gene model

The proto-gene model agrees with the preadaptation model about the importance of pervasive expression, and refers to the set of pervasively expressed sequences that do not meet all definitions of a gene as “proto-genes”. In contrast to the preadaptation model, the proto-gene model suggests that newborn genes have features intermediate between old genes and non-genes. Specifically, this model envisages a more gradual process under selection from non-genic to genic state, rejecting the binary classification of gene and non-gene.

In an extension of the proto-gene model, it has been proposed that as proto-genes become more gene-like, their potential for adaptive change gives way to selected effects; thus, the predicted impact of mutations on fitness is dependent on the evolutionary status of the ORF. This notion is supported by the fact that overexpression of established ORFs in S. cerevisiae tends to be less beneficial (and more harmful) than does overexpression of emerging ORFs.

Several features of ORFs correlate with ORF age as determined by phylostratigraphic analysis, with young ORFs having properties intermediate between old ORFs and non-genes; this has been taken as evidence in favor of the proto-gene model, in which proto-gene state is a continuum. This evidence has been criticized, because the same apparent trends are also expected under a model in which identity as a gene is binary. Under this model, when each age group contains a different ratio of genes vs. non-genes, Simpson's paradox can generate correlations in the wrong direction.
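A small numerical sketch can make the Simpson's paradox argument concrete. All numbers below are hypothetical and were chosen only to illustrate the effect; the trait could stand for any predicted property such as ISD. Among true genes the trait falls with age, yet pooling genes with non-genes reverses the apparent trend because the youngest group contains the most non-genes.

```python
# Hypothetical values for illustration only; they are not data.
# Each entry: (age group, fraction of ORFs that are real genes,
#              mean trait value among the real genes in that group)
AGE_GROUPS = [
    ("young", 0.2, 0.60),
    ("middle", 0.6, 0.50),
    ("old", 1.0, 0.40),
]
NON_GENE_MEAN = 0.20  # assumed to be the same in every age group


def pooled_mean(gene_fraction, gene_mean, non_gene_mean=NON_GENE_MEAN):
    """Mean trait value when genes and non-genes are analysed together."""
    return gene_fraction * gene_mean + (1.0 - gene_fraction) * non_gene_mean


def trend(values):
    """Classify a sequence of numbers as increasing, decreasing or mixed."""
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "increasing"
    if all(a > b for a, b in pairs):
        return "decreasing"
    return "mixed"


genes_only = [gene_mean for _, _, gene_mean in AGE_GROUPS]
pooled = [pooled_mean(frac, gene_mean) for _, frac, gene_mean in AGE_GROUPS]

print("genes only, young to old:", trend(genes_only))
print("genes plus non-genes, young to old:", trend(pooled))
```

In this toy setting every ORF is either a gene or a non-gene, so the intermediate pooled values arise purely from the changing mixture, not from any intermediate state.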

Grow slow and moult model

The “grow slow and moult” model describes a potential mechanism of de novo gene birth, particular to protein-coding genes. In this scenario, existing protein-coding ORFs expand at their ends, especially their 3’ ends, leading to the creation of novel N- and C-terminal domains. Novel C-terminal domains may first evolve under weak selection via occasional expression through read-through translation, as in the preadaptation model, only later becoming constitutively expressed through a mutation that disrupts the stop codon. Genes experiencing high translational readthrough tend to have intrinsically disordered C-termini. Furthermore, existing genes are often close to repetitive sequences that encode disordered domains. These novel, disordered domains may initially confer some non-specific binding capability that becomes gradually refined by selection. Sequences encoding these novel domains may occasionally separate from their parent ORF, leading or contributing to the creation of a de novo gene. Interestingly, an analysis of 32 insect genomes found that novel domains (i.e. those unique to insects) tend to evolve fairly neutrally, with only a few sites under positive selection, while their host proteins remain under purifying selection, suggesting that new functional domains emerge gradually and somewhat stochastically.

Escape from adaptive conflict

The evolutionary model escape from adaptive conflict (EAC) proposes a possible route by which a new gene duplication becomes fixed: conflict between contrasting functions within a single gene drives the fixation of the duplicate.

Human health

In addition to its significance for the field of evolutionary biology, de novo gene birth has implications for human health. It has been speculated that novel genes, including de novo genes, may play an outsized role in species-specific traits; however, many species-specific genes lack functional annotation. Nevertheless, there is evidence to suggest that human-specific de novo genes are involved in diseases such as cancer. NYCM, a de novo gene unique to humans and chimpanzees, regulates the pathogenesis of neuroblastomas in mouse models, and the primate-specific PART1, an lncRNA gene, has been identified as both a tumor suppressor and an oncogene in different contexts. Several other human- or primate-specific de novo genes, including PBOV1, GR6, MYEOV, ELFN1-AS1, and CLLU1, are also linked to cancer. Some have even suggested considering tumor-specifically expressed, evolutionary novel genes as their own class of genetic elements, noting that many such genes are under positive selection and may be neofunctionalized in the context of tumors.

The specific expression of many de novo genes in the human brain also raises the intriguing possibility that de novo genes influence human cognitive traits. One such example is FLJ33706, a de novo gene that was identified in GWAS and linkage analyses for nicotine addiction and shows elevated expression in the brains of Alzheimer's patients. Generally speaking, expression of young, primate-specific genes is enriched in the fetal human brain relative to the expression of similarly young genes in the mouse brain. Most of these young genes, several of which originated de novo, are expressed in the neocortex, which is thought to be responsible for many aspects of human-specific cognition. Many of these young genes show signatures of positive selection, and functional annotations indicate that they are involved in diverse molecular processes, but are enriched for transcription factors.

In addition to their roles in cancer processes, de novo originated human genes have been implicated in the maintenance of pluripotency and in immune function. The preferential expression of de novo genes in the testes is also suggestive of a role in reproduction. Given that the function of many de novo human genes remains uncharacterized, it seems likely that an appreciation of their contribution to human health and development will continue to grow.

Genome evolution


Genome evolution is the process by which a genome changes in structure (sequence) or size over time. The study of genome evolution involves multiple fields such as structural analysis of the genome, the study of genomic parasites, gene and ancient genome duplications, polyploidy, and comparative genomics. Genome evolution is a constantly changing and evolving field due to the steadily growing number of sequenced genomes, both prokaryotic and eukaryotic, available to the scientific community and the public at large.

Circular representation of the Mycobacterium leprae genome created using JCVI online genome tools.

History

Since the first sequenced genomes became available in the late 1970s, scientists have been using comparative genomics to study the differences and similarities between various genomes. Genome sequencing has progressed over time to include more and more complex genomes, including the eventual sequencing of the entire human genome in 2001. By comparing the genomes of both close and distant relatives, scientists began to uncover the stark differences and similarities between species, as well as the mechanisms by which genomes evolve over time.

Prokaryotic and eukaryotic genomes

Prokaryotes

The principal forces of evolution in prokaryotes and their effects on archaeal and bacterial genomes. The horizontal line shows archaeal and bacterial genome size on a logarithmic scale (in megabase pairs) and the approximate corresponding number of genes (in parentheses). The effects of the main forces of prokaryotic genome evolution are denoted by triangles that are positioned, roughly, over the ranges of genome size for which the corresponding effects are thought to be most pronounced.

Prokaryotic genomes have two main mechanisms of evolution: mutation and horizontal gene transfer. A third mechanism, sexual reproduction, is prominent in eukaryotes and also occurs in bacteria. Prokaryotes can acquire novel genetic material through the process of bacterial conjugation, in which both plasmids and whole chromosomes can be passed between organisms. An often-cited example of this process is the transfer of antibiotic resistance via plasmid DNA. Another mechanism of genome evolution is provided by transduction, whereby bacteriophages introduce new DNA into a bacterial genome. The main mechanism of sexual interaction is natural genetic transformation, which involves the transfer of DNA from one prokaryotic cell to another through the intervening medium. Transformation is a common mode of DNA transfer, and at least 67 prokaryotic species are known to be competent for transformation.

Genome evolution in bacteria is well understood because of the thousands of completely sequenced bacterial genomes available. Genetic changes may lead to either increases or decreases of genomic complexity due to adaptive genome streamlining and purifying selection. In general, free-living bacteria have evolved larger genomes with more genes so they can adapt more easily to changing environmental conditions. By contrast, most parasitic bacteria have reduced genomes because their hosts supply many if not most nutrients, so their genomes do not need to encode the enzymes that produce these nutrients.

Characteristic | E. coli genome | Human genome
Genome size (base pairs) | 4.6 Mb | 3.2 Gb
Genome structure | Circular | Linear
Number of chromosomes | 1 | 46
Presence of plasmids | Yes | No
Presence of histones | No | Yes
DNA segregated in the nucleus | No | Yes
Number of genes | 4,288 | 20,000
Presence of introns | No* | Yes
Average gene size | 700 bp | 27,000 bp
* E. coli genes largely contain only exons. However, the genome does contain a small number of self-splicing introns (Group II).

Eukaryotes

Eukaryotic genomes are generally larger than those of prokaryotes. While the E. coli genome is roughly 4.6 Mb in length, the human genome is much larger, at approximately 3.2 Gb. The eukaryotic genome is linear and can be composed of multiple chromosomes, packaged in the nucleus of the cell. The non-coding portions of a gene, known as introns, which are largely absent from prokaryotes, are removed by RNA splicing before translation of the protein can occur. Eukaryotic genomes evolve over time through many mechanisms, including sexual reproduction, which introduces much greater genetic diversity to the offspring than the usual prokaryotic process of replication, in which the offspring are theoretically genetic clones of the parental cell.

Genome size

Genome size is usually measured in base pairs (or bases in single-stranded DNA or RNA). The C-value is another measure of genome size. Research on prokaryotic genomes shows that there is a significant positive correlation between the C-value of prokaryotes and the number of genes that compose the genome. This indicates that gene number is the main factor influencing the size of the prokaryotic genome. In eukaryotic organisms, a paradox is observed: the number of genes that make up the genome does not correlate with genome size. In other words, the genome size is much larger than would be expected given the total number of protein-coding genes.
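The mismatch between genome size and gene number can be checked with simple arithmetic on the E. coli and human figures from the comparison table earlier in this article. The following is only a back-of-the-envelope sketch; the dictionary layout and function name are illustrative assumptions.

```python
# Genome sizes and gene counts as given in the comparison table of this article.
GENOMES = {
    "E. coli": {"size_bp": 4.6e6, "genes": 4288},
    "human": {"size_bp": 3.2e9, "genes": 20000},
}


def bp_per_gene(genome):
    """Average number of base pairs of genome per gene."""
    return genome["size_bp"] / genome["genes"]


# How many times larger the human genome is, and how many times more genes it has.
size_ratio = GENOMES["human"]["size_bp"] / GENOMES["E. coli"]["size_bp"]
gene_ratio = GENOMES["human"]["genes"] / GENOMES["E. coli"]["genes"]

for name, genome in GENOMES.items():
    print(f"{name}: about {bp_per_gene(genome):,.0f} bp of genome per gene")
print(f"genome size ratio (human / E. coli): {size_ratio:.0f}")
print(f"gene number ratio (human / E. coli): {gene_ratio:.1f}")
```

With these figures the human genome is several hundred times larger than that of E. coli while carrying fewer than five times as many genes, which is the disproportion the paragraph above describes.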

Genome size can increase by duplication, insertion, or polyploidization. Recombination can lead to either DNA loss or gain. Genomes can also shrink because of deletions. A famous example of such gene decay is the genome of Mycobacterium leprae, the causative agent of leprosy. M. leprae has lost many once-functional genes over time due to the formation of pseudogenes. This is evident from comparison with its close relative Mycobacterium tuberculosis. M. leprae lives and replicates inside a host, and because of this arrangement it no longer needs many of the genes it once carried, which allowed it to live and prosper outside the host. Over time these genes have lost their function through mechanisms such as mutation, becoming pseudogenes. It is beneficial for an organism to rid itself of non-essential genes because doing so makes replicating its DNA much faster and requires less energy.

An example of increasing genome size over time is seen in filamentous plant pathogens. These plant pathogen genomes have been growing larger over time due to repeat-driven expansion. The repeat-rich regions contain genes coding for host interaction proteins. With the addition of more and more repeats to these regions, the pathogens increase the possibility of developing new virulence factors through mutation and other forms of genetic recombination. In this way it is beneficial for these plant pathogens to have larger genomes.

Chromosomal evolution

Chromosome fusion, leading to a reduced number of chromosomes (here a fused human chromosome 2, with 2 separate chromosomes still present in chimpanzees and other apes).

The evolution of genomes can be impressively shown by the change of chromosome number and structure over time. For instance, the ancestral chromosomes corresponding to chimpanzee chromosomes 2A and 2B fused to produce human chromosome 2. Similarly, comparisons with more distantly related species show chromosomes that have been broken up into more parts over the course of evolution. This can be demonstrated by fluorescence in situ hybridization.

Mechanisms

Gene duplication

Gene duplication is the process by which a region of DNA coding for a gene is duplicated. This can occur as the result of an error in recombination or through a retrotransposition event. Duplicate genes are often immune to the selective pressure under which genes normally exist. As a result, a large number of mutations may accumulate in the duplicate gene code. This may render the gene non-functional or in some cases confer some benefit to the organism.

Whole genome duplication

Similar to gene duplication, whole genome duplication is the process by which an organism's entire genetic information is copied once or multiple times, a condition known as polyploidy. This may provide an evolutionary benefit to the organism by supplying it with multiple copies of a gene, thus creating a greater possibility of functional and selectively favored genes. However, tests for enhanced rate and innovation in teleost fishes with duplicated genomes, compared with their close relatives the holostean fishes (without duplicated genomes), found that there was little difference between them for the first 150 million years of their evolution.

In 1997, Wolfe & Shields gave evidence for an ancient duplication of the Saccharomyces cerevisiae (yeast) genome. It was initially noted that this yeast genome contained many individual gene duplications. Wolfe & Shields hypothesized that this was actually the result of an entire genome duplication in the yeast's distant evolutionary history. They found 32 pairs of homologous chromosomal regions, accounting for over half of the yeast's genome. They also noted that although homologs were present, they were often located on different chromosomes. Based on these observations, they determined that Saccharomyces cerevisiae underwent a whole genome duplication soon after its evolutionary split from Kluyveromyces, a genus of ascomycetous yeasts. Over time, many of the duplicate genes were deleted or rendered non-functional. A number of chromosomal rearrangements broke the original duplicate chromosomes into the current arrangement of homologous chromosomal regions. This idea was further supported by the genome of yeast's close relative Ashbya gossypii. Whole genome duplication is common in fungi as well as plant species. An example of extreme genome duplication is the common cordgrass (Spartina anglica), which is a dodecaploid, meaning that it contains 12 sets of chromosomes, in stark contrast to the human diploid structure, in which each individual has only two sets of 23 chromosomes.

Transposable elements

Transposable elements are regions of DNA that can be inserted into the genome through one of two mechanisms. These mechanisms work similarly to the "cut-and-paste" and "copy-and-paste" functions in word processing programs. The "cut-and-paste" mechanism works by excising the element from one place in the genome and inserting it at another location. The "copy-and-paste" mechanism works by making one or more copies of a specific region of DNA and inserting these copies elsewhere in the genome. The most common transposable element in the human genome is the Alu sequence, which is present in the genome over one million times.
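The word-processor analogy can be taken literally in a minimal sketch that treats the genome as a string. The function names and coordinates are hypothetical, and real transposition involves enzymes, target-site preferences and other details that are ignored here.

```python
def cut_and_paste(genome, start, end, target):
    """Excise genome[start:end] and reinsert it at index `target`.

    `target` is an index into the sequence that remains after excision.
    The total length of the genome does not change.
    """
    element = genome[start:end]
    remainder = genome[:start] + genome[end:]
    return remainder[:target] + element + remainder[target:]


def copy_and_paste(genome, start, end, target):
    """Insert a copy of genome[start:end] at index `target`.

    The original element stays in place, so the genome grows by the
    length of the element.
    """
    element = genome[start:end]
    return genome[:target] + element + genome[target:]


if __name__ == "__main__":
    genome = "AAAATTTTCCCC"  # the element is the TTTT block at positions 4-8
    print(cut_and_paste(genome, 4, 8, 8))    # element moved to the end
    print(copy_and_paste(genome, 4, 8, 12))  # element duplicated at the end
```

Because only the copy-and-paste route adds sequence, repeated rounds of it can inflate copy number, which is consistent with an element such as Alu being present more than a million times.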

Mutation

Spontaneous mutations occur frequently and can cause various changes in the genome. Mutations can either change the identity of one or more nucleotides, or result in the addition or deletion of one or more nucleotide bases. Additions or deletions can cause a frameshift mutation, in which all downstream codons are read in a different reading frame from the original, often resulting in a non-functional protein. A mutation in a promoter region, enhancer region or transcription factor binding region can also result in either a loss of function, or an up- or downregulation in the transcription of the gene targeted by these regulatory elements. Mutations are constantly occurring in an organism's genome and can have a negative effect, a positive effect or a neutral effect (no effect at all).
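The difference between a substitution and a frameshift can be seen by translating a short sequence before and after each change. The sketch below builds the standard genetic code from its conventional TCAG ordering; the example sequence and the helper name are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order of codon bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from position 0 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


original = "ATGGCTGAAAAGCTGTAA"                   # ATG GCT GAA AAG CTG TAA
substitution = original[:8] + "T" + original[9:]  # GAA -> GAT, one codon changes
deletion = original[:3] + original[4:]            # one base lost after ATG

print(translate(original))      # MAEKL
print(translate(substitution))  # MADKL
print(translate(deletion))      # MLKSC
```

The substitution alters a single amino acid, whereas the one-base deletion changes every residue after the start codon and also removes the in-frame stop codon.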

Pseudogenes

The proS loci in Mycobacterium leprae and M. tuberculosis, showing 3 pseudogenes (indicated by crosses) in M. leprae that still represent functional genes in M. tuberculosis. Homologous genes are indicated by identical colors and vertical, hatched bars. Modified after Cole et al. 2001.

Often a result of spontaneous mutation, pseudogenes are dysfunctional genes derived from previously functional gene relatives. There are many mechanisms by which a functional gene can become a pseudogene, including the deletion or insertion of one or multiple nucleotides. This can shift the reading frame so that the gene no longer codes for the expected protein. Other routes include the introduction of a premature stop codon or a mutation in the promoter region.

Often cited examples of pseudogenes within the human genome include the once functional olfactory gene families. Over time, many olfactory genes in the human genome became pseudogenes and were no longer able to produce functional proteins, explaining the poor sense of smell humans possess in comparison to their mammalian relatives.

Similarly, bacterial pseudogenes commonly arise from adaptation of free-living bacteria to parasitic lifestyles, so that many metabolic genes become superfluous as these species become adapted to their host. Once a parasite obtains nutrients (such as amino acids or vitamins) from its host it has no need to produce these nutrients itself and often loses the genes to make them.

Exon shuffling

Exon shuffling is a mechanism by which new genes are created. This can occur when two or more exons from different genes are combined or when exons are duplicated. Exon shuffling results in new genes by altering the current intron-exon structure. This can occur by any of the following processes: transposon mediated shuffling, sexual recombination or non-homologous recombination (also called illegitimate recombination). Exon shuffling may introduce new genes into the genome that can be either selected against and deleted or selectively favored and conserved.

Genome reduction and gene loss

Many species exhibit genome reduction when subsets of their genes are no longer needed. This typically happens when organisms adapt to a parasitic lifestyle, e.g. when their nutrients are supplied by a host. As a consequence, they lose the genes needed to produce these nutrients. In many cases, there are both free-living and parasitic species that can be compared and their lost genes identified. Good examples are the genomes of Mycobacterium tuberculosis and Mycobacterium leprae, the latter of which has a dramatically reduced genome (see figure under pseudogenes above).

Endosymbiont species provide another good example. For instance, Polynucleobacter necessarius was first described as a cytoplasmic endosymbiont of the ciliate Euplotes aediculatus. The latter species dies soon after being cured of the endosymbiont. In the few cases in which P. necessarius is not present, a different and rarer bacterium apparently supplies the same function. No attempt to grow symbiotic P. necessarius outside their hosts has yet been successful, strongly suggesting that the relationship is obligate for both partners. Yet closely related free-living relatives of P. necessarius have been identified. The endosymbionts have a significantly reduced genome when compared to their free-living relatives (1.56 Mbp vs. 2.16 Mbp).

Speciation

Cichlids such as Tropheops tropheops from Lake Malawi provide models for genome evolution.

A major question of evolutionary biology is how genomes change to create new species. Speciation requires changes in behavior, morphology, physiology, or metabolism (or combinations thereof). The evolution of genomes during speciation has been studied only very recently, with the availability of next-generation sequencing technologies. For instance, cichlid fish in African lakes differ both morphologically and in their behavior. The genomes of 5 species have revealed that both the sequences and the expression patterns of many genes have changed quickly over a relatively short period of time (100,000 to several million years). Notably, 20% of duplicate gene pairs have gained a completely new tissue-specific expression pattern, indicating that these genes also obtained new functions. Given that gene expression is driven by short regulatory sequences, this suggests that relatively few mutations are required to drive speciation. The cichlid genomes also showed increased evolutionary rates in microRNAs, which are involved in regulating gene expression.

Gene expression

Mutations can lead to changed gene function or, probably more often, to changed gene expression patterns. A study of 12 animal species provided strong evidence that tissue-specific gene expression was largely conserved between orthologs in different species. Paralogs within the same species, however, often have different expression patterns. That is, after duplication, genes often change their expression pattern, for instance by becoming expressed in another tissue and thereby adopting new roles.

Composition of nucleotides (GC content)

Genomic DNA is made up of sequences of four nucleotide bases: adenine, guanine, cytosine, and thymine, commonly referred to as A, G, C, and T. The GC-content is the percentage of G and C bases within a genome, and it varies greatly between different organisms. Gene coding regions have been shown to have a higher GC-content, and the longer the gene, the greater its percentage of G and C bases. A higher GC-content is thought to confer a benefit because a guanine-cytosine base pair is held together by three hydrogen bonds, whereas an adenine-thymine base pair is held together by only two; the additional hydrogen bond gives greater stability to the DNA double strand. It is therefore not surprising that important genes often have a higher GC-content than other parts of an organism's genome. For the same reason, many species living at very high temperatures, such as those in the ecosystems surrounding hydrothermal vents, have a very high GC-content. High GC-content is also seen in regulatory sequences such as promoters, which signal the start of a gene. Many promoters contain CpG islands, regions of the genome in which a cytosine nucleotide is followed by a guanine nucleotide more frequently than elsewhere. It has also been shown that a broad distribution of GC-content among the species within a genus indicates a more ancient ancestry: because the species have had more time to evolve, their GC-contents have diverged further.
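As a rough illustration of these definitions, the following Python sketch computes GC-content as a percentage, together with an observed/expected CpG ratio of the kind commonly used to flag candidate CpG islands. The function names are hypothetical, and the observed/expected formula is an assumption brought in for illustration; it is not taken from the text above.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Assumed formula for illustration; returns 0.0 when the sequence
    contains no C or no G, since the ratio is then undefined.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


if __name__ == "__main__":
    example = "ATGC"  # made-up sequence, not real genomic data
    print(gc_content(example))   # two of the four bases are G or C
    print(cpg_obs_exp(example))
```

In practice such measures are computed over sliding windows along a genome rather than on a single short string; the sketch only shows the arithmetic.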

Evolving translation of genetic code

Amino acids are encoded by codons that are three bases long, and both glycine (GGN) and alanine (GCN) are encoded by codons with guanine or cytosine at the first two positions. These GC base pairs give more stability to the DNA structure. It has been hypothesized that, because the first organisms evolved in a high-temperature, high-pressure environment, they needed the stability of these GC base pairs in their genetic code.
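The codon property described above can be checked directly against the standard genetic code. The sketch below is illustrative only: it assumes the standard codon table (written in the compact form used for NCBI translation table 1, with bases ordered T, C, A, G), and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (assumption: NCBI translation table 1; "*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(aa: str) -> list[str]:
    """Return all codons (DNA alphabet) that encode the one-letter amino acid."""
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


def gc_at_first_two(codon: str) -> bool:
    """True if both the first and second codon positions are G or C."""
    return codon[0] in "GC" and codon[1] in "GC"


def fully_gc_prefixed() -> list[str]:
    """Amino acids for which every codon has G or C at the first two positions."""
    return sorted(
        aa
        for aa in set(AMINO_ACIDS)
        if aa != "*" and all(gc_at_first_two(c) for c in codons_for(aa))
    )


if __name__ == "__main__":
    print(codons_for("G"))  # glycine codons
    print(codons_for("A"))  # alanine codons
    print(fully_gc_prefixed())
```

Under the standard code, this check returns glycine and alanine, and also proline (CCN), which the text above does not discuss.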

De novo origin of genes

Novel genes can arise from non-coding DNA. De novo origin of (protein-coding) genes requires only two features: the generation of an open reading frame and the creation of a transcription factor binding site. For instance, Levine and colleagues reported the origin of five new genes in the D. melanogaster genome from noncoding DNA. Subsequently, de novo origin of genes has also been shown in other organisms such as yeast, rice and humans. For instance, Wu et al. (2011) reported 60 putative de novo human-specific genes, all of which are short and all but one of which consist of a single exon. In bacteria, 'grounded' prophages (i.e. integrated phages that cannot produce new phage) are buffer zones that would tolerate variation, thereby increasing the probability of de novo gene formation. These grounded prophages and other such genetic elements are also sites where genes could be acquired through horizontal gene transfer (HGT).
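The first of these two features, an open reading frame, can be sketched computationally. The following Python example is a minimal illustration and not the method used in the studies cited above: it scans only the forward strand, treats ATG as the sole start codon, and uses a hypothetical `min_codons` threshold. Identifying a real de novo gene would also require evidence of transcription and translation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons`
    (an illustrative parameter) counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATAGCC"  # made-up sequence for illustration
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

A single point mutation that creates an ATG or removes a premature stop codon can change the result of such a scan, which is one way to picture how an open reading frame might appear in previously non-coding DNA.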

Origin of life and the first genomes

To understand how the genome arose, knowledge is required of the chemical pathways that permit formation of the key building blocks of the genome under plausible prebiotic conditions. According to the RNA world hypothesis, free-floating ribonucleotides were present in the primitive soup. These were the fundamental molecules that combined in series to form the original RNA genome. Molecules as complex as RNA must have arisen from small molecules whose reactivity was governed by physico-chemical processes. RNA is composed of purine and pyrimidine nucleotides, both of which are necessary for reliable information transfer, and thus for Darwinian natural selection and evolution. Nam et al. demonstrated the direct condensation of nucleobases with ribose to give ribonucleosides in aqueous microdroplets, a key step leading to formation of the RNA genome. Becker et al. also presented a plausible prebiotic process, based on wet-dry cycles, for synthesizing pyrimidine and purine ribonucleotides leading to genome formation.

Operator (computer programming)

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Operator_(computer_programmin...