Monday, April 28, 2025

Comparative genomics

From Wikipedia, the free encyclopedia
Whole genome alignment is a typical method in comparative genomics. This alignment of eight Yersinia bacteria genomes reveals 78 locally collinear blocks conserved among all eight taxa. Each chromosome has been laid out horizontally and homologous blocks in each genome are shown as identically colored regions linked across genomes. Regions that are inverted relative to Y. pestis KIM are shifted below a genome's center axis.

Comparative genomics is a branch of biological research that examines genome sequences across a spectrum of species, spanning from humans and mice to a diverse array of organisms from bacteria to chimpanzees. This large-scale holistic approach compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes.[4] Comparison of whole genome sequences provides a highly detailed view of how organisms are related to each other at the gene level. By comparing whole genome sequences, researchers gain insights into genetic relationships between organisms and study evolutionary changes.[2] The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved or common among species, as well as genes that give each organism its unique characteristics. Moreover, these studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms.[4]

The comparative genomic analysis begins with a simple comparison of the general features of genomes such as genome size, number of genes, and chromosome number. Table 1 presents data on several fully sequenced model organisms, and highlights some striking findings. For instance, while the tiny flowering plant Arabidopsis thaliana has a smaller genome than that of the fruit fly Drosophila melanogaster (157 million base pairs v. 165 million base pairs, respectively) it possesses nearly twice as many genes (25,000 v. 13,000). In fact, A. thaliana has approximately the same number of genes as humans (25,000). Thus, a very early lesson learned in the genomic era is that genome size does not correlate with evolutionary status, nor is the number of genes proportionate to genome size.[5]

Table 1: Comparative genome sizes of humans and other model organisms[2]
Organism | Estimated size (base pairs) | Chromosome number | Estimated gene number
Human (Homo sapiens) | 3.1 billion | 46 | 25,000
Mouse (Mus musculus) | 2.9 billion | 40 | 25,000
Bovine (Bos taurus) | 2.86 billion[6] | 60[7] | 22,000[8]
Fruit fly (Drosophila melanogaster) | 165 million | 8 | 13,000
Plant (Arabidopsis thaliana) | 157 million | 10 | 25,000
Roundworm (Caenorhabditis elegans) | 97 million | 12 | 19,000
Yeast (Saccharomyces cerevisiae) | 12 million | 32 | 6,000
Bacteria (Escherichia coli) | 4.6 million | 1 | 3,200
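The point that gene number is not proportionate to genome size can be checked directly from Table 1. The following is a minimal sketch that computes gene density (genes per million base pairs) from the table's estimates; the function names and the dictionary layout are illustrative choices, not part of any genomics library.

```python
# Estimated genome size (base pairs) and gene number, copied from Table 1.
TABLE_1 = {
    "Homo sapiens": (3.1e9, 25_000),
    "Mus musculus": (2.9e9, 25_000),
    "Bos taurus": (2.86e9, 22_000),
    "Drosophila melanogaster": (165e6, 13_000),
    "Arabidopsis thaliana": (157e6, 25_000),
    "Caenorhabditis elegans": (97e6, 19_000),
    "Saccharomyces cerevisiae": (12e6, 6_000),
    "Escherichia coli": (4.6e6, 3_200),
}


def genes_per_megabase(size_bp, gene_count):
    """Return gene density as genes per million base pairs."""
    return gene_count / (size_bp / 1e6)


def density_table(table):
    """Map each organism to its gene density, densest first."""
    densities = {
        name: genes_per_megabase(size, genes)
        for name, (size, genes) in table.items()
    }
    return dict(sorted(densities.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for name, density in density_table(TABLE_1).items():
        print(f"{name:26s} {density:8.1f} genes/Mb")
```

With these estimates the compact microbial genomes come out far denser than the mammalian ones, and A. thaliana is roughly twice as gene-dense as D. melanogaster despite the similar genome size.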

In comparative genomics, synteny is the preserved order of genes on chromosomes of related species indicating their descent from a common ancestor. Synteny provides a framework in which the conservation of homologous genes and gene order is identified between genomes of different species.[9] Synteny blocks are more formally defined as regions of chromosomes between genomes that share a common order of homologous genes derived from a common ancestor.[10][11] Alternative names such as conserved synteny or collinearity have been used interchangeably.[12] Comparisons of genome synteny between and within species have provided an opportunity to study evolutionary processes that lead to the diversity of chromosome number and structure in many lineages across the tree of life;[13][14] early discoveries using such approaches include chromosomal conserved regions in nematodes and yeast,[15][16] evolutionary history and phenotypic traits of extremely conserved Hox gene clusters across animals and MADS-box gene family in plants,[17][18] and karyotype evolution in mammals and plants.[19]
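The definition of a synteny block as a shared order of homologous genes can be illustrated with a small sketch. It assumes one-to-one orthologs, requires genes to be strictly adjacent in both genomes, and ends a block at any gene without an ortholog; published synteny programs are more tolerant of gaps. The function name and the gene labels (a1, a2, and so on) are hypothetical.

```python
def synteny_blocks(order_1, order_2, orthologs, min_genes=2):
    """Find runs of genes that are adjacent and collinear in both genomes.

    order_1, order_2: gene names in chromosomal order for each species.
    orthologs: dict mapping a species-1 gene to its species-2 ortholog.
    Returns (genes_in_species_1, orientation) tuples, where orientation
    is "+" for the same order and "-" for an inverted block.
    """
    position_2 = {gene: index for index, gene in enumerate(order_2)}
    blocks = []
    current, step, prev_pos = [], 0, None

    def flush():
        # Record the current run only if it is long enough to call a block.
        if len(current) >= min_genes:
            blocks.append((list(current), "+" if step > 0 else "-"))

    for gene in order_1:
        pos = position_2.get(orthologs.get(gene))
        extends = (
            pos is not None
            and prev_pos is not None
            and abs(pos - prev_pos) == 1
            and step in (0, pos - prev_pos)
        )
        if extends:
            step = pos - prev_pos
            current.append(gene)
        else:
            flush()
            current = [gene] if pos is not None else []
            step = 0
        prev_pos = pos
    flush()
    return blocks


if __name__ == "__main__":
    order_1 = ["a1", "b1", "c1", "d1", "e1", "f1", "g1"]
    order_2 = ["a2", "b2", "c2", "x2", "g2", "f2", "e2"]
    orthologs = {"a1": "a2", "b1": "b2", "c1": "c2",
                 "e1": "e2", "f1": "f2", "g1": "g2"}
    print(synteny_blocks(order_1, order_2, orthologs))
```

In the toy input, genes a to c keep their order in both species, while e to g appear in reverse order in species 2, so the sketch reports one forward block and one inverted block.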

Furthermore, comparing two genomes not only reveals conserved domains or synteny but also aids in detecting copy number variations, single nucleotide polymorphisms (SNPs), indels, and other genomic structural variations.
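As a small illustration of how a comparison exposes SNPs and indels, the sketch below walks two rows of an existing gapped alignment and lists the differences. It assumes the alignment has already been computed, reports each column separately (so a multi-base indel appears as consecutive single-base events), and does not attempt to detect copy number or larger structural variation. The function name and tuple layout are illustrative.

```python
def list_variants(ref_aligned, alt_aligned):
    """Report SNPs and single-column indels between two aligned rows.

    Both rows must have equal length and use "-" for gaps. Positions are
    0-based coordinates in the ungapped reference sequence.
    """
    if len(ref_aligned) != len(alt_aligned):
        raise ValueError("aligned rows must have the same length")
    variants = []
    ref_pos = 0
    for ref_base, alt_base in zip(ref_aligned, alt_aligned):
        if ref_base == "-" and alt_base == "-":
            continue  # column carries no information
        if ref_base == "-":
            variants.append(("insertion", ref_pos, alt_base))
        elif alt_base == "-":
            variants.append(("deletion", ref_pos, ref_base))
        elif ref_base != alt_base:
            variants.append(("SNP", ref_pos, f"{ref_base}>{alt_base}"))
        if ref_base != "-":
            ref_pos += 1  # advance only when the reference has a base
    return variants


if __name__ == "__main__":
    for variant in list_variants("ACG-TACGT", "ACGGTTC-T"):
        print(variant)
```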

Having begun virtually as soon as the whole genomes of two organisms became available (that is, the genomes of the bacteria Haemophilus influenzae and Mycoplasma genitalium) in 1995, comparative genomics is now a standard component of the analysis of every new genome sequence.[2][20] With the explosion in the number of genome projects due to advances in DNA sequencing technologies, particularly the next-generation sequencing methods of the late 2000s, the field has become more sophisticated, making it possible to deal with many genomes in a single study.[21] Comparative genomics has revealed high levels of similarity between closely related organisms, such as humans and chimpanzees, and, more surprisingly, similarity between seemingly distantly related organisms, such as humans and the yeast Saccharomyces cerevisiae.[22] It has also shown the extreme diversity of gene composition in different evolutionary lineages.[20]

History

See also: History of genomics

Comparative genomics has its roots in the comparison of virus genomes in the early 1980s.[20] For example, small RNA viruses infecting animals (picornaviruses) and those infecting plants (cowpea mosaic virus) were compared and turned out to share significant sequence similarity and, in part, the order of their genes.[23] In 1986, the first comparative genomic study at a larger scale was published, comparing the genomes of varicella-zoster virus and Epstein-Barr virus, which contain more than 100 genes each.[24]

The first complete genome sequence of a cellular organism, that of Haemophilus influenzae Rd, was published in 1995.[25] The second genome sequencing paper was of the small parasitic bacterium Mycoplasma genitalium published in the same year.[26] Starting from this paper, reports on new genomes inevitably became comparative-genomic studies.[20]

Microbial genomes. The first high-resolution whole genome comparison system of microbial genomes of 10-15kbp was developed in 1998 by Art Delcher, Simon Kasif and Steven Salzberg and applied, with their collaborators at the Institute for Genomic Research (TIGR), to the comparison of entire genomes of highly related microbial organisms. The system is called MUMmer and was described in a publication in Nucleic Acids Research in 1999. The system helps researchers to identify large rearrangements, single base mutations, reversals, tandem repeat expansions and other polymorphisms. In bacteria, MUMmer enables the identification of polymorphisms that are responsible for virulence, pathogenicity, and antibiotic resistance. The system was also applied to the Minimal Organism Project at TIGR and subsequently to many other comparative genomics projects.

Eukaryote genomes. Saccharomyces cerevisiae, the baker's yeast, was the first eukaryote to have its complete genome sequence published, in 1996.[27] After the publication of the roundworm Caenorhabditis elegans genome in 1998[15] and the fruit fly Drosophila melanogaster genome in 2000,[28] Gerald M. Rubin and his team published a paper titled "Comparative Genomics of the Eukaryotes", in which they compared the genomes of the eukaryotes D. melanogaster, C. elegans, and S. cerevisiae, as well as the prokaryote H. influenzae.[29] At the same time, Bonnie Berger, Eric Lander, and their team published a paper on whole-genome comparison of human and mouse.[30]

With the publication of the large genomes of vertebrates in the 2000s, including human, the Japanese pufferfish Takifugu rubripes, and mouse, precomputed results of large genome comparisons have been released for downloading or for visualization in a genome browser. Instead of undertaking their own analyses, most biologists can access these large cross-species comparisons and avoid the impracticality caused by the size of the genomes.[31]

Next-generation sequencing methods, which were first introduced in 2007, have produced an enormous amount of genomic data and have allowed researchers to generate multiple (prokaryotic) draft genome sequences at once. These methods can also quickly uncover single-nucleotide polymorphisms, insertions and deletions by mapping unassembled reads against a well annotated reference genome, and thus provide a list of possible gene differences that may be the basis for any functional variation among strains.[21]

Evolutionary principles

Evolution is a defining characteristic of biology, and evolutionary theory is the theoretical foundation of comparative genomics. At the same time, the results of comparative genomics have enriched and developed the theory of evolution to an unprecedented degree. When two or more genome sequences are compared, one can deduce the evolutionary relationships of the sequences in a phylogenetic tree. Based on a variety of biological genome data and the study of vertical and horizontal evolution processes, one can understand vital parts of the gene structure and its regulatory function.

Similarity of related genomes is the basis of comparative genomics. If two organisms have a recent common ancestor, the differences between the two species' genomes evolved from the ancestor's genome. The closer the relationship between two organisms, the higher the similarity between their genomes. If they are closely related, their genomes will display collinearity (synteny), namely some or all of the genetic sequences are conserved in the same order. Thus, genome sequences can be used to identify gene function by analyzing their homology (sequence similarity) to genes of known function.

The human FOXP2 gene and its evolutionary conservation are shown in a multiple alignment (at bottom of figure) in this image from the UCSC Genome Browser. Note that conservation tends to cluster around coding regions (exons).

Orthologous sequences are related sequences in different species: a gene exists in an ancestral species, that species divides into two species, and the genes in the new species are orthologous to the sequence in the ancestral species. Paralogous sequences are separated by gene duplication: if a particular gene in the genome is copied, then the two copies are paralogous to each other. A pair of orthologous sequences is called an orthologous pair (orthologs), and a pair of paralogous sequences is called a paralogous pair (paralogs). Orthologous pairs usually have the same or similar function, which is not necessarily the case for paralogous pairs. In paralogous pairs, the sequences tend to evolve different functions.

Comparative genomics exploits both similarities and differences in the proteins, RNA, and regulatory regions of different organisms to infer how selection has acted upon these elements. Those elements that are responsible for similarities between different species should be conserved through time (stabilizing selection), while those elements responsible for differences among species should be divergent (positive selection). Finally, those elements that are unimportant to the evolutionary success of the organism will be unconserved (selection is neutral).

One of the important goals of the field is the identification of the mechanisms of eukaryotic genome evolution. This task is, however, often complicated by the multiplicity of events that have taken place throughout the history of individual lineages, leaving only distorted and superimposed traces in the genome of each living organism. For this reason, comparative genomics studies of small model organisms (for example the model Caenorhabditis elegans and the closely related Caenorhabditis briggsae) are of great importance for advancing our understanding of general mechanisms of evolution.[32][33]

Role of CNVs in evolution

Comparative genomics plays a crucial role in identifying copy number variations (CNVs) and understanding their significance in evolution. CNVs, which involve deletions or duplications of large segments of DNA, are recognized as a major source of genetic diversity, influencing gene structure, dosage, and regulation. While single nucleotide polymorphisms (SNPs) are more common, CNVs impact larger genomic regions and can have profound effects on phenotype and diversity.[34] Recent studies suggest that CNVs constitute around 4.8–9.5% of the human genome and have a substantial functional and evolutionary impact. In mammals, CNVs contribute significantly to population diversity, influencing gene expression and various phenotypic traits.[35] Comparative genomics analyses of human and chimpanzee genomes have revealed that CNVs may play a greater role in evolutionary change compared to single nucleotide changes. Research indicates that CNVs affect more nucleotides than individual base-pair changes, with about 2.7% of the genome affected by CNVs compared to 1.2% by SNPs. Moreover, while many CNVs are shared between humans and chimpanzees, a significant portion is unique to each species. Additionally, CNVs have been associated with genetic diseases in humans, highlighting their importance in human health. Despite this, many questions about CNVs remain unanswered, including their origin and contributions to evolutionary adaptation and disease. Ongoing research aims to address these questions using techniques like comparative genomic hybridization, which allows for a detailed examination of CNVs and their significance. When investigators examined the raw sequence data of the human and chimpanzee.[36]

Significance of comparative genomics

Comparative genomics holds profound significance across various fields, including medical research, basic biology, and biodiversity conservation. For instance, in medical research, predicting which genomic variants lead to changes in organism-level phenotypes, such as increased disease risk in humans, remains challenging due to the immense size of the genome, comprising about three billion nucleotides.[37][38][39]

To tackle this challenge, comparative genomics offers a solution by pinpointing nucleotide positions that have remained unchanged over millions of years of evolution. These conserved regions indicate potential sites where genetic alterations could have detrimental effects on an organism's fitness, thus guiding the search for disease-causing variants. Moreover, comparative genomics holds promise in unraveling the mechanisms of gene evolution, environmental adaptations, gender-specific differences, and population variations across vertebrate lineages.[40]

Furthermore, comparative studies enable the identification of genomic signatures of selection, that is, regions in the genome that have undergone preferential increase and fixation in populations due to their functional significance in specific processes.[41] For instance, in animal genetics, indigenous cattle exhibit superior disease resistance and environmental adaptability but lower productivity compared to exotic breeds. Through comparative genomic analyses, significant genomic signatures responsible for these unique traits can be identified. Using insights from these signatures, breeders can make informed decisions to enhance breeding strategies and promote breed development.[42]

Methods

Computational approaches are necessary for genome comparisons, given the large amount of data encoded in genomes. Many tools are now publicly available, ranging from whole genome comparisons to gene expression analysis.[43] These include approaches from systems and control, information theory, string analysis and data mining.[44] Computational approaches will remain critical for research and teaching, especially when information science and genome biology are taught in conjunction.[45]

Phylogenetic tree of descendant species and reconstructed ancestors. The branch color represents breakpoint rates in RACFs (breakpoints per million years). Black branches represent nondetermined breakpoint rates. Tip colors depict assembly contiguity: black, scaffold-level genome assembly; green, chromosome-level genome assembly; yellow, chromosome-scale scaffold-level genome assembly. Numbers next to species names indicate diploid chromosome number (if known).[46]

Comparative genomics starts with basic comparisons of genome size and gene density. For instance, genome size is important for coding capacity and possibly for regulatory reasons. High gene density facilitates genome annotation and the analysis of environmental selection. By contrast, low gene density, as in the human genome, hampers the mapping of genetic disease.

Sequence alignment

Alignments are used to capture information about similar sequences such as ancestry, common evolutionary descent, or common structure and function. Alignments can be done for both nucleotide and protein sequences.[47][48] Alignments consist of local or global pairwise alignments, and multiple sequence alignments. One way to find global alignments is to use a dynamic programming algorithm known as the Needleman–Wunsch algorithm, whereas the Smith–Waterman algorithm is used to find local alignments. With the exponential growth of sequence databases and the emergence of longer sequences, there is heightened interest in faster, approximate, or heuristic alignment procedures. Among these, the FASTA and BLAST algorithms are prominent for local pairwise alignment. Programs tailored to aligning lengthy sequences have since been developed, such as MUMmer (1999), BLASTZ (2003), and AVID (2003). While BLASTZ adopts a local approach, MUMmer and AVID are geared towards global alignment. To harness the benefits of both local and global alignment approaches, one effective strategy involves integrating them. Initially, a rapid BLAST-like tool known as BLAT is employed to identify homologous "anchor" regions. These anchors are subsequently scrutinized to identify sets exhibiting conserved order and orientation. Such sets of anchors are then subjected to alignment using a global strategy.
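The difference between global and local dynamic programming can be made concrete with a short sketch. It computes only the optimal score (no traceback) and assumes a simple illustrative scoring scheme of +1 for a match, -1 for a mismatch and -1 per gap position; real aligners typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of any substrings of a and b (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets a local alignment restart anywhere.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The only structural differences are the boundary initialization and the clamp at zero: the global score must pay for every unaligned position, while the local score reports the best-scoring shared region and ignores the flanks.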

Additionally, ongoing efforts focus on optimizing existing algorithms to handle the vast amount of genome sequence data by enhancing their speed. Furthermore, MAVID stands out as another noteworthy alignment program, specifically designed for the multiple alignment of genomic sequences.
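The anchor-based strategy described above (find anchors, then keep a set with conserved order) can be sketched as follows. This is a simplified stand-in, not how BLAT or any named aligner works internally: anchors are exact k-mers that occur once in each sequence, only the forward strand is considered, and the collinear set is chosen as a longest increasing subsequence. Function names are hypothetical.

```python
from bisect import bisect_left


def unique_kmer_anchors(seq_a, seq_b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both sequences."""
    def unique_positions(seq):
        seen = {}
        for i in range(len(seq) - k + 1):
            seen.setdefault(seq[i:i + k], []).append(i)
        return {kmer: hits[0] for kmer, hits in seen.items() if len(hits) == 1}

    pos_a, pos_b = unique_positions(seq_a), unique_positions(seq_b)
    return sorted((pos_a[kmer], pos_b[kmer]) for kmer in pos_a.keys() & pos_b.keys())


def collinear_chain(anchors):
    """Largest subset of anchors whose positions increase in both sequences."""
    anchors = sorted(anchors)
    tails = []       # tails[n]: smallest pos_b that ends a chain of length n + 1
    tail_index = []  # index into anchors of the anchor stored in tails[n]
    previous = [None] * len(anchors)
    for idx, (_, pos_b) in enumerate(anchors):
        n = bisect_left(tails, pos_b)
        if n == len(tails):
            tails.append(pos_b)
            tail_index.append(idx)
        else:
            tails[n] = pos_b
            tail_index[n] = idx
        previous[idx] = tail_index[n - 1] if n > 0 else None
    # Walk back from the end of the longest chain.
    chain = []
    idx = tail_index[-1] if tail_index else None
    while idx is not None:
        chain.append(anchors[idx])
        idx = previous[idx]
    return chain[::-1]


if __name__ == "__main__":
    print(unique_kmer_anchors("ACGTTGCA", "TTACGTAA", 4))
    print(collinear_chain([(0, 10), (5, 2), (8, 4), (12, 6), (20, 1)]))
```

The regions between consecutive anchors in the returned chain are the pieces that would then be handed to a global aligner.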

Pairwise comparison: The pairwise comparison of genomic sequence data is widely utilized in comparative gene prediction. Many studies in comparative functional genomics lean on pairwise comparisons, wherein traits of each gene are compared with traits of other genes across species. This method yields many more comparisons than unique observations, making each comparison dependent on others.[49][50]
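The remark that pairwise designs produce more comparisons than observations follows from simple counting: n observations give n(n-1)/2 unordered pairs, and each observation is reused in n-1 of them. A minimal sketch, with illustrative function names and an arbitrary species list:

```python
from itertools import combinations


def pairwise_comparisons(observations):
    """Return every unordered pair of observations."""
    return list(combinations(observations, 2))


def comparisons_sharing(observation, pairs):
    """Count how many pairs reuse one given observation."""
    return sum(observation in pair for pair in pairs)


if __name__ == "__main__":
    species = ["human", "mouse", "cattle", "fruit fly", "yeast"]
    pairs = pairwise_comparisons(species)
    print(len(species), "observations,", len(pairs), "comparisons")
    print("comparisons involving human:", comparisons_sharing("human", pairs))
```

Because the same observation enters several comparisons, the comparisons are not statistically independent, which is the dependence the paragraph above refers to.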

Multiple comparisons: The comparison of multiple genomes is a natural extension of pairwise inter-specific comparisons. Such comparisons typically aim to identify conserved regions across two phylogenetic scales: 1. Deep comparisons, often referred to as phylogenetic footprinting,[51] reveal conservation across higher taxonomic units like vertebrates.[52] 2. Shallow comparisons, recently termed phylogenetic shadowing,[53] probe conservation across a group of closely related species.
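Both footprinting and shadowing come down to finding alignment columns that vary less than expected. The sketch below uses a deliberately crude measure, the fraction of sequences sharing the most common base in each column, and then reports runs of fully conserved columns. It assumes a precomputed multiple alignment and ignores phylogeny, which real methods model explicitly; the names and thresholds are illustrative.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of rows carrying the most common non-gap character."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


def conserved_runs(scores, threshold=1.0, min_length=3):
    """Return half-open (start, end) intervals of columns scoring >= threshold."""
    runs, start = [], None
    # The sentinel value closes a run that reaches the end of the alignment.
    for i, value in enumerate(list(scores) + [float("-inf")]):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    alignment = ["ACGTACGT", "ACGTTCGA", "ACGAACG-"]
    scores = column_conservation(alignment)
    print(conserved_runs(scores))
```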

Chromosome by chromosome variation of indicine and taurine cattle. The genomic structural differences on chromosome X between indicine (Bos indicus Nelore cattle) and taurine cattle (Bos taurus Hereford cattle) were identified using the SyRI tool.

Whole-genome alignment

Whole-genome alignment (WGA) involves predicting evolutionary relationships at the nucleotide level between two or more genomes. It integrates elements of colinear sequence alignment and gene orthology prediction, presenting a greater challenge due to the vast size and intricate nature of whole genomes. Despite its complexity, numerous methods have emerged to tackle this problem because WGAs play a crucial role in various genome-wide analyses, such as phylogenetic inference, genome annotation, and function prediction.[54] SyRI (Synteny and Rearrangement Identifier) is one such method; it uses whole-genome alignment and is designed to identify both structural and sequence differences between two whole-genome assemblies. Taking WGAs as input, SyRI first scans for differences in genome structure. It then identifies local sequence variations within both rearranged and non-rearranged (syntenic) regions.[55]

Example of a phylogenetic tree created from an alignment of 250 unique spike protein sequences from the Betacoronavirus family.

Phylogenetic reconstruction

Another computational method for comparative genomics is phylogenetic reconstruction. It is used to describe evolutionary relationships in terms of common ancestors. The relationships are usually represented in a tree called a phylogenetic tree. Similarly, coalescent theory is a retrospective model that traces alleles of a gene in a population to a single ancestral copy shared by members of the population. This copy is also known as the most recent common ancestor. Analysis based on coalescent theory tries to predict the amount of time between the introduction of a mutation and a particular allele or gene distribution in a population. This time period is equal to how long ago the most recent common ancestor existed. The inheritance relationships are visualized in a form similar to a phylogenetic tree. Coalescence (or the gene genealogy) can be visualized using dendrograms.[56]
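The text does not name a specific tree-building algorithm, so the following sketch uses UPGMA, one of the simplest distance-based clustering methods, purely as an illustration. It assumes a complete matrix of pairwise distances (here obtainable from the hypothetical helper p_distance on aligned sequences) and a constant rate of change across lineages, which real data often violate; it returns only the tree topology as nested tuples, without branch lengths.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def upgma(names, distances):
    """Cluster taxa by repeatedly merging the two closest clusters.

    names: list of taxon labels.
    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the topology as nested tuples, e.g. ("D", ("C", ("A", "B"))).
    """
    dist = dict(distances)
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Distance to the merged cluster is the size-weighted average.
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    matrix = {
        frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
        frozenset(("B", "C")): 6, frozenset(("A", "D")): 10,
        frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
    }
    print(upgma(names, matrix))
```

The nested tuples can be read like a dendrogram: the innermost pair was merged first and therefore represents the most closely related taxa under the stated assumptions.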

Example of synteny block and break. Genes located on chromosomes of two species are denoted in letters. Each gene is associated with a number representing the species they belong to (species 1 or 2). Orthologous genes are connected by dashed lines and genes without an orthologous relationship are treated as gaps in synteny programs.[57]

Genome maps

An additional method in comparative genomics is genetic mapping. In genetic mapping, visualizing synteny is one way to see the preserved order of genes on chromosomes. It is usually used for chromosomes of related species, both of which result from a common ancestor.[58] This and other methods can shed light on evolutionary history. A recent study used comparative genomics to reconstruct 16 ancestral karyotypes across the mammalian phylogeny. The computational reconstruction showed how chromosomes rearranged themselves during mammal evolution. It gave insight into conservation of select regions often associated with the control of developmental processes. In addition, it helped to provide an understanding of chromosome evolution and genetic diseases associated with DNA rearrangements.[citation needed]

Solid green squares indicate mammalian chromosomes maintained as a single synteny block (either as a single chromosome or fused with another MAM), with shades of the color indicating the fraction of the chromosome affected by intrachromosomal rearrangements (the lightest shade is most affected). Split blocks demarcate mammalian chromosomes affected by interchromosomal rearrangements. Upper (green) triangles show the fraction of the chromosome affected by intrachromosomal rearrangements, and lower (red) triangles show the fraction affected by interchromosomal rearrangements. Syntenic relationships of each MAM to the human genome are given at the right of the diagram. MAMX appears split in goat because its X chromosome is assembled as two separate fragments. BOR, boreoeutherian ancestor chromosome; EUA, Euarchontoglires ancestor chromosome; EUC, Euarchonta ancestor chromosome; EUT, eutherian ancestor chromosome; PMT, Primatomorpha ancestor chromosome; PRT, primates (Hominidae) ancestor chromosome; THE, therian ancestor chromosome.
Image from the study Evolution of the ancestral mammalian karyotype and syntenic regions. It is a visualization of the evolutionary history of reconstructed mammalian chromosomes based on the human lineage.[46]

Tools

Computational tools for analyzing sequences and complete genomes are developing quickly due to the availability of large amounts of genomic data. At the same time, comparative analysis tools are being improved. Among the challenges of these analyses, visualizing the comparative results is very important.[59]

Visualizing sequence conservation is a difficult task in comparative sequence analysis, because it is highly inefficient to examine the alignment of long genomic regions manually. Internet-based genome browsers provide many useful tools for investigating genomic sequences because they integrate all sequence-based biological information on genomic regions. When large amounts of relevant biological data must be extracted, they can be easy to use and less time-consuming.[59]

  • UCSC Browser: This site contains the reference sequence and working draft assemblies for a large collection of genomes.[60]
  • Ensembl: The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.[61]
  • MapView: The Map Viewer provides a wide variety of genome mapping and sequencing data.[62]
  • VISTA is a comprehensive suite of programs and databases for comparative analysis of genomic sequences. It was built to visualize the results of comparative analysis based on DNA alignments. The presentation of comparative data generated by VISTA can easily suit both small and large scale of data.[63]
  • BlueJay Genome Browser: A stand-alone visualization tool for the multi-scale viewing of annotated genomes and other genomic elements.[64]
  • SyRI: SyRI stands for Synteny and Rearrangement Identifier and is a versatile tool for comparative genomics, offering functionalities for synteny analysis and visualization, aiding in the prediction of genomic differences between related genomes using whole-genome assemblies.[65]
  • Synmap2: Specifically designed for synteny mapping, Synmap2 efficiently compares genetic maps or assemblies, providing insights into genome evolution and rearrangements among related organisms.[66]
  • GSAlign: GSAlign facilitates accurate alignment of genomic sequences, particularly useful for large-scale comparative genomics studies, enabling researchers to identify similarities and differences across genomes.[67]
  • IGV (Integrative Genomics Viewer): A widely-used tool for visualizing and analyzing genomic data, IGV supports comparative genomics by enabling users to explore alignments, variants, and annotations across multiple genomes.[68]
  • Manta: Manta is a rapid structural variant caller, crucial for comparative genomics as it detects genomic rearrangements such as insertions, deletions, inversions, and duplications, aiding in understanding genetic variation among populations or species.[69]
  • CNVnator: CNVnator specializes in detecting copy number variations (CNVs), which are crucial in understanding genome evolution and population genetics, providing insights into genomic structural changes across different organisms.[70]
  • PIPMaker: PIPMaker facilitates the alignment and comparison of two genomic sequences, enabling the identification of conserved regions, duplications, and evolutionary breakpoints, aiding in comparative genomics analyses.[71]
  • GLASS (Genome-wide Location and Sequence Searcher): GLASS is a tool for identifying conserved regulatory elements across genomes, crucial for comparative genomics studies focusing on understanding gene regulation and evolution.[72]
  • PatternHunter: PatternHunter is a versatile tool for sequence analysis, offering functionalities for identifying conserved patterns, motifs, and repeats across genomic sequences, aiding in comparative genomics studies of gene families and regulatory elements.
  • MUMmer: MUMmer is a suite of tools for whole-genome alignment and comparison, widely used in comparative genomics for identifying similarities, differences, and evolutionary events among genomes at various scales.[73]

An advantage of using online tools is that these websites are being developed and updated constantly. Many new settings and much new content can be used online to improve efficiency.[59]

Selected applications

Agriculture

Agriculture is a field that reaps the benefits of comparative genomics. Identifying the loci of advantageous genes is a key step in breeding crops that are optimized for greater yield, cost-efficiency, quality, and disease resistance. For example, one genome wide association study conducted on 517 rice landraces revealed 80 loci associated with several categories of agronomic performance, such as grain weight, amylose content, and drought tolerance. Many of the loci were previously uncharacterized.[74] Not only is this methodology powerful, it is also quick. Previous methods of identifying loci associated with agronomic performance required several generations of carefully monitored breeding of parent strains, a time-consuming effort that is unnecessary for comparative genomic studies.[75]

Medicine

Vaccine development

The medical field also benefits from the study of comparative genomics. In an approach known as reverse vaccinology, researchers can discover candidate antigens for vaccine development by analyzing the genome of a pathogen or a family of pathogens.[76] Applying a comparative genomics approach by analyzing the genomes of several related pathogens can lead to the development of vaccines that are multi-protective. A team of researchers employed such an approach to create a universal vaccine for Group B Streptococcus, a group of bacteria responsible for severe neonatal infection.[77] Comparative genomics can also be used to generate specificity for vaccines against pathogens that are closely related to commensal microorganisms. For example, researchers used comparative genomic analysis of commensal and pathogenic strains of E. coli to identify pathogen-specific genes as a basis for finding antigens that result in immune response against pathogenic strains but not commensal ones.[78] In May 2019, using the Global Genome Set, a team in the UK and Australia sequenced thousands of globally-collected isolates of Group A Streptococcus, providing potential targets for developing a vaccine against the pathogen, also known as S. pyogenes.[79]

Personalized Medicine

Personalized medicine, enabled by comparative genomics, represents a revolutionary approach in healthcare, tailoring medical treatment and disease prevention to the individual patient's genetic makeup.[80] By analyzing genetic variations across populations and comparing them with an individual's genome, clinicians can identify specific genetic markers associated with disease susceptibility, drug metabolism, and treatment response. By identifying genetic variants associated with drug metabolism pathways, drug targets, and adverse reactions, personalized medicine can optimize medication selection, dosage, and treatment regimens for individual patients. This approach minimizes the risk of adverse drug reactions, enhances treatment efficacy, and improves patient outcomes.

Cancer

Cancer genomics is a field within oncology that leverages comparative genomics to improve cancer diagnosis, treatment, and prevention strategies. Comparative genomics plays a crucial role in cancer research by identifying driver mutations and by providing comprehensive analyses of mutations, copy number alterations, structural variants, gene expression, and DNA methylation profiles in large-scale studies across different cancer types. By analyzing the genomes of cancer cells and comparing them with healthy cells, researchers can uncover key genetic alterations driving tumorigenesis, tumor progression, and metastasis. This deep understanding of the genomic landscape of cancer has profound implications for precision oncology. Moreover, comparative genomics is instrumental in elucidating mechanisms of drug resistance, a major challenge in cancer treatment.

TCR loci from humans (H, top) and mice (M, bottom) are compared, with non-TCR genes in purple, V segments in orange, and other TCR elements in red. M6A, a putative methyltransferase; ZNF, a zinc-finger protein; OR, olfactory receptor genes; DAD1, defender against cell death. The sites of species-specific, processed pseudogenes are shown by gray triangles. See also GenBank accession numbers AE000658-62. Modified after Glusman et al. 2001.[81]

Mouse models in immunology

T cells (also known as T lymphocytes or thymocytes) are immune cells that develop from stem cells in the bone marrow. They help defend the body from infection and may aid in the fight against cancer. Because of their morphological, physiological, and genetic resemblance to humans, mice and rats have long been the preferred animal models for biomedical research. Comparative medicine research is built on the ability to use information from one species to understand the same processes in another. Comparing human and mouse T cells, and their effects on the immune system, using comparative genomics can provide new insights into molecular pathways. To characterize TCRs and their genes, Glusman et al. sequenced the human and mouse T cell receptor loci. TCR genes are well characterized and serve as a significant resource for supporting functional genomics and understanding how genes and intergenic regions of the genome contribute to biological processes.[81]

T-cell receptors are central to how the cellular immune system recognizes pathogens. One of the reasons for sequencing the human and mouse TCR loci was to match the orthologous gene family sequences and discover conserved areas using comparative genomics. These, it was thought, would reflect two sorts of biological information: (1) exons and (2) regulatory sequences. In fact, the majority of V, D, J, and C exons could be identified in this way. The variable regions are encoded by multiple unique DNA elements that are rearranged and connected during T cell (TCR) differentiation: variable (V), diversity (D), and joining (J) elements for the β and δ polypeptides; and V and J elements for the α and γ polypeptides.[Figure 1] However, several short noncoding conserved blocks of the genome were also found. Both human and mouse motifs are largely clustered in the 200 bp [Figure 2], the known 3′ enhancers in the TCR/ were identified, and a conserved region of 100 bp in the mouse J intron was subsequently shown to have a regulatory function.

[Figure 2] Gene structure of the human (top) and mouse (bottom) V, D, J, and C gene segments. The arrows represent the transcriptional direction of each TCR gene. The squares and circles represent the direct and reverse orientations, respectively. Modified after Glusman et al. 2001.[81]

Comparisons of the genomic sequences within each locus (the physical site or location of a specific gene on a chromosome) and across species allow for research on other mechanisms and other regulatory signals. Some comparisons suggest new hypotheses about the evolution of TCRs, to be tested (and improved) by comparison to the TCR gene complement of other vertebrate species. A comparative genomic investigation of humans and mice will allow for the discovery and annotation of many other genes, as well as the identification of regulatory sequences in other species.[81]
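The search for conserved blocks between orthologous human and mouse sequences described in this section can be illustrated with a sliding-window identity scan. This is a minimal sketch, assuming the two sequences have already been aligned to equal length with "-" marking gaps. The function name, window size, and threshold are illustrative assumptions and are not the method used by Glusman et al.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.9):
    """Return (start, identity) for each window whose identity meets the threshold.

    aligned_a and aligned_b are pre-aligned sequences of equal length;
    gap characters ("-") never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy example: the first ten columns are identical, the last five differ.
seq_h = "ACGTACGTAC" + "TTTTT"
seq_m = "ACGTACGTAC" + "GGGGG"
hits = conserved_windows(seq_h, seq_m, window=10, min_identity=1.0)
```

In practice, overlapping windows that pass the threshold would be merged into blocks, and the blocks would then be checked against known exons to separate coding from candidate regulatory sequence.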

Research

Comparative genomics also opens up new avenues in other areas of research. As DNA sequencing technology has become more accessible, the number of sequenced genomes has grown. With the increasing reservoir of available genomic data, the potency of comparative genomic inference has grown as well.

A notable case of this increased potency is found in recent primate research. Comparative genomic methods have allowed researchers to gather information about genetic variation, differential gene expression, and evolutionary dynamics in primates that were indiscernible using previous data and methods.[82]

Great Ape Genome Project

The Great Ape Genome Project used comparative genomic methods to investigate genetic variation with reference to the six great ape species, finding healthy levels of variation in their gene pool despite shrinking population size.[83] Another study showed that patterns of DNA methylation, which are a known regulation mechanism for gene expression, differ in the prefrontal cortex of humans versus chimpanzees, and implicated this difference in the evolutionary divergence of the two species.

Neuroepigenetics

From Wikipedia, the free encyclopedia

Neuroepigenetics is the study of how epigenetic changes to genes affect the nervous system. These changes may affect underlying conditions such as addiction, cognition, and neurological development.

Mechanisms

Neuroepigenetic mechanisms regulate gene expression in the neuron. Often, these changes take place due to recurring stimuli. Neuroepigenetic mechanisms involve proteins or protein pathways that regulate gene expression by adding, editing or reading epigenetic marks such as methylation or acetylation. Some of these mechanisms include ATP-dependent chromatin remodeling, LINE1, and prion protein-based modifications. Other silencing mechanisms include the recruitment of specialized proteins that methylate DNA such that the core promoter element is inaccessible to transcription factors and RNA polymerase. As a result, transcription is no longer possible. One such protein pathway is the REST co-repressor complex pathway. There are also several non-coding RNAs that regulate neural function at the epigenetic level. These mechanisms, along with neural histone methylation, affect the arrangement of synapses and neuroplasticity, and play a key role in learning and memory.

Methylation

DNA methyltransferases (DNMTs) are involved in regulation of the electrophysiological landscape of the brain through methylation of CpGs. Several studies have shown that inhibition or depletion of DNMT1 activity during neural maturation leads to hypomethylation of the neurons by removing the cell's ability to maintain methylation marks in the chromatin. This gradual loss of methylation marks leads to changes in the expression of crucial developmental genes that may be dosage sensitive, leading to neural degeneration. This was observed in the mature neurons in the dorsal portion of the mouse prosencephalon, where there were significantly greater amounts of neural degeneration and poor neural signaling in the absence of DNMT1. Despite poor survival rates amongst the DNMT1-depleted neurons, some of the cells persisted throughout the lifespan of the organism. The surviving cells reaffirmed that the loss of DNMT1 led to hypomethylation in the neural cell genome. These cells also exhibited poor neural functioning. In fact, a global loss of neural functioning was also observed in these model organisms, with the greatest amounts of neural degeneration occurring in the prosencephalon.

Other studies showed a similar trend for DNMT3a and DNMT3b. However, unlike DNMT1, these DNMTs add new methyl marks to unmethylated DNA. As with DNMT1, the loss of DNMT3a and 3b resulted in neuromuscular degeneration two months after birth, as well as poor survival rates amongst the progeny of the mutant cells, even though DNMT3a does not regularly function to maintain methylation marks. This conundrum was addressed by other studies which recorded rare loci in mature neurons where DNMT3a acted as a maintenance DNMT. The Gfap locus, which codes for the formation and regulation of the cytoskeleton of astrocytes, is one such locus where this activity is observed. The gene is regularly methylated to downregulate glioma-related cancers. DNMT inhibition leads to decreased methylation and increased synaptic activity. Several studies show that the methylation-related increase or decrease in synaptic activity occurs due to the upregulation or downregulation of receptors at the neurological synapse. Such receptor regulation plays a major role in many important mechanisms, such as the 'fight or flight' response. The glucocorticoid receptor (GR) is the most studied of these receptors. During stressful circumstances, there is a signaling cascade that begins from the pituitary gland and terminates due to a negative feedback loop from the adrenal gland. In this loop, an increase in the levels of the stress response hormone results in an increase in GR. An increase in GR results in a decrease in cellular response to the hormone levels. It has been shown that methylation of the I7 exon within the GR locus leads to a lower level of basal GR expression in mice. These mice were more susceptible to high levels of stress than mice with lower levels of methylation at the I7 exon. Up-regulation or down-regulation of receptors through methylation leads to a change in the synaptic activity of the neuron.

Hypermethylation, CpG islands, and tumor suppressing genes

CpG islands (CGIs) are regulatory elements that can influence gene expression by allowing or interfering with transcription initiation or enhancer activity. CGIs are generally interspersed with the promoter regions of the genes they affect and may also affect more than one promoter region. In addition, they may include enhancer elements and be separate from the transcription start site. Hypermethylation at key CGIs can effectively silence expression of tumor suppressing genes and is common in gliomas. Tumor suppressing genes are those which inhibit a cell's progression towards cancer. These genes are commonly associated with important functions which regulate cell-cycle events. For example, the PI3K and p53 pathways are affected by CGI promoter hypermethylation; this includes the promoters of the genes CDKN2/p16, RB, PTEN, TP53 and p14ARF. Importantly, glioblastomas are known to have a high frequency of methylation at CGIs/promoter sites. For example, Epithelial Membrane Protein 3 (EMP3) is a gene which is involved in cell proliferation as well as cellular interactions. It is also thought to function as a tumor suppressor, and in glioblastomas is shown to be silenced via hypermethylation. Furthermore, introduction of the gene into EMP3-silenced neuroblasts results in reduced colony formation as well as suppressed tumor growth. In contrast, hypermethylation of promoter sites can also inhibit activity of oncogenes and prevent tumorigenesis. Oncogenic pathways such as the transforming growth factor (TGF)-beta signaling pathway stimulate cells to proliferate. In glioblastomas the overactivity of this pathway is associated with aggressive forms of tumor growth. Hypermethylation of PDGF-B, a TGF-beta target, inhibits uncontrolled proliferation.
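CpG islands are commonly located computationally from sequence composition alone. The following is a minimal sketch, assuming the widely used Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, and an observed/expected CpG ratio above 0.6). These thresholds are not stated in the text above, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content, and obs/exp thresholds to one window."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


# Toy sequences: a pure CG repeat passes, a pure AT repeat does not.
cg_rich = "CG" * 100
at_rich = "AT" * 100
```

A genome-scale scan would slide such a window along each chromosome and merge adjacent passing windows. Note that this identifies CpG islands by sequence only and says nothing about whether they are methylated.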

Hypomethylation and aberrant histone modification

Global reduction in methylation is implicated in tumorigenesis. More specifically, widespread CpG demethylation, contributing to global hypomethylation, is known to cause genomic instability leading to development of tumors. An important effect of this DNA modification is its transcriptional activation of oncogenes. For example, expression of MAGEA1 enhanced by hypomethylation interferes with p53 function.

Aberrant patterns of histone modifications can also take place at specific loci and ultimately manipulate gene activity. In terms of CGI promoter sites, methylation and loss of acetylation occur frequently at H3K9. Furthermore, H3K9 dimethylation and trimethylation are repressive marks which, along with bivalent differentially methylated domains, are hypothesized to make tumor suppressing genes more susceptible to silencing. Abnormal presence or lack of methylation in glioblastomas is strongly linked to genes which regulate apoptosis, DNA repair, cell proliferation, and tumor suppression. One of the best known examples of genes affected by aberrant methylation that contributes to formation of glioblastomas is MGMT, a gene involved in DNA repair which encodes the protein O6-methylguanine-DNA methyltransferase. Methylation of the MGMT promoter is an important predictor of the effectiveness of alkylating agents to target glioblastomas. Hypermethylation of the MGMT promoter causes transcriptional silencing and is found in several cancer types including glioma, lymphoma, breast cancer, prostate cancer, and retinoblastoma.

Neuroplasticity

Neuroplasticity refers to the ability of the brain to undergo synaptic rearrangement as a response to recurring stimuli. Neurotrophin proteins play a major role in synaptic rearrangement, amongst other factors. Depletion of the neurotrophin BDNF or of BDNF signaling is one of the main factors in developing diseases such as Alzheimer's disease, Huntington's disease, and depression. Neuroplasticity can also occur as a consequence of targeted epigenetic modifications such as methylation and acetylation. Exposure to certain recurring stimuli leads to demethylation of particular loci and remethylation in a pattern that leads to a response to that particular stimulus. Like the histone readers, erasers and writers also modify histones, by removing and adding modifying marks respectively. An eraser, neuroLSD1, is a modified version of the original Lysine Demethylase 1 (LSD1) that exists only in neurons and assists with neuronal maturation. Although both versions of LSD1 share the same target, their expression patterns are vastly different and neuroLSD1 is a truncated version of LSD1. NeuroLSD1 increases the expression of immediate early genes (IEGs) involved in cell maturation. Recurring stimuli lead to differential expression of neuroLSD1, leading to rearrangement of loci. The eraser is also thought to play a major role in the learning of many complex behaviors and is a way through which genes interact with the environment.

Neurodegenerative diseases

Alzheimer's disease

Alzheimer's disease (AD) is a neurodegenerative disease known to progressively affect memory and cause cognitive decline. Epigenetic modifications both globally and on specific candidate genes are thought to contribute to the etiology of this disease. Immunohistochemical analysis of post-mortem brain tissues across several studies has revealed global decreases in both 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) in AD patients compared with controls. However, conflicting evidence has shown elevated levels of these epigenetic markers in the same tissues. Furthermore, these modifications appear to be affected early on in tissues associated with the pathophysiology of AD. The presence of 5mC at the promoters of genes is generally associated with gene silencing. 5hmC, which is the product of 5mC oxidation by ten-eleven translocation (TET) enzymes, is thought to be associated with activation of gene expression, though the mechanisms underlying this activation are not fully understood.

Regardless of variations in the results of methylomic analysis across studies, it is known that the presence of 5hmC increases with differentiation and aging of cells in the brain. Furthermore, genes which have a high prevalence of 5hmC are also implicated in the pathology of other age-related neurodegenerative diseases, and are key regulators of ion transport, neuronal development, and cell death. For example, over-expression of 5-Lipoxygenase (5-LOX), an enzyme which generates pro-inflammatory mediators from arachidonic acid, in AD brains is associated with high prevalence of 5hmC at the 5-LOX gene promoter region.

Amyotrophic lateral sclerosis

DNA modifications at different transcriptional sites have been shown to contribute to neurodegenerative diseases. These include harmful transcriptional alterations such as those found in motor neuron functionality associated with amyotrophic lateral sclerosis (ALS). Degeneration of upper and lower motor neurons, which contributes to muscle atrophy in ALS patients, is linked to chromatin modifications among a group of key genes. One important site that is regulated by epigenetic events is the hexanucleotide repeat expansion in C9orf72 within chromosome 9p21. Hypermethylation of the C9orf72-related CpG islands is shown to be associated with repeat expansion in ALS-affected tissues. Overall, silencing of the C9orf72 gene may result in haploinsufficiency, and may therefore influence the presentation of disease. The activity of chromatin modifiers is also linked to prevalence of ALS. DNMT3A is an important methylating agent and has been shown to be present throughout the central nervous systems of those with ALS. Furthermore, over-expression of this de novo methyltransferase is also implicated in cell death of motor-neuron analogs.

Mutations in the FUS gene, which encodes an RNA/DNA binding protein, are causally linked to ALS. ALS patients with such mutations have increased levels of DNA damage. The protein encoded by the FUS gene is employed in the DNA damage response. It is recruited to DNA double-strand breaks and catalyzes recombinational repair of such breaks. In response to DNA damage, the FUS protein also interacts with histone deacetylase 1, a protein employed in epigenetic alteration of histones. This interaction is necessary for efficient DNA repair. These findings suggest that defects in epigenetic signalling and DNA repair contribute to the pathogenesis of ALS.

Neuro-oncology

A multitude of genetic and epigenetic changes in DNA profiles in brain cells are thought to be linked to tumorigenesis. These alterations, along with changes in protein functions, are shown to induce uncontrolled cell proliferation, expansion, and metastasis. While genetic events such as deletions, translocations, and amplification give rise to activation of oncogenes and deactivation of tumor suppressing genes, epigenetic changes silence or up-regulate these same genes through key chromatin modifications.

Neurotoxicity

Neurotoxicity refers to damage to the central or peripheral nervous system caused by chemical, biological, or physical exposure to toxins. Neurotoxicity can occur at any age and its effects may be short-term or long-term, depending on the mechanism of action of the neurotoxin and the degree of exposure.

Certain metals are considered essential due to their role in key biochemical and physiological pathways, while the remaining metals are characterized as nonessential. Nonessential metals do not serve a purpose in any biological pathway, and the presence and accumulation of most of them in the brain can lead to neurotoxicity. When present in the body, these nonessential metals compete with essential metals for binding sites and upset antioxidant balance, and their accumulation in the brain can lead to harmful effects, such as depression and intellectual disability. An increase in nonessential heavy metal concentrations in air, water and food sources, and household products has increased the risk of chronic exposure.

Acetylation, methylation and histone modification are some of the most common epigenetic markers. While these changes do not directly affect the DNA sequence, they are able to alter the accessibility of genetic components, such as the promoter or enhancer regions, necessary for gene expression. Studies have shown that long-term maternal exposure to lead (Pb) contributes to decreased methylation in areas of the fetal epigenome, for example the interspersed repetitive sequences (IRSs) Alu1 and LINE-1. The hypomethylation of these IRSs has been linked to increased risk for cancers and autoimmune diseases later in life. Additionally, studies have found a relationship between chronic prenatal Pb exposure and neurological diseases, such as Alzheimer's and schizophrenia, as well as developmental issues. Furthermore, the acetylation and methylation changes induced by overexposure to lead result in decreased neurogenesis and neuron differentiation ability, and consequently interfere with early brain development.

Overexposure to essential metals can also have detrimental consequences on the epigenome. For example, when manganese, a metal normally used by the body as a cofactor, is present at high concentrations in the blood it can negatively affect the central nervous system. Studies have shown that accumulation of manganese leads to dopaminergic cell death and consequently plays a role in the onset of Parkinson's disease (PD). A hallmark of Parkinson's disease is the accumulation of α-Synuclein in the brain. Increased exposure to manganese leads to the downregulation of protein kinase C delta (PKCδ) through decreased acetylation and results in the misfolding of the α-Synuclein protein that allows aggregation and triggers apoptosis of dopaminergic cells.

Research

The field has only recently seen a growth in interest, as well as in research, due to technological advancements that facilitate better resolution of the minute modifications made to DNA. However, even with the significant advances in technology, studying the biology of neurological phenomena, such as cognition and addiction, comes with its own set of challenges. Biological study of cognitive processes, especially with humans, has many ethical caveats. Some procedures, such as brain biopsies of people with Rett syndrome, usually call for a fresh tissue sample that can only be extracted from the brain of a deceased individual. In such cases, the researchers have no control over the age of the brain tissue sample, thereby limiting research options. In the case of addiction to substances such as alcohol, researchers utilize mouse models to mirror the human version of the disease. However, the mouse models are administered greater volumes of ethanol than humans normally consume in order to obtain more prominent phenotypes. Therefore, while the model organism and the tissue samples provide a useful approximation of the biology of neurological phenomena, these approaches do not provide a complete and precise picture of the exact processes underlying a phenotype or a disease.

Neuroepigenetics has also remained underdeveloped due to the controversy surrounding the classification of genetic modifications in matured neurons as epigenetic phenomena. This discussion arises because neurons do not undergo mitosis after maturation, yet the conventional definition of epigenetic phenomena emphasizes heritable changes passed on from parent to offspring. However, various epigenetic marks are placed by epigenetic modifiers such as DNA methyltransferases (DNMTs) in neurons, and these marks regulate gene expression throughout the life span of the neuron. The modifications heavily influence gene expression and the arrangement of synapses within the brain. Finally, although not inherited, most of these marks are maintained throughout the life of the cell once they are placed on chromatin.

Open educational resources

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Open_educational_resources UN...