Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural landmarks. In this branch of genomics, whole genomes or large parts of genomes resulting from genome projects are compared to study basic biological similarities and differences as well as evolutionary relationships between organisms. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start by making some form of alignment of genome sequences, looking for orthologous sequences (sequences that descend from a common ancestral sequence through speciation) in the aligned genomes, and checking to what extent those sequences are conserved. From these comparisons, genome and molecular evolution are inferred, and this may in turn be put in the context of, for example, phenotypic evolution or population genetics.
Comparative genomics effectively began in 1995, as soon as the whole genomes of two organisms (the bacteria Haemophilus influenzae and Mycoplasma genitalium) became available, and it is now a standard component of the analysis of every new genome sequence. With the explosion in the number of genome projects due to advances in DNA sequencing technologies, particularly the next-generation sequencing methods of the late 2000s, the field has become more sophisticated, making it possible to deal with many genomes in a single study. Comparative genomics has revealed high levels of similarity between closely related organisms, such as humans and chimpanzees, and, more surprisingly, similarity between seemingly distantly related organisms, such as humans and the yeast Saccharomyces cerevisiae. It has also shown the extreme diversity of gene composition in different evolutionary lineages.
History
Comparative genomics has its roots in the comparison of virus genomes in the early 1980s. For example, small RNA viruses infecting animals (picornaviruses) and those infecting plants (cowpea mosaic virus) were compared and turned out to share significant sequence similarity and, in part, the order of their genes. In 1986, the first comparative genomic study at a larger scale was published, comparing the genomes of varicella-zoster virus and Epstein-Barr virus, which contained more than 100 genes each.
The first complete genome sequence of a cellular organism, that of Haemophilus influenzae Rd, was published in 1995. The second genome sequencing paper, published in the same year, described the small parasitic bacterium Mycoplasma genitalium. Starting with this paper, reports on new genomes inevitably became comparative-genomic studies.
The first high-resolution whole-genome comparison system was developed in 1998 by Art Delcher, Simon Kasif and Steven Salzberg, who applied it with their collaborators at the Institute for Genomic Research (TIGR) to the comparison of entire genomes of closely related microbial organisms. The system is called MUMmer and was described in a publication in Nucleic Acids Research in 1999. It helps researchers identify large rearrangements, single-base mutations, reversals, tandem repeat expansions and other polymorphisms. In bacteria, MUMmer enables the identification of polymorphisms that are responsible for virulence, pathogenicity, and antibiotic resistance. The system was also applied to the Minimal Organism Project at TIGR and subsequently to many other comparative genomics projects.
Saccharomyces cerevisiae, the baker's yeast, was the first eukaryote to have its complete genome sequence published, in 1996. After the publication of the roundworm Caenorhabditis elegans genome in 1998, and together with the fruit fly Drosophila melanogaster genome in 2000, Gerald M. Rubin and his team published a paper titled "Comparative Genomics of the Eukaryotes", in which they compared the genomes of the eukaryotes D. melanogaster, C. elegans, and S. cerevisiae, as well as the prokaryote H. influenzae. At the same time, Bonnie Berger, Eric Lander, and their team published a paper on whole-genome comparison of human and mouse.
With the publication of the large genomes of vertebrates in the 2000s, including human, the Japanese pufferfish Takifugu rubripes, and mouse, precomputed results of large genome comparisons have been released for downloading or for visualization in a genome browser.
Instead of undertaking their own analyses, most biologists can access these large cross-species comparisons and so avoid computations that the size of the genomes would make impractical.
Next-generation sequencing methods, which were first introduced in 2007, have produced an enormous amount of genomic data and have allowed researchers to generate multiple (prokaryotic) draft genome sequences at once. These methods can also quickly uncover single-nucleotide polymorphisms, insertions and deletions by mapping unassembled reads against a well-annotated reference genome, and thus provide a list of possible gene differences that may be the basis for any functional variation among strains.
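The read-mapping idea described above can be illustrated with a minimal Python sketch. It assumes the reads have already been placed on the reference by some aligner (not shown) and are ungapped, so it only detects single-nucleotide differences, not insertions or deletions. The function name `call_snps` and the thresholds `min_depth` and `min_fraction` are hypothetical choices for illustration, not part of any published tool.

```python
from collections import Counter, defaultdict


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report positions where mapped reads consistently disagree with the reference.

    aligned_reads is an iterable of (start, sequence) pairs: 0-based start
    position on the reference and the ungapped read sequence (assumed to be
    already mapped by an external aligner).
    """
    # Pile up the bases that the reads place on each reference position.
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        # Keep a candidate only if coverage is sufficient and most reads
        # agree on a base that differs from the reference.
        if (depth >= min_depth
                and base != reference[pos]
                and support / depth >= min_fraction):
            snps.append((pos, reference[pos], base))
    return snps


reference = "ACGTACGTAC"
reads = [(0, "ACGTAGGT"), (2, "GTAGGTAC"), (4, "AGGTAC"), (1, "CGTAGG")]
snps = call_snps(reference, reads)
print(snps)  # [(5, 'C', 'G')]
```

Each reported tuple gives the position, the reference base, and the base supported by the reads. Real variant callers also model base quality, mapping quality and ploidy, which this sketch deliberately leaves out.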
Evolutionary principles
Evolution is a defining feature of biology, and evolutionary theory is the theoretical foundation of comparative genomics; in turn, the results of comparative genomics have greatly enriched and developed the theory of evolution. When two or more genome sequences are compared, the evolutionary relationships among the sequences can be inferred and represented as a phylogenetic tree. Drawing on a variety of genomic data and on the study of vertical and horizontal evolutionary processes, one can understand vital parts of gene structure and regulatory function.
Similarity of related genomes is the basis of comparative genomics. If two organisms have a recent common ancestor, the differences between their genomes have evolved from that ancestor's genome. The closer the relationship between two organisms, the greater the similarity between their genomes. Closely related genomes display conserved gene order (synteny): some or all of their genes remain in the same relative positions. Genome sequences can therefore be used to predict gene function by analyzing homology (inferred from sequence similarity) to genes of known function.
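Conserved gene order can be illustrated with a small Python sketch that scans two gene-order lists for collinear runs. It assumes that orthologous genes have already been given the same identifier in both genomes, that each identifier occurs once per genome, and that only same-orientation, strictly adjacent runs count (inversions and intervening insertions break a block). The function name `collinear_blocks` and the `min_genes` parameter are hypothetical.

```python
def collinear_blocks(order_a, order_b, min_genes=2):
    """Return runs of genes that are adjacent and in the same order in both genomes."""
    position_in_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []

    def close_current():
        # Store the run collected so far if it is long enough.
        if len(current) >= min_genes:
            blocks.append(list(current))
        current.clear()

    for gene in order_a:
        if gene not in position_in_b:
            # A gene missing from genome B interrupts any ongoing run.
            close_current()
            continue
        extends_run = (
            current
            and position_in_b[gene] == position_in_b[current[-1]] + 1
        )
        if not extends_run:
            close_current()
        current.append(gene)
    close_current()
    return blocks


order_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
order_b = ["g5", "g6", "g7", "x1", "g1", "g2", "g3"]
blocks = collinear_blocks(order_a, order_b)
print(blocks)  # [['g1', 'g2', 'g3'], ['g5', 'g6', 'g7']]
```

In this toy example the two genomes share two blocks whose relative positions have been swapped, the kind of rearrangement that gene-order comparison is meant to reveal.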
Orthologous sequences are related sequences in different species: when a species carrying a gene splits into two species, the copies of that gene in the two descendant species are orthologous to each other. Paralogous sequences arise by gene duplication: if a particular gene in the genome is copied, the two copies are paralogous to each other. A pair of orthologous sequences is called an orthologous pair (orthologs); a pair of paralogous sequences is called a paralogous pair (paralogs). Orthologs usually have the same or similar function, which is not necessarily the case for paralogs, whose sequences tend to evolve different functions.
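One widely used heuristic for proposing orthologous pairs, not described in the text above and shown here only as an illustration, is the reciprocal best hit: two genes are paired if each is the other's highest-scoring match in the other genome. The sketch below assumes similarity scores have already been computed by some alignment tool; the gene names and scores are invented for the example, and the function names are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    scores maps (query, target) pairs to a similarity score;
    on ties the first pair encountered is kept.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit in both directions."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: A2 and A3 both hit B2, as duplicated genes might.
a_vs_b = {("A1", "B1"): 950, ("A1", "B2"): 400,
          ("A2", "B2"): 870, ("A3", "B2"): 600}
b_vs_a = {("B1", "A1"): 940, ("B2", "A2"): 880,
          ("B2", "A3"): 590, ("B2", "A1"): 410}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('A1', 'B1'), ('A2', 'B2')]
```

Here A3 is left unpaired because B2's best match is A2. The heuristic is simple but can miss orthologs after lineage-specific duplications or gene loss, so it is a starting point rather than a definitive assignment.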
Comparative genomics exploits both similarities and differences in the proteins, RNA, and regulatory regions of different organisms to infer how selection has acted upon these elements. Those elements that are responsible for similarities between different species should be conserved through time (stabilizing selection), while those elements responsible for differences among species should be divergent (positive selection).
Finally, those elements that are unimportant to the evolutionary success of the organism will not be conserved (they evolve neutrally).
One of the important goals of the field is the identification of the mechanisms of eukaryotic genome evolution. This is often complicated, however, by the multiplicity of events that have taken place throughout the history of individual lineages, leaving only distorted and superimposed traces in the genome of each living organism. For this reason, comparative genomics studies of small model organisms (for example, Caenorhabditis elegans and the closely related Caenorhabditis briggsae) are of great importance in advancing our understanding of general mechanisms of evolution.
Methods
Computational approaches to genome comparison have recently become a common research topic in computer science. A public collection of case studies and demonstrations is growing, ranging from whole-genome comparisons to gene expression analysis. This has brought in ideas from other areas, including concepts from systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while multiple courses will begin training students to be fluent in both biology and computer science.
Tools
Computational tools for analyzing sequences and complete genomes are developing quickly due to the availability of large amounts of genomic data. At the same time, comparative analysis tools continue to progress and improve. Among the challenges of these analyses, visualizing the comparative results is particularly important.
Visualization of sequence conservation is a difficult task in comparative sequence analysis, because examining the alignment of long genomic regions manually is highly inefficient. Internet-based genome browsers provide many useful tools for investigating genomic sequences because they integrate all sequence-based biological information on genomic regions. When large amounts of relevant biological data must be extracted, they are easy to use and save time.
- UCSC Browser: This site contains the reference sequence and working draft assemblies for a large collection of genomes.
- Ensembl: The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
- MapView: The Map Viewer provides a wide variety of genome mapping and sequencing data.
- VISTA: a comprehensive suite of programs and databases for comparative analysis of genomic sequences. It was built to visualize the results of comparative analysis based on DNA alignments. The presentation of comparative data generated by VISTA suits both small- and large-scale data.
- BlueJay Genome Browser: a stand-alone visualization tool for the multi-scale viewing of annotated genomes and other genomic elements.
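Conservation plots of the kind produced by tools based on DNA alignments are commonly built from a sliding-window identity profile. The following Python sketch shows that idea under simple assumptions: the two sequences are already aligned to equal length with `-` marking gaps, and the window and step sizes are arbitrary. It is not the algorithm of any tool listed above, and the function name `window_identity` is hypothetical.

```python
def window_identity(aligned_a, aligned_b, window=10, step=5):
    """Percent identity in sliding windows over a pairwise alignment.

    Both inputs must be aligned sequences of equal length, with '-' for gaps.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        # A column counts as a match only if both bases agree and are not gaps.
        matches = sum(1 for x, y in columns if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile


aligned_a = "ACGTACGTACGGTTCAGGCA"
aligned_b = "ACGTACGTACGATT--GGTA"
profile = window_identity(aligned_a, aligned_b)
print(profile)  # [(0, 100.0), (5, 80.0), (10, 60.0)]
```

Plotting the second element of each tuple against the first gives a curve in which highly conserved regions appear as peaks, which is far easier to inspect than the raw alignment.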
An advantage of using online tools is that these websites are developed and updated constantly. New settings and content are added regularly and can be used online to improve efficiency.
Applications
Agriculture
Agriculture is a field that reaps the benefits of comparative genomics. Identifying the loci of advantageous genes is a key step in breeding crops that are optimized for greater yield, cost-efficiency, quality, and disease resistance. For example, one genome-wide association study conducted on 517 rice landraces revealed 80 loci associated with several categories of agronomic performance, such as grain weight, amylose content, and drought tolerance. Many of the loci were previously uncharacterized.
Not only is this methodology powerful, it is also quick. Previous methods of identifying loci associated with agronomic performance required several generations of carefully monitored breeding of parent strains, a time-consuming effort that is unnecessary for comparative genomic studies.
Medicine
The medical field also benefits from the study of comparative genomics. Vaccinology in particular has advanced through genomic approaches. In an approach known as reverse vaccinology, researchers can discover candidate antigens for vaccine development by analyzing the genome of a pathogen or a family of pathogens.
Applying a comparative genomics approach by analyzing the genomes of several related pathogens can lead to the development of vaccines that are multiprotective. A team of researchers employed such an approach to create a universal vaccine for Group B Streptococcus, a group of bacteria responsible for severe neonatal infection.
Comparative genomics can also be used to make vaccines specific to pathogens that are closely related to commensal microorganisms. For example, researchers used comparative genomic analysis of commensal and pathogenic strains of E. coli to identify pathogen-specific genes as a basis for finding antigens that elicit an immune response against pathogenic strains but not commensal ones.
In May 2019, using the Global Genome Set, a team in the UK and Australia sequenced thousands of globally collected isolates of Group A Streptococcus, providing potential targets for developing a vaccine against the pathogen, also known as S. pyogenes.
Research
Comparative genomics also opens up new avenues in other areas of research. As DNA sequencing technology has become more accessible, the number of sequenced genomes has grown. With the increasing reservoir of available genomic data, the power of comparative genomic inference has grown as well. A notable case of this increased power is found in recent primate research. Comparative genomic methods have allowed researchers to gather information about genetic variation, differential gene expression, and evolutionary dynamics in primates that could not be discerned using previous data and methods. The Great Ape Genome Project used comparative genomic methods to investigate genetic variation in the six great ape species, finding healthy levels of variation in their gene pool despite shrinking population size.
Another study showed that patterns of DNA methylation, a known mechanism for regulating gene expression, differ in the prefrontal cortex of humans versus chimpanzees, and implicated this difference in the evolutionary divergence of the two species.