https://en.wikipedia.org/wiki/Pan-genome
In the fields of molecular biology and genetics, a pan-genome (or supragenome) is the entire set of genes for all strains within a clade. The pan-genome includes: the core genome containing genes present in all strains within the clade, the accessory genome containing 'dispensable' genes present in a subset of the strains, and strain-specific genes. The study of the pan-genome is called pangenomics.
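The three-way split described above can be illustrated with a minimal sketch. It assumes each strain has already been reduced to a set of gene-family identifiers; the function name `partition_pan_genome` and the toy strain and gene names are hypothetical and do not come from any of the tools discussed in this article.

```python
from collections import Counter


def partition_pan_genome(strain_genes):
    """Split gene families into core, accessory and strain-specific sets.

    strain_genes maps a strain name to the set of gene families it carries.
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    pan = set(counts)
    # Core: present in every strain.
    core = {g for g, c in counts.items() if c == n_strains}
    # Strain-specific: present in exactly one strain (only meaningful with >1 strain).
    strain_specific = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    # Accessory: present in some, but not all, strains and in more than one.
    accessory = pan - core - strain_specific
    return {
        "pan": pan,
        "core": core,
        "accessory": accessory,
        "strain_specific": strain_specific,
    }


# Toy data (made up for illustration only).
toy = {
    "strainA": {"dnaA", "gyrB", "capX", "phage1"},
    "strainB": {"dnaA", "gyrB", "capX"},
    "strainC": {"dnaA", "gyrB", "tnpZ"},
}
parts = partition_pan_genome(toy)
print(sorted(parts["core"]))             # genes shared by all three strains
print(sorted(parts["accessory"]))        # genes shared by a subset
print(sorted(parts["strain_specific"]))  # genes found in a single strain
```

In practice the hard part is deciding which genes belong to the same family (homology clustering), which this sketch takes as given.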
Some species have open (or extensive) pan-genomes, while others have closed pan-genomes. For species with a closed pan-genome, very few genes are added per sequenced genome (after sequencing many strains), and the size of the full pan-genome can be theoretically predicted. Species with an open pan-genome have enough genes added per additional sequenced genome that predicting the size of the full pan-genome is impossible. Population size and niche versatility have been suggested as the most influential factors in determining pan-genome size.
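The open/closed distinction can be made concrete by counting how many previously unseen genes each additional genome contributes. The sketch below is only an illustration under stated assumptions: genomes are given as sets of gene-family identifiers, they are added in one fixed order (published analyses typically average over many orderings), and the function name `accumulation_curve` and the toy data are hypothetical.

```python
def accumulation_curve(genomes):
    """Return (pan_sizes, new_genes) as genomes are added in the given order."""
    seen = set()
    pan_sizes, new_genes = [], []
    for genes in genomes:
        fresh = set(genes) - seen      # gene families not seen in earlier genomes
        seen |= fresh
        new_genes.append(len(fresh))   # genes added by this genome
        pan_sizes.append(len(seen))    # running pan-genome size
    return pan_sizes, new_genes


# Toy data (made up): the first set stops yielding new genes, the second does not.
closed_like = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c", "d"}, {"a", "b", "d"}]
open_like = [{"a", "b", "c"}, {"a", "b", "d", "e"}, {"a", "b", "f", "g"}, {"a", "b", "h", "i"}]

closed_sizes, closed_new = accumulation_curve(closed_like)
open_sizes, open_new = accumulation_curve(open_like)
print(closed_new)  # new genes per genome falls to zero
print(open_new)    # new genes per genome stays above zero
```

A new-gene count that falls to zero corresponds to the closed case; a count that stays positive as genomes are added corresponds to the open case.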
Pan-genomes were originally constructed for species of bacteria and archaea, but more recently eukaryotic pan-genomes have been developed, particularly for plant species. Plant studies have shown that pan-genome dynamics are linked to transposable elements. The significance of the pan-genome arises in an evolutionary context, especially with relevance to metagenomics, but is also used in a broader genomics context.
History
Etymology
The term ‘pan-genome’ was defined with its current meaning by Tettelin et al. in 2005; it derives 'pan' from the Greek word παν, meaning 'whole' or 'everything', while genome
is a commonly used term to describe an organism's complete genetic
material. Tettelin et al. applied the term specifically to bacteria,
whose pan-genome "includes a core genome containing genes present in all
strains and a dispensable genome composed of genes absent from one or
more strains and genes that are unique to each strain."
Original concept
The original pan-genome concept was developed by Tettelin et al. when they sequenced six strains of Streptococcus agalactiae, whose genomes could be described as a core genome shared by all isolates, accounting for approximately 80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Extrapolation suggested that the gene reservoir in the S. agalactiae pan-genome is vast and that new unique genes would continue to be identified even after sequencing hundreds of genomes.
Examples
A similar pattern was found in Streptococcus pneumoniae when 44 strains were sequenced (see figure). With each new genome sequenced, fewer new genes were discovered. In fact, the predicted number of new genes dropped to zero once the number of genomes exceeded 50 (note, however, that this pattern is not found in all species). This would mean that S. pneumoniae has a 'closed pan-genome'. The main source of new genes in S. pneumoniae was Streptococcus mitis, from which genes were transferred horizontally. The pan-genome size of S. pneumoniae increased logarithmically with the number of strains and linearly with the number of polymorphic sites of the sampled genomes, suggesting that acquired genes accumulate proportionately to the age of clones.
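A logarithmic relationship of this kind, size ≈ a + b·ln(n) for n sequenced strains, can be checked with an ordinary least-squares fit on ln(n). The sketch below is a generic illustration, not the method used in the S. pneumoniae study; the function name `fit_log_growth` is hypothetical and the input values are synthetic numbers generated for the example, not measured data.

```python
import math


def fit_log_growth(pan_sizes):
    """Least-squares fit of size = a + b * ln(n) for n = 1..len(pan_sizes)."""
    if len(pan_sizes) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(n) for n in range(1, len(pan_sizes) + 1)]
    ys = list(pan_sizes)
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # genes gained per unit increase in ln(n)
    a = mean_y - b * mean_x  # fitted size for a single genome (ln(1) = 0)
    return a, b


# Synthetic sizes that follow an exact logarithmic law (illustration only).
toy_sizes = [2000 + 300 * math.log(n) for n in range(1, 11)]
a, b = fit_log_growth(toy_sizes)
print(round(a, 3), round(b, 3))
```

With real data the points scatter around the fitted line, and comparing the residuals of this fit with those of alternative growth models is one way to judge whether logarithmic growth is a reasonable description.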
A further example can be seen in a comparison of the sizes of the core genome and the pan-genome of Prochlorococcus. The core genome is necessarily much smaller than the pan-genome, which is drawn on by the different ecotypes of Prochlorococcus. A 2015 study of Prevotella bacteria isolated from humans compared the gene repertoires of species derived from different body sites. It also reported an open pan-genome with a highly diverse gene pool.
Software tools
As interest in pan-genomes increased, a number of software tools were developed to help analyze this kind of data. In 2015, a group reviewed the different kinds of analyses and tools available to researchers. Software for pan-genome analysis covers seven kinds of analysis: clustering homologous genes; identifying SNPs; plotting pangenomic profiles; building phylogenetic relationships of orthologous genes/families of strains/isolates; function-based searching; annotation and/or curation; and visualization.
The two most cited software tools at the end of 2014 were Panseq and the pan-genomes analysis pipeline (PGAP). Other options include BPGA – A Pan-Genome Analysis Pipeline for prokaryotic genomes, GET_HOMOLOGUES, Roary and PanDelos.
A review focused on plant pan-genomes was published in 2015. The first software designed for plant pan-genomes was GET_HOMOLOGUES-EST.