Nucleic acid secondary structure is the set of base-pairing interactions within a single nucleic acid polymer or between two polymers. It can be represented as a list of bases which are paired in a nucleic acid molecule.
The secondary structures of biological DNAs and RNAs tend to be different: biological DNA mostly exists as fully base paired
double helices, while biological RNA is single stranded and often forms
complex and intricate base-pairing interactions due to its increased
ability to form hydrogen bonds stemming from the extra hydroxyl group in the ribose sugar.
In a non-biological context, secondary structure is a vital consideration in the design of nucleic acid structures for DNA nanotechnology and DNA computing, since the pattern of base pairing ultimately determines the overall structure of the molecules.
Fundamental concepts
Base pairing
Top, an AT base pair demonstrating two intermolecular hydrogen bonds; bottom, a GC base pair demonstrating three intermolecular hydrogen bonds.
In molecular biology, two nucleotides on opposite complementary DNA or RNA strands that are connected via hydrogen bonds are called a base pair (often abbreviated bp). In the canonical Watson-Crick base pairing, adenine (A) forms a base pair with thymine (T) and guanine (G) forms one with cytosine (C) in DNA. In RNA, thymine is replaced by uracil (U). Alternate hydrogen bonding patterns, such as the wobble base pair and Hoogsteen base pair, also occur, particularly in RNA, giving rise to complex and functional tertiary structures. Importantly, pairing is the mechanism by which codons on messenger RNA molecules are recognized by anticodons on transfer RNA during protein translation.
Some DNA- or RNA-binding enzymes can recognize specific base pairing
patterns that identify particular regulatory regions of genes.
Hydrogen bonding
is the chemical mechanism that underlies the base-pairing rules
described above. Appropriate geometrical correspondence of hydrogen bond
donors and acceptors allows only the "right" pairs to form stably. DNA
with high GC-content is more stable than DNA with low GC-content, but contrary to popular belief, the hydrogen bonds do not stabilize the DNA significantly and stabilization is mainly due to stacking interactions.
The larger nucleobases, adenine and guanine, are members of a class of doubly ringed chemical structures called purines;
the smaller nucleobases, cytosine and thymine (and uracil), are members
of a class of singly ringed chemical structures called pyrimidines.
Purines are only complementary with pyrimidines: pyrimidine-pyrimidine
pairings are energetically unfavorable because the molecules are too far
apart for hydrogen bonding to be established; purine-purine pairings
are energetically unfavorable because the molecules are too close,
leading to overlap repulsion. The only other possible pairings are GT
and AC; these pairings are mismatches because the pattern of hydrogen
donors and acceptors does not correspond. The GU wobble base pair, with two hydrogen bonds, does occur fairly often in RNA.
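The pairing rules above can be written as a small lookup. The following is a minimal sketch, not a thermodynamic model: the function name classify_pair and the three labels are illustrative choices, and Hoogsteen or other non-canonical geometries are deliberately left out.

```python
# Canonical Watson-Crick pairs (DNA and RNA) plus the G-U wobble pair.
WATSON_CRICK = {frozenset("AT"), frozenset("AU"), frozenset("GC")}
WOBBLE = {frozenset("GU")}


def classify_pair(a: str, b: str) -> str:
    """Return 'watson-crick', 'wobble', or 'mismatch' for two base letters."""
    pair = frozenset((a.upper(), b.upper()))
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    # Covers purine-purine, pyrimidine-pyrimidine, and the GT / AC mismatches.
    return "mismatch"
```

For instance, classify_pair("G", "U") is reported as a wobble pair, while classify_pair("G", "T") and classify_pair("A", "C") are reported as mismatches.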
Hybridization is the process of complementary base pairs binding to form a double helix.
Melting is the process by which the interactions between the strands
of the double helix are broken, separating the two nucleic acid strands.
These bonds are weak, easily separated by gentle heating, enzymes, or physical force. Melting occurs preferentially at certain points in the nucleic acid. T and A rich sequences are more easily melted than C and G rich regions. Particular base steps are also susceptible to DNA melting, particularly T A and T G base steps. These mechanical features are reflected by the use of sequences such as TATAA at the start of many genes to assist RNA polymerase in melting the DNA for transcription.
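As a rough illustration of why AT-rich stretches such as TATAA melt more easily, the fraction of G and C bases in a sequence can be computed directly. This is only a crude proxy under the assumption that composition alone is of interest; the function name gc_content is illustrative, and realistic melting predictions depend on base stacking and neighbouring steps, which this sketch ignores.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Each comparison yields True/False, which sum() counts as 1/0.
    return sum(base in "GC" for base in seq) / len(seq)
```

A value near 0, as for TATAA, marks a region expected to separate more readily than a region with a value near 1.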
Strand separation by gentle heating, as used in PCR,
is simple providing the molecules have fewer than about 10,000 base
pairs (10 kilobase pairs, or 10 kbp). The intertwining of the DNA
strands makes long segments difficult to separate. The cell avoids this
problem by allowing its DNA-melting enzymes (helicases) to work concurrently with topoisomerases, which can chemically cleave the phosphate backbone of one of the strands so that it can swivel around the other. Helicases unwind the strands to facilitate the advance of sequence-reading enzymes such as DNA polymerase.
Nucleic acid secondary structure is generally divided into helices
(contiguous base pairs), and various kinds of loops (unpaired
nucleotides surrounded by helices). Frequently these elements, or
combinations of them, are further classified into additional categories
including, for example, tetraloops, pseudoknots, and stem-loops.
The double helix is an important tertiary structure
in nucleic acid molecules which is intimately connected with the
molecule's secondary structure. A double helix is formed by regions of
many consecutive base pairs.
The nucleic acid double helix is a spiral polymer, usually right-handed, containing two nucleotide strands which base pair
together. A single turn of the helix constitutes about ten
nucleotides, and contains a major groove and minor groove, the major
groove being wider than the minor groove.
Given the difference in widths of the major groove and minor groove,
many proteins which bind to DNA do so through the wider major groove. Many double-helical forms are possible; for DNA the three biologically relevant forms are A-DNA, B-DNA, and Z-DNA, while RNA double helices have structures similar to the A form of DNA.
The secondary structure of nucleic acid molecules can often be uniquely decomposed into stems and loops. The stem-loop
structure (often called a "hairpin"), in which a
base-paired helix ends in a short unpaired loop, is extremely common and
is a building block for larger structural motifs such as cloverleaf
structures, which are four-helix junctions such as those found in transfer RNA.
Internal loops (a short series of unpaired bases in a longer paired
helix) and bulges (regions in which one strand of a helix has "extra"
inserted bases with no counterparts in the opposite strand) are also
frequent.
An RNA pseudoknot structure, for example the RNA component of human telomerase.
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. Pseudoknots fold into knot-shaped three-dimensional conformations but are not true topological knots. The base pairing
in pseudoknots is not well nested; that is, base pairs occur that
"overlap" one another in sequence position. This makes the presence of
general pseudoknots in nucleic acid sequences impossible to predict by the standard method of dynamic programming,
which uses a recursive scoring system to identify paired stems and
consequently cannot detect non-nested base pairs with common algorithms.
However, limited subclasses of pseudoknots can be predicted using
modified dynamic programs.
Newer structure prediction techniques such as stochastic context-free grammars are also unable to consider pseudoknots.
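The "not well nested" condition can be checked mechanically when a structure is given as a list of paired positions: two base pairs (i, j) and (k, l) overlap when i < k < j < l. The sketch below assumes 0-based positions and uses the illustrative names crossing_pairs and is_nested; it detects crossings but does not classify the type of pseudoknot.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every two base pairs (i, j), (k, l) that satisfy i < k < j < l."""
    # Order each pair as (lower, upper) and sort so that p starts before q.
    normalized = sorted(tuple(sorted(pair)) for pair in pairs)
    return [
        (p, q)
        for p, q in combinations(normalized, 2)
        if p[0] < q[0] < p[1] < q[1]
    ]


def is_nested(pairs):
    """True if no two base pairs cross, i.e. the structure has no pseudoknot."""
    return not crossing_pairs(pairs)
```

A simple hairpin such as [(0, 9), (1, 8), (2, 7)] is nested, whereas adding a second stem that pairs loop positions with positions beyond the first stem produces crossings.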
Pseudoknots can form a variety of structures with catalytic activity
and several important biological processes rely on RNA molecules that
form pseudoknots. For example, the RNA component of the human telomerase contains a pseudoknot that is critical for its activity. The hepatitis delta virus ribozyme is a well known example of a catalytic RNA with a pseudoknot in its active site. Though DNA can also form pseudoknots, they are generally not present in standard physiological conditions.
Most methods for nucleic acid secondary structure prediction rely on a nearest neighbor thermodynamic model. A common method to determine the most probable structures given a sequence of nucleotides makes use of a dynamic programming algorithm that seeks to find structures with low free energy. Dynamic programming algorithms often forbid pseudoknots,
or other cases in which base pairs are not fully nested, as considering
these structures becomes computationally very expensive for even small
nucleic acid molecules. Other methods, such as stochastic context-free grammars, can also be used to predict nucleic acid secondary structure.
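To make the dynamic programming idea concrete, the sketch below maximizes the number of nested base pairs (a Nussinov-style recursion). This is a deliberately simplified stand-in: it counts pairs instead of evaluating nearest-neighbor free energies, and the function name max_base_pairs, the allowed pair set, and the min_loop parameter (minimum number of unpaired bases in a hairpin loop) are assumptions made for illustration.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in an RNA sequence."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] holds the optimum for the subsequence seq[i..j] inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

Because the recursion only combines solutions of non-overlapping subsequences, it can never produce crossing pairs, which is the restriction on pseudoknots described above.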
For many RNA molecules, the secondary structure is highly
important to the correct function of the RNA — often more so than the
actual sequence. This fact aids in the analysis of non-coding RNA, sometimes termed "RNA genes". One application of bioinformatics uses predicted RNA secondary structures in searching a genome for noncoding but functional forms of RNA. For example, microRNAs have canonical long stem-loop structures interrupted by small internal loops.
RNA secondary structure applies in RNA splicing in certain species. In humans and other tetrapods, it has been shown that without the U2AF2 protein, the splicing process is inhibited. However, in zebrafish and other teleosts the RNA splicing
process can still occur on certain genes in the absence of U2AF2. This
may be because 10% of genes in zebrafish have alternating TG and AC base
pairs at the 3' splice site (3'ss) and 5' splice site (5'ss)
respectively on each intron, which alters the secondary structure of the
RNA. This suggests that secondary structure of RNA can influence
splicing, potentially without the use of proteins like U2AF2 that have
been thought to be required for splicing to occur.
Secondary structure determination
RNA secondary structure can be determined from atomic coordinates (tertiary structure) obtained by X-ray crystallography, often deposited in the Protein Data Bank. Current methods include 3DNA/DSSR and MC-annotate.
The nucleic acid notation currently in use was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970.
This universally accepted notation uses the Roman characters G, C, A,
and T, to represent the four nucleotides commonly found in deoxyribonucleic acids (DNA).
Given the rapidly expanding role for genetic sequencing,
synthesis, and analysis in biology, some researchers have developed
alternate notations to further support the analysis and manipulation of
genetic data. These notations generally exploit size, shape, and
symmetry to accomplish these objectives.
Degenerate base symbols in biochemistry are an IUPAC representation for a position on a DNA sequence that can have multiple possible alternatives. These should not be confused with non-canonical bases
because each particular sequence will in fact have one of the regular
bases. These symbols are used to encode the consensus sequence of a population
of aligned sequences and are used, for example, in phylogenetic analysis to summarise multiple sequences into one, or for BLAST searches, even though IUPAC degenerate symbols are masked (as they are not coded).
Under the commonly used IUPAC system, nucleobases are represented by the first letters of their chemical names: guanine, cytosine, adenine, and thymine. This shorthand also includes eleven "ambiguity" characters associated with every possible combination of the four DNA bases. The ambiguity characters were designed to encode positional variations in order to report DNA sequencing errors, consensus sequences, or single-nucleotide polymorphisms. The IUPAC notation, including ambiguity characters and suggested mnemonics, is shown in Table 1.
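The four base letters and eleven ambiguity characters can be stored as a lookup table and used, for example, to turn a degenerate pattern into a regular expression. The mapping below follows the standard IUPAC assignments; the names IUPAC and iupac_to_regex are illustrative, and the sketch assumes DNA letters only (no U, gaps, or lower-case masking conventions).

```python
import re

# Each symbol maps to the set of bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile a degenerate IUPAC pattern into an equivalent regex."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]  # raises KeyError for unknown symbols
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))
```

With this table, the pattern GGWCC matches GGACC and GGTCC but not GGGCC, since W stands for the weakly pairing bases A and T.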
Despite its broad and nearly universal acceptance, the IUPAC
system has a number of limitations, which stem from its reliance on the
Roman alphabet. The poor legibility of upper-case Roman characters,
which are generally used when displaying genetic data, may be chief
among these limitations. The value of external projections in
distinguishing letters has been well documented.
However, these projections are absent from upper case letters, which in
some cases are only distinguishable by subtle internal cues. Take for
example the upper case C and G used to represent cytosine and guanine.
These characters generally comprise half the characters in a genetic
sequence but are differentiated by a small internal tick (depending on
the typeface). Nevertheless, these Roman characters are available in the
ASCII character set most commonly used in textual communications, which reinforces this system's ubiquity.
Another shortcoming of the IUPAC notation arises from the fact
that its eleven ambiguity characters have been selected from the
remaining characters of the Roman alphabet. The authors of the notation
endeavored to select ambiguity characters with logical mnemonics. For
example, S is used to represent the possibility of finding cytosine or
guanine at genetic loci, both of which form strong cross-strand binding interactions. Conversely, the weaker
interactions of thymine and adenine are represented by a W. However,
convenient mnemonics are not as readily available for the other
ambiguity characters displayed in Table 1. This has made ambiguity
characters difficult to use and may account for their limited
application.
The positions of the carbons in the ribose
sugar that forms the backbone of the nucleic acid chain are numbered,
and are used to indicate the direction of nucleic acids (5'->3'
versus 3'->5'). This is referred to as directionality.
Alternative visually enhanced notations
Legibility
issues associated with IUPAC-encoded genetic data have led biologists
to consider alternative strategies for displaying genetic data. These
creative approaches to visualizing DNA sequences have generally relied
on the use of spatially distributed symbols and/or visually distinct
shapes to encode lengthy nucleic acid sequences. Alternative notations
for nucleotide sequences have been attempted; however, general uptake has
been low. Several of these approaches are summarized below.
Stave projection
The Stave Projection uses spatially distributed dots to enhance the legibility of DNA sequences.
In 1986, Cowin et al. described a novel method for visualizing DNA sequence known as the Stave Projection.
Their strategy was to encode nucleotides as circles on a series of
horizontal bars akin to notes on a musical stave. As illustrated in Figure
1, each gap on the five-line staff corresponded to one of the four DNA
bases. The spatial distribution of the circles made it far easier to
distinguish individual bases and compare genetic sequences than
IUPAC-encoded data.
The order of the bases (from top to bottom, G, A, T, C) is chosen
so that the complementary strand can be read by turning the projection
upside down.
Geometric symbols
Zimmerman et al. took a different approach to visualizing genetic data.
Rather than relying on spatially distributed circles to highlight
genetic features, they exploited four geometrically diverse symbols
found in a standard computer font to distinguish the four bases. The
authors developed a simple WordPerfect macro to translate IUPAC
characters into the more visually distinct symbols.
DNA Skyline
With
the growing availability of font editors, Jarvius and Landegren devised
a novel set of genetic symbols, known as the DNA Skyline font, which
uses increasingly taller blocks to represent the different DNA bases. While reminiscent of Cowin et al.'s
spatially distributed Stave Projection, the DNA Skyline font is easy to
download and permits translation to and from the IUPAC notation by
simply changing the font in most standard word processing applications.
Ambigraphic notations
AmbiScript uses ambigrams to reflect DNA symmetries and support the manipulation and analysis of genetic data.
Ambigrams
(symbols that convey different meaning when viewed in a different
orientation) have been designed to mirror structural symmetries found in
the DNA double helix.
By assigning ambigraphic characters to complementary bases (i.e.
guanine: b, cytosine: q, adenine: n, and thymine: u), it is possible to
complement DNA sequences by simply rotating the text 180 degrees.
An ambigraphic nucleic acid notation also makes it easy to identify
genetic palindromes, such as endonuclease restriction sites, as sections
of text that can be rotated 180 degrees without changing the sequence.
One example of an ambigraphic
nucleic acid notation is AmbiScript, a rationally designed nucleic acid
notation that combines many of the visual and functional features of
its predecessors.
Its notation also uses spatially offset characters to facilitate the
visual review and analysis of genetic data. AmbiScript was also designed
to indicate ambiguous nucleotide positions via compound symbols. This
strategy aimed to offer a more intuitive solution to the use of
ambiguity characters first proposed by the IUPAC.
As with Jarvius and Landegren's DNA Skyline fonts, AmbiScript fonts can
be downloaded and applied to IUPAC-encoded sequence data.
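In plain IUPAC text, the operation that an ambigraphic notation performs by rotating the page corresponds to taking the reverse complement, and a genetic palindrome is a sequence equal to its own reverse complement. The following sketch assumes unambiguous DNA letters only; the names reverse_complement and is_palindromic are illustrative.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence reads the same on both strands."""
    return seq.upper() == reverse_complement(seq)
```

For example, GAATTC (the EcoRI recognition site) is its own reverse complement, whereas GATTACA is not.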
Triple Helix Base Pairing
Watson and Crick base pairs are indicated by a "•" or a "-" or a "." (example: A•T, or poly(rC)•2poly(rC)).
Hoogsteen triple helix base pairs are indicated by a "*" or a ":" (example: C•G*G+, or T•A*T, or C•G*G, or T•A*A).
The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open-source chemistry community.
History
The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s. Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College)
for supporting the work, and Arthur Weininger (Pomona; Daylight CIS)
and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in
programming the system." The Environmental Protection Agency funded the initial project to develop SMILES.
In July 2006, the IUPAC introduced the InChI
as a standard for formula representation. SMILES is generally
considered to have the advantage of being more human-readable than
InChI; it also has a wide base of software support with extensive
theoretical backing (such as graph theory).
Terminology
The
term SMILES refers to a line notation for encoding molecular structures
and specific instances should strictly be called SMILES strings.
However, the term SMILES is also commonly used to refer to both a single
SMILES string and a number of SMILES strings; the exact meaning is
usually apparent from the context. The terms "canonical" and "isomeric"
can lead to some confusion when applied to SMILES. The terms describe
different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol.
Algorithms have been developed to generate the same SMILES string for a
given molecule; of the many possible strings, these algorithms choose
only one of them. This SMILES is unique for each structure, although
dependent on the canonicalization
algorithm used to generate it, and is termed the canonical SMILES.
These algorithms first convert the SMILES to an internal representation
of the molecular structure; an algorithm then examines that structure
and produces a unique SMILES string. Various algorithms for generating
canonical SMILES have been developed and include those by Daylight
Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.
The original paper that described the CANGEN
algorithm claimed to generate unique SMILES strings for graphs
representing molecules, but the algorithm fails for a number of simple
cases (e.g. cuneane, 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically. There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.
SMILES notation allows the specification of configuration at tetrahedral centers,
and double bond geometry. These are structural features that cannot be
specified by connectivity alone, and therefore SMILES which encode this
information are termed isomeric SMILES. A notable feature of these
rules is that they allow rigorous partial specification of chirality.
The term isomeric SMILES is also applied to SMILES in which isomers are specified.
Graph-based definition
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree.
Where cycles have been broken, numeric suffix labels are included to
indicate the connected nodes. Parentheses are used to indicate points of
branching on the tree.
The resultant SMILES form depends on the choices:
of the bonds chosen to break cycles,
of the starting atom used for the depth-first traversal, and
of the order in which branches are listed when encountered.
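The procedure can be sketched for a hydrogen-suppressed graph as follows. This is a minimal illustration, not a canonicalization algorithm: it assumes a connected molecule with single bonds only and fewer than ten ring closures, and the function name write_smiles and its input format (a list of atom symbols plus a list of index pairs) are hypothetical.

```python
def write_smiles(symbols, bonds, start=0):
    """Write a SMILES string by depth-first traversal from atom `start`."""
    adjacency = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    parent = {start: None}                   # spanning-tree parent of each atom
    children = {i: [] for i in adjacency}    # spanning-tree children, in order
    closures = {i: [] for i in adjacency}    # ring-closure labels on each atom
    closed = set()                           # bonds broken to open a cycle

    def explore(atom):
        for neighbor in adjacency[atom]:
            if neighbor == parent[atom]:
                continue
            if neighbor not in parent:
                parent[neighbor] = atom
                children[atom].append(neighbor)
                explore(neighbor)
            elif frozenset((atom, neighbor)) not in closed:
                # Bond to an already visited atom: break it and label both ends.
                closed.add(frozenset((atom, neighbor)))
                label = len(closed)
                closures[atom].append(label)
                closures[neighbor].append(label)

    def emit(atom):
        text = symbols[atom] + "".join(str(label) for label in closures[atom])
        kids = children[atom]
        for child in kids[:-1]:              # all but the last child are branches
            text += "(" + emit(child) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    explore(start)
    return emit(start)
```

Changing the starting atom or the order in which bonds are listed changes the output, which illustrates the three choices named above: for ethanol, starting from either end gives CCO or OCC, while starting from the middle carbon gives a branched form.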
SMILES definition as strings of a context-free language
From
the viewpoint of formal language theory, a SMILES string is a word. A SMILES
string can be parsed with a context-free parser. This representation
has been used in the prediction of biochemical properties (incl. toxicity and
biodegradability)
based on the main principle of chemoinformatics that similar molecules
have similar properties. The predictive models implemented a syntactic
pattern recognition approach (which involved defining a molecular
distance) as well as a more robust scheme based on statistical pattern recognition.
Description
Atoms
Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which:
are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, and
have the number of hydrogens attached implied by the SMILES valence
model (typically their normal valence, but for N and P it is 3 or 5, and
for S it is 2, 4 or 6), and
have no formal charge, are the normal isotope, and are not chiral centers.
All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2]. Hydrogen may also be written as a separate atom; water may also be written as [H]O[H].
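The implied hydrogen count for an unbracketed atom can be sketched as a table lookup using the valences listed above. The rule used here, taking the lowest normal valence that is at least the sum of the explicit bond orders, is an assumption about how the valence model is commonly applied; the names NORMAL_VALENCES and implicit_hydrogens are illustrative, and aromatic atoms are not handled.

```python
# Normal valences of the organic subset, as given in the text.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol: str, bond_order_sum: int) -> int:
    """Hydrogens implied for an unbracketed atom with the given bond orders."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    # More bonds than the highest listed valence: no hydrogens are implied.
    return 0
```

Under this rule a lone O carries two hydrogens (water), and the terminal carbon of CCO, with one explicit single bond, carries three.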
When brackets are used, the symbol H is added if the
atom in brackets is bonded to one or more hydrogens, followed by the
number of hydrogen atoms if greater than 1, then by the sign + for a positive charge or by - for a negative charge. For example, [NH4+] for ammonium (NH4+).
If there is more than one charge, it is normally written as a digit;
however, it is also possible to repeat the sign as many times as the ion
has charges: one may write either [Ti+4] or [Ti++++] for titanium(IV) Ti4+. Thus, the hydroxide anion (OH−) is represented by [OH-], the hydronium cation (H3O+) is [OH3+] and the cobalt(III) cation (Co3+) is either [Co+3] or [Co+++].
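The bracket-atom conventions described here can be captured by a small regular expression. The sketch below is an assumption-laden subset: it accepts an optional isotope number, an element symbol, an optional hydrogen count, and an optional charge in either the digit or the repeated-sign form, but it does not handle chirality marks or atom-map numbers. The names BRACKET_ATOM and parse_bracket_atom are illustrative.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"            # optional isotope mass, e.g. 14
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"  # element symbol, or lower-case aromatic
    r"(?P<hcount>H\d*)?"              # H, optionally followed by a count
    r"(?P<charge>[+-]\d+|\++|-+)?\]"  # +4 style or repeated ++++ style
)


def parse_bracket_atom(text: str) -> dict:
    """Split a bracket atom such as [NH4+] into its components."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported bracket atom: {text!r}")
    hcount = match["hcount"]
    hydrogens = 0 if hcount is None else int(hcount[1:] or 1)
    charge_text = match["charge"] or ""
    charge = 0
    if charge_text:
        sign = 1 if charge_text[0] == "+" else -1
        digits = charge_text[1:]
        # "+4" carries an explicit digit; "++++" is counted by its length.
        magnitude = int(digits) if digits.isdigit() else len(charge_text)
        charge = sign * magnitude
    isotope = int(match["isotope"]) if match["isotope"] else None
    return {"isotope": isotope, "symbol": match["symbol"],
            "hydrogens": hydrogens, "charge": charge}
```

Both [Ti+4] and [Ti++++] parse to the same charge of 4, and [OH-] yields one hydrogen with a charge of -1.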
Bonds
A bond is represented using one of the symbols . - = # $ : / \.
Bonds between aliphatic
atoms are assumed to be single unless specified otherwise and are
implied by adjacency in the SMILES string. Although single bonds may be
written as -, this is usually omitted. For example, the SMILES for ethanol may be written as C-C-O, CC-O or C-CO, but is usually written CCO.
Double, triple, and quadruple bonds are represented by the symbols =, #, and $ respectively as illustrated by the SMILES O=C=O (carbon dioxide CO2), C#N (hydrogen cyanide HCN) and [Ga+]$[As-] (gallium arsenide).
An additional type of bond is a "non-bond", indicated with ., to indicate that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation.
An aromatic "one and a half" bond may be indicated with :; see § Aromaticity below.
Single bonds adjacent to double bonds may be represented using / or \ to indicate stereochemical configuration; see § Stereochemistry below.
Rings
Ring
structures are written by breaking each ring at an arbitrary point
(although some choices will lead to a more legible SMILES than others)
to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin (decahydronaphthalene) may be written as C1CCCC2C1CCCC2.
SMILES does not require that ring numbers be used in any
particular order, and permits ring number zero, although this is rarely
used. Also, it is permitted to reuse ring numbers after the first ring
has closed, although this usually makes formulae harder to read. For
example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2, but it may also be written as C0CCCCC0C0CCCCC0.
Multiple digits after a single atom indicate multiple
ring-closing bonds. For example, an alternative SMILES notation for
decalin is C1CCCC2CCCCC12, where the final carbon
participates in both ring-closing bonds 1 and 2. If two-digit ring
numbers are required, the label is preceded by %, so C%12 is a single ring-closing bond of ring 12.
Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. For example, cyclopropene is usually written C1=CC1, but if the double bond is chosen as the ring-closing bond, it may be written as C=1CC1, C1CC=1, or C=1CC=1. (The first form is preferred.) C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond.
Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene. However, they may be used with non-bonds; C1.C2.C12 is a peculiar but legal alternative way to write propane, more commonly written CCC.
Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O; choosing a different ring-break location produces a branched structure that requires parentheses to write.
Aromaticity
Aromatic rings such as benzene may be written in one of three forms:
In Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1,
Using the aromatic bond symbol :, e.g. C1:C:C:C:C:C1, or
Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms b, c, n, o, p and s, respectively.
In the latter case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1.
Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1.
When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol -
is required. (In fact, most SMILES software can correctly infer that
the bond between the two rings cannot be aromatic and so will accept the
nonstandard form c1ccccc1c2ccccc2.)
The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.
Visualization of 3-cyanoanisole as COc(c1)cccc1C#N.
Branching
Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform.
The first atom within the parentheses, and the first atom after the
parenthesized group, are both bonded to the same branch point atom. The
bond symbol must appear inside the parentheses; placing it outside (e.g. CCC=(O)O) is invalid.
Substituted rings can be written with the branching point in the ring as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N,
which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for
substituted rings in this way can make them more human-readable.
Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br,
or the like. Generally, a SMILES form is easiest to read if the
simpler branch comes first, with the final, unparenthesized portion
being the most complex. The only caveats to such rearrangements are:
If ring numbers are reused, they are paired according to their
order of appearance in the SMILES string. Some adjustments may be
required to preserve the correct pairing.
If stereochemistry is specified, adjustments must be made; see Stereochemistry § Notes below.
The one form of branch that does not require parentheses is the
ring-closing bond. Choosing ring-closing bonds appropriately can
reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C.
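The bond, ring-closure, and branching rules described so far can be combined into a small reader that turns a SMILES string into an atom list and a bond list. This is a sketch for a subset of the notation under stated assumptions: bracket atoms are kept as opaque tokens, only the listed organic-subset symbols are recognized, and directional bonds are recorded but not interpreted. The names TOKEN, BOND_SYMBOLS, and parse_smiles are illustrative.

```python
import re

TOKEN = re.compile(
    r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#$:/\\.()]"
)
BOND_SYMBOLS = set("-=#$:/\\")


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (i, j, symbol), '' meaning implied."""
    atoms, bonds = [], []
    previous = None     # atom that the next atom or ring digit attaches to
    pending = None      # explicit bond symbol waiting for its second atom
    branch_stack = []   # attachment points saved at each "("
    open_rings = {}     # ring label -> (atom index, bond symbol or None)
    position = 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError(f"unexpected character at position {position}")
        token = match.group()
        position = match.end()
        if token == "(":
            branch_stack.append(previous)
        elif token == ")":
            if not branch_stack:
                raise ValueError("unmatched closing parenthesis")
            previous = branch_stack.pop()
        elif token == ".":
            previous = None  # non-bond: the next atom starts a new fragment
        elif token in BOND_SYMBOLS:
            pending = token
        elif token[0] in "%0123456789":
            label = int(token.lstrip("%"))
            if label in open_rings:
                partner, symbol = open_rings.pop(label)
                if symbol and pending and symbol != pending:
                    raise ValueError("conflicting ring-closure bond types")
                bonds.append((partner, previous, pending or symbol or ""))
            else:
                open_rings[label] = (previous, pending)
            pending = None
        else:  # an atom token
            atoms.append(token)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1, pending or ""))
            previous = len(atoms) - 1
            pending = None
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds
```

Because a ring label is released as soon as it closes, reused labels such as the 0 in C0CCCCC0C0CCCCC0 are paired in order of appearance, and the conflicting form C=1CC-1 is rejected.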
Stereochemistry
SMILES permits, but does not require, specification of stereoisomers.
Configuration around double bonds is specified using the characters / and \ to show directional single bonds adjacent to a double bond. For example, F/C=C/F is one representation of trans-1,2-difluoroethylene, in which the fluorine atoms are on opposite sides of the double bond (as shown in the figure), whereas F/C=C\F is one possible representation of cis-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond.
Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F.
When alternating single-double bonds are present, the groups are
larger than two, with the middle directional symbols being adjacent to
two double bonds. For example, the common form of (2,4)-hexadiene is
written C/C=C/C=C/C.
Beta-carotene, with the eleven double bonds highlighted.
As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C.
Configuration at tetrahedral carbon is specified by @ or @@.
Consider the four bonds in the order in which they appear, left to
right, in the SMILES form. Looking toward the central carbon from the
perspective of the first bond, the other three are either clockwise or
counter-clockwise. These cases are indicated with @@ and @, respectively (because the @ symbol itself is a counter-clockwise spiral).
L-Alanine
For example, consider the amino acid alanine. One of its SMILES forms is NC(C)C(=O)O, more fully written as N[CH](C)C(=O)O. L-Alanine, the more common enantiomer, is written as N[C@@H](C)C(=O)O. Looking from the nitrogen–carbon bond, the hydrogen (H), methyl (C), and carboxylate (C(=O)O) groups appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O.
While the order in which branches are specified in SMILES is
normally unimportant, in this case it matters; swapping any two groups
requires reversing the chirality indicator. If the branches are
reversed so alanine is written as NC(C(=O)O)C, then the configuration also reverses; L-alanine is written as N[C@H](C(=O)O)C. Other ways of writing it include C[C@H](N)C(=O)O, OC(=O)[C@@H](N)C and OC(=O)[C@H](C)N.
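The rule that swapping any two groups reverses the chirality indicator amounts to a parity check: if the new neighbour order is an odd permutation of the old one, @ and @@ are exchanged; if it is even, the indicator is kept. The sketch below assumes the four neighbours are given as distinct labels in the order they appear in the SMILES string (with the bracketed H listed immediately after the preceding neighbour); the names permutation_parity and retag are illustrative.

```python
def permutation_parity(old_order, new_order):
    """Return 0 if new_order is an even permutation of old_order, else 1."""
    positions = [old_order.index(item) for item in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2


def retag(tag, old_order, new_order):
    """Chirality indicator that keeps the same configuration after reordering."""
    if sorted(old_order) != sorted(new_order):
        raise ValueError("both orders must list the same neighbours")
    flipped = {"@": "@@", "@@": "@"}
    return flipped[tag] if permutation_parity(old_order, new_order) else tag
```

Starting from N[C@@H](C)C(=O)O with neighbour order N, H, CH3, COOH, reordering to N, H, COOH, CH3 is a single swap and yields @, in agreement with N[C@H](C(=O)O)C.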
Normally, the first of the four bonds appears to the left of the
carbon atom, but if the SMILES is written beginning with the chiral
carbon, such as C(C)(N)C(=O)O, then all four are to the right, but the first to appear (the [CH] bond in this case) is used as the reference to order the following three: L-alanine may also be written [C@@H](C)(N)C(=O)O.
The SMILES specification includes elaborations on the @ symbol to indicate stereochemistry around more complex chiral centers, such as trigonal bipyramidal molecular geometry.
Isotopes
Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
Note that % appears in front of the index of ring closure labels above 9; see § Rings above.
Other examples of SMILES
The
SMILES notation is described extensively in the SMILES theory manual
provided by Daylight Chemical Information Systems and a number of
illustrative examples are presented. Daylight's depict utility provides
users with the means to check their own examples of SMILES and is a
valuable educational tool.
Extensions
SMARTS
is a line notation for specification of substructural patterns in
molecules. While it uses many of the same symbols as SMILES, it also
allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database
searching. One common misconception is that SMARTS-based substructural
searching involves matching of SMILES and SMARTS strings. In fact, both
SMILES and SMARTS strings are first converted to internal graph
representations, which are then searched for subgraph isomorphism.
SMIRKS, a superset of
"reaction SMILES" and a subset of "reaction SMARTS", is a line notation
for specifying reaction transforms. The general syntax for the reaction
extensions is REACTANT>AGENT>PRODUCT (without spaces), where any of the fields can either be left blank or filled with multiple molecules delimited with a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. [C:1]) for mapping.
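The REACTANT>AGENT>PRODUCT layout can be taken apart with plain string handling. The sketch below is illustrative only: parse_reaction and atom_map_numbers are hypothetical helper names, the two reaction strings are made-up examples, and no chemical validation is attempted.

```python
import re


def parse_reaction(reaction):
    """Split a reaction string into lists of reactants, agents and products."""
    fields = reaction.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators")

    def molecules(field):
        # A blank field means "no molecules"; dots separate molecules.
        return field.split(".") if field else []

    reactants, agents, products = (molecules(f) for f in fields)
    return {"reactants": reactants, "agents": agents, "products": products}


def atom_map_numbers(fragment):
    """Collect the atom-map numbers (the N in [X:N]) found in a fragment."""
    return sorted({int(n) for n in re.findall(r"\[[^\]]*:(\d+)\]", fragment)})


if __name__ == "__main__":
    # Hypothetical examples: an acid-catalysed esterification and a mapped substitution.
    print(parse_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
    mapped = parse_reaction("[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]")
    print(mapped["agents"], atom_map_numbers(".".join(mapped["reactants"])))
```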
SMILES corresponds to discrete molecular structures. However, many
materials are macromolecules, which are too large (and often
stochastic) to conveniently generate SMILES for. BigSMILES is an
extension of SMILES that aims to provide an efficient representation
system for macromolecules.
Conversion
SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms.
This conversion is not always unambiguous. Conversion to
three-dimensional representation is achieved by energy-minimization
approaches. There are many downloadable and web-based conversion
utilities.
InChI identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information.
Not all layers have to be provided; for instance, the tautomer layer can
be omitted if that type of information is not relevant to the
particular application. The InChI algorithm converts input structural
information into a unique InChI identifier in a three-step process:
normalization (to remove redundant information), canonicalization (to
generate a unique number label for each atom), and serialization (to
give a string of characters).
InChIs differ from the widely used CAS registry numbers
in three respects: firstly, they are freely usable and non-proprietary;
secondly, they can be computed from structural information and do not
have to be assigned by some organization; and thirdly, most of the
information in an InChI is human readable (with practice). InChIs can
thus be seen as akin to a general and extremely formalized version of IUPAC names. They can express more information than the simpler SMILES
notation and differ in that every structure has a unique InChI string,
which is important in database applications. Information about the
3-dimensional coordinates of atoms is not represented in InChI; for this
purpose a format such as PDB can be used.
The InChIKey, sometimes referred to as a hashed InChI, is a fixed
length (27 character) condensed digital representation of the InChI
that is not human-understandable. The InChIKey specification was
released in September 2007 in order to facilitate web searches for
chemical compounds, since these were problematic with the full-length
InChI. Unlike the InChI, the InChIKey is not unique: collisions can be calculated to be very rare, but they do occur.
In January 2009 the 1.02 version of the InChI software was
released. This provided a means to generate so called standard InChI,
which does not allow for user selectable options in dealing with the
stereochemistry and tautomeric layers of the InChI string. The standard
InChIKey is then the hashed version of the standard InChI string. The
standard InChI will simplify comparison of InChI strings and keys
generated by different groups, and subsequently accessed via diverse
sources such as databases and web resources.
The continuing development of the standard has been supported since 2010 by the not-for-profit InChI Trust, of which IUPAC is a member. The current software version is 1.06 and was released in December 2020. Prior to 1.04, the software was freely available under the open-source LGPL license,
but it now uses a custom license called IUPAC-InChI Trust License.
Generation
In order to avoid generating different InChIs for tautomeric structures, before generating the
InChI, an input chemical structure is normalized to reduce it to its so-called core parent structure.
This may involve changing bond orders, rearranging formal charges and possibly adding
and removing protons. Different input structures may give the same result; for example,
acetic acid and acetate would both give the same core parent structure, that of acetic
acid. A core parent structure may be disconnected, consisting of more than one component,
in which case the
sublayers in the InChI usually consist of sublayers for each component, separated by semicolons
(periods for the chemical formula sublayer). One way this can happen is that all
metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead
will have five components, one for lead and four for the ethyl groups.
The first, main, layer of the InChI refers to this core parent structure, giving its
chemical formula, non-hydrogen connectivity without bond order (/c sublayer) and
hydrogen connectivity (/h sublayer). The /q portion of
the charge layer gives its charge, and the /p portion of the charge layer tells how many
protons (hydrogen ions) must be added to or removed from it to regenerate the original structure.
If present, the stereochemical layer, with sublayers /b, /t, /m and /s, gives stereochemical information, and
the isotopic layer
/i (which may contain sublayers /h, /b, /t, /m and /s) gives isotopic information.
These are the only layers which can occur in a standard InChI.
If the user wants to specify an exact tautomer, a fixed hydrogen layer /f
can be appended, which may contain
various additional sublayers; this cannot be done in standard InChI
though, so different tautomers will have
the same standard InChI (for example, alanine will give the same
standard InChI whether input in a neutral or a zwitterionic form).
Finally, a nonstandard reconnected /r layer can be added, which effectively gives a new InChI generated without
breaking bonds to metal atoms. This may contain various sublayers, including /f.
Every InChI starts with the string "InChI=" followed by the version number, currently 1. If the InChI is standard, the version number is followed by the letter S. The standard InChI
is a fully standardized InChI flavor maintaining the same level
of attention to structure details and the same conventions for drawing
perception. The remaining information is structured as a sequence of
layers and sub-layers, with each layer providing one specific type of
information. The layers and sub-layers are separated by the delimiter "/"
and start with a characteristic prefix letter (except for the chemical
formula sub-layer of the main layer). The six layers with important
sublayers are:
Main layer
  Chemical formula (no prefix). This is the only sublayer that must occur in every InChI. Numbers used throughout the InChI are given in the formula's element order excluding hydrogen atoms. For example, “/C10H16N5O13P3” implies that atoms numbered 1–10 are carbons, 11–15 are nitrogens, 16–28 are oxygens, and 29–31 are phosphorus.
  Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
  Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.
Charge layer
  Charge sublayer (prefix: "q")
  Proton sublayer (prefix: "p")
Stereochemical layer
  Double-bond stereochemistry (prefix: "b")
  Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
  Type of stereochemistry information (prefix: "s")
Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)
Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI
Reconnected layer (prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI
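The atom-numbering convention described for the chemical formula sublayer can be sketched in a few lines. This is an illustration under stated assumptions: heavy_atom_numbering is a hypothetical helper, it handles only a single-component formula (no period-separated components), and it reports only which range of numbers each element occupies, not which specific atom receives which number.

```python
import re


def heavy_atom_numbering(formula):
    """Map each non-hydrogen element to its (first, last) atom numbers."""
    ranges = {}
    next_number = 1
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            continue  # hydrogens are excluded from the numbering
        n = int(count) if count else 1
        ranges[symbol] = (next_number, next_number + n - 1)
        next_number += n
    return ranges


if __name__ == "__main__":
    print(heavy_atom_numbering("C10H16N5O13P3"))
```

For the formula used in the example above, this reproduces the ranges 1–10, 11–15, 16–28 and 29–31.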
The delimiter-prefix format has the advantage that a user can easily use a wildcard search to find identifiers that match only in certain layers.
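A wildcard search of this kind can be sketched with shell-style patterns from the Python standard library. The three InChI strings below are believed to be the standard InChIs of L-alanine, D-alanine and glycine, but they are included here only as illustrative data and should be checked against a trusted source before reuse; the helper name matching is hypothetical.

```python
from fnmatch import fnmatchcase

L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"


def matching(pattern, inchis):
    """Return the identifiers that match a shell-style wildcard pattern."""
    return [inchi for inchi in inchis if fnmatchcase(inchi, pattern)]


if __name__ == "__main__":
    records = [L_ALANINE, D_ALANINE, GLYCINE]
    # Same formula and connectivity, any hydrogen or stereochemical layers.
    print(matching("InChI=1S/C3H7NO2/c1-2(4)3(5)6/*", records))
    # Any structure whose /m sublayer is "1".
    print(matching("*/m1/*", records))
```

Shell-style patterns treat square brackets and question marks as special characters, so a pattern built from an InChI containing "?" (used for unknown stereochemistry) would need escaping.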
The condensed, 27 character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm), designed to allow for easy web searches of chemical compounds. The standard InChIKey is the hashed counterpart of standard InChI. Most chemical structures on the Web up to 2007 had been represented as GIF files,
which are not searchable for chemical content. The full InChI turned
out to be too lengthy for easy searching, and therefore the InChIKey was
developed. There is a very small, but nonzero chance of two different
molecules having the same InChIKey, but the probability for duplication
of only the first 14 characters has been estimated as only one
duplication in 75 databases each containing one billion unique
structures. With all databases currently having below 50 million
structures, such duplication appears unlikely at present. A more
extensive study of the collision rate found that the experimental
collision rate agrees with the theoretical expectations.
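The order of magnitude of this estimate can be checked with a birthday-problem approximation: among n structures there are n(n−1)/2 pairs, and each pair collides with probability 1/2^b for a uniformly distributed b-bit hash. The sketch below assumes that the first block of the key carries 65 bits of the hash; that figure is an assumption not stated in the text above, and the function name is hypothetical.

```python
def expected_collisions(n_structures, hash_bits):
    """Expected number of colliding pairs for a uniform hash of `hash_bits` bits."""
    pairs = n_structures * (n_structures - 1) // 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    per_database = expected_collisions(10**9, 65)  # one billion structures
    print(f"expected collisions per database: {per_database:.4f}")
    print(f"databases per expected collision: {1 / per_database:.1f}")
```

Under these assumptions the result is roughly one expected collision per 70 to 80 databases of one billion structures, in line with the estimate quoted above.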
The InChIKey currently consists of three parts separated by hyphens, of 14, 10 and one character(s), respectively, like XXXXXXXXXXXXXX-YYYYYYYYFV-P. The first 14 characters result from a SHA-256 hash of the connectivity information (the main layer and /q
sublayer of the charge layer) of the InChI. The second part consists of
8 characters resulting from a hash of the remaining layers of the
InChI, a single character indicating the kind of InChIKey (S for standard and N for nonstandard), and a character indicating the version of InChI used (currently A for version 1).
Finally, the single character at the end indicates the protonation of the core parent structure, corresponding to the /p sublayer of the charge layer (N for no protonation; O, P, ... if protons should be added; and M, L, ... if they should be removed).
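The three-part layout can be checked and decoded with a regular expression. The following sketch follows only the description above; parse_inchikey and the dictionary keys are hypothetical names, and the proton count is read as the letter's distance from N (O = +1, M = −1, and so on), which is an interpretation of the text rather than a quotation from the specification. The key used in the example is the morphine key quoted in this article.

```python
import re

# 14 letters, hyphen, 8 letters + kind flag + version flag, hyphen, 1 letter.
_KEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key):
    """Split an InChIKey into the fields described in the text."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, remainder, kind, version, protonation = match.groups()
    return {
        "connectivity_hash": connectivity,
        "remaining_layers_hash": remainder,
        "standard": kind == "S",
        "version_flag": version,
        # N means no (de)protonation; later letters add protons, earlier remove them.
        "proton_offset": ord(protonation) - ord("N"),
    }


if __name__ == "__main__":
    print(parse_inchikey("BQJCRHHNABKAKU-KBQPJGBKSA-N"))
```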
Example
Morphine structure
Morphine has the structure shown on the right. The standard InChI for morphine is InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1
and the standard InChIKey for morphine is BQJCRHHNABKAKU-KBQPJGBKSA-N.
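The morphine identifier above can be split into its layers by cutting at the "/" delimiter and reading the prefix letter of each part. This is a minimal sketch with a hypothetical helper name (split_inchi); it performs no validation beyond the "InChI=" prefix and returns the sublayers as an ordered list of pairs, since the same prefix letter may occur more than once in a nonstandard or isotopic InChI.

```python
MORPHINE = (
    "InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)"
    "4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
    "/t10-,11+,13-,16-,17-/m0/s1"
)


def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed sublayers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("an InChI must start with 'InChI='")
    version, formula, *rest = inchi[len(prefix):].split("/")
    return {
        "version": version.rstrip("S"),
        "standard": version.endswith("S"),
        "formula": formula,  # the only sublayer without a prefix letter
        "layers": [(part[0], part[1:]) for part in rest if part],
    }


if __name__ == "__main__":
    parsed = split_inchi(MORPHINE)
    print(parsed["version"], parsed["standard"], parsed["formula"])
    for letter, content in parsed["layers"]:
        print(f"/{letter}: {content}")
```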
InChI resolvers
As
the InChI cannot be reconstructed from the InChIKey, an InChIKey always
needs to be linked to the original InChI to get back to the original
structure. InChI Resolvers act as a lookup service to make these links,
and prototype services are available from the National Cancer Institute, the UniChem service at the European Bioinformatics Institute, and PubChem. ChemSpider had a resolver until July 2015, when it was decommissioned.
Name
The format
was originally called IChI (IUPAC Chemical Identifier), then renamed in
July 2004 to INChI (IUPAC-NIST Chemical Identifier), and renamed again
in November 2004 to InChI (IUPAC International Chemical Identifier), a
trademark of IUPAC.
Continuing development
Scientific
direction of the InChI standard is carried out by the IUPAC Division
VIII Subcommittee, and funding of subgroups investigating and defining
the expansion of the standard is carried out by both IUPAC and the InChI Trust. The InChI Trust funds the development, testing and documentation of the InChI. Current extensions are being defined to handle polymers and mixtures, Markush structures, reactions and organometallics, and once accepted by the Division VIII Subcommittee will be added to the algorithm.
Software
The
InChI Trust has developed software to generate the InChI, InChIKey and
other identifiers. The release history of this software follows.
Software and version | Date | License | Comments
InChI v. 1.02 | Jan. 2009 | | Changed format for InChIKey. Introduces standard InChI.
InChI v. 1.03 | June 2010 | LGPL 2.1 |
InChI v. 1.03 source code docs | March 2011 | |
InChI v. 1.04 | Sep. 2011 | IUPAC/InChI Trust InChI Licence 1.0 | New license. Support for elements 105-112 added. CML support removed.
InChI v. 1.05 | Jan. 2017 | IUPAC/InChI Trust InChI Licence 1.0 | Support for elements 113-118 added. Experimental polymer support. Experimental large molecule support.
RInChI v. 1.00 | March 2017 | IUPAC/InChI Trust InChI Licence 1.0, and BSD-style | Computes reaction InChIs.
InChI v. 1.06 | Dec. 2020 | IUPAC/InChI Trust InChI Licence 1.0 | Revised polymer support.
Adoption
The InChI has been adopted by many larger and smaller databases, including ChemSpider, ChEMBL, Golm Metabolome Database, OpenPHACTS, and PubChem.
However, the adoption is not straightforward, and many databases show a
discrepancy between the chemical structures and the InChI they contain,
which is a problem for linking databases.