Wednesday, January 2, 2019

Pharmaceutical industry (updated)

From Wikipedia, the free encyclopedia

Glivec, a drug used in the treatment of several cancers, is marketed by Novartis, one of the world's major pharmaceutical companies.

The pharmaceutical industry discovers, develops, produces, and markets drugs or pharmaceutical drugs for use as medications to be administered (or self-administered) to patients to cure them, vaccinate them, or alleviate a symptom. Pharmaceutical companies may deal in generic or brand medications and medical devices. They are subject to a variety of laws and regulations that govern the patenting, testing, safety, efficacy and marketing of drugs.

History

Mid-1800s – 1945: From botanicals to the first synthetic drugs

The modern pharmaceutical industry traces its roots to two sources. The first was the local apothecaries that expanded from their traditional role of distributing botanical drugs such as morphine and quinine to wholesale manufacture in the mid-1800s. Rational drug discovery from plants began in earnest with the isolation from opium of morphine, an analgesic and sleep-inducing agent, by the German apothecary's assistant Friedrich Sertürner, who named the compound after the Greek god of dreams, Morpheus. By the late 1880s, German dye manufacturers had perfected the purification of individual organic compounds from tar and other mineral sources and had also established rudimentary methods of organic chemical synthesis. The development of synthetic chemical methods allowed scientists to systematically vary the structure of chemical substances, and growth in the emerging science of pharmacology expanded their ability to evaluate the biological effects of these structural changes.

Epinephrine, norepinephrine, and amphetamine

By the 1890s, the profound effect of adrenal extracts on many different tissue types had been discovered, setting off a search for the mechanism of chemical signalling and efforts to exploit these observations in the development of new drugs. The blood-pressure-raising and vasoconstrictive effects of adrenal extracts were of particular interest to surgeons as hemostatic agents and as treatments for shock, and a number of companies developed products based on adrenal extracts containing varying purities of the active substance. In 1897, John Abel of Johns Hopkins University identified the active principle as epinephrine, which he isolated in an impure state as the sulfate salt. The industrial chemist Jokichi Takamine later developed a method for obtaining epinephrine in a pure state and licensed the technology to Parke-Davis, which marketed epinephrine under the trade name Adrenalin. Injected epinephrine proved to be especially efficacious for the acute treatment of asthma attacks, and an inhaled version was sold in the United States until 2011 (Primatene Mist). By 1929 epinephrine had been formulated into an inhaler for use in the treatment of nasal congestion.

While highly effective, the requirement for injection limited the use of epinephrine, and orally active derivatives were sought. A structurally similar compound, ephedrine (actually more similar to norepinephrine), was identified by Japanese chemists in the Ma Huang plant and marketed by Eli Lilly as an oral treatment for asthma. Following the work of Henry Dale and George Barger at Burroughs-Wellcome, the academic chemist Gordon Alles synthesized amphetamine and tested it in asthma patients in 1929. The drug proved to have only modest anti-asthma effects, but produced sensations of exhilaration and palpitations. Amphetamine was developed by Smith, Kline and French as a nasal decongestant under the trade name Benzedrine Inhaler. Amphetamine was eventually developed for the treatment of narcolepsy, post-encephalitic parkinsonism, and mood elevation in depression and other psychiatric indications. It received approval as a New and Nonofficial Remedy from the American Medical Association for these uses in 1937 and remained in common use for depression until the development of tricyclic antidepressants in the 1960s.

Discovery and development of the barbiturates

Diethylbarbituric acid was the first marketed barbiturate. It was sold by Bayer under the trade name Veronal.
 
In 1903, Hermann Emil Fischer and Joseph von Mering disclosed their discovery that diethylbarbituric acid, formed from the reaction of diethylmalonic acid, phosphorus oxychloride, and urea, induces sleep in dogs. The discovery was patented and licensed to Bayer Pharmaceuticals, which marketed the compound under the trade name Veronal as a sleep aid beginning in 1904. Systematic investigation of the effect of structural changes on potency and duration of action led to the discovery of phenobarbital at Bayer in 1911 and the discovery of its potent anti-epileptic activity in 1912. Phenobarbital was among the most widely used drugs for the treatment of epilepsy through the 1970s and, as of 2014, remains on the World Health Organization's list of essential medicines. The 1950s and 1960s saw increased awareness of the addictive properties and abuse potential of barbiturates and amphetamines, which led to increasing restrictions on their use and growing government oversight of prescribers. Today, amphetamine is largely restricted to use in the treatment of attention deficit disorder and phenobarbital to the treatment of epilepsy.

Insulin

A series of experiments performed from the late 1800s to the early 1900s revealed that diabetes is caused by the absence of a substance normally produced by the pancreas. In 1889, Oskar Minkowski and Joseph von Mering found that diabetes could be induced in dogs by surgical removal of the pancreas. In 1921, Canadian professor Frederick Banting and his student Charles Best repeated this study and found that injections of pancreatic extract reversed the symptoms produced by pancreas removal. Soon, the extract was demonstrated to work in people, but development of insulin therapy as a routine medical procedure was delayed by difficulties in producing the material in sufficient quantity and with reproducible purity. The researchers sought assistance from industrial collaborators at Eli Lilly and Co. based on the company's experience with large-scale purification of biological materials. Chemist George B. Walden of Eli Lilly and Company found that careful adjustment of the pH of the extract allowed a relatively pure grade of insulin to be produced. Under pressure from the University of Toronto and a potential patent challenge by academic scientists who had independently developed a similar purification method, an agreement was reached for non-exclusive production of insulin by multiple companies. Prior to the discovery and widespread availability of insulin therapy, the life expectancy of diabetics was only a few months.

Early anti-infective research: Salvarsan, Prontosil, Penicillin and vaccines

The development of drugs for the treatment of infectious diseases was a major focus of early research and development efforts; in 1900 pneumonia, tuberculosis, and diarrhea were the three leading causes of death in the United States and mortality in the first year of life exceeded 10%.

In 1911 arsphenamine, the first synthetic anti-infective drug, was developed by Paul Ehrlich and chemist Alfred Bertheim of the Institute of Experimental Therapy in Berlin. The drug was given the commercial name Salvarsan. Ehrlich, noting both the general toxicity of arsenic and the selective absorption of certain dyes by bacteria, hypothesized that an arsenic-containing dye with similar selective absorption properties could be used to treat bacterial infections. Arsphenamine was prepared as part of a campaign to synthesize a series of such compounds, and found to exhibit partially selective toxicity. Arsphenamine proved to be the first effective treatment for syphilis, a disease which prior to that time was incurable and led inexorably to severe skin ulceration, neurological damage, and death.

Ehrlich's approach of systematically varying the chemical structure of synthetic compounds and measuring the effects of these changes on biological activity was pursued broadly by industrial scientists, including the Bayer scientists Josef Klarer, Fritz Mietzsch, and Gerhard Domagk. This work, also based on the testing of compounds available from the German dye industry, led to the development of Prontosil, the first representative of the sulfonamide class of antibiotics. Compared to arsphenamine, the sulfonamides had a broader spectrum of activity and were far less toxic, rendering them useful for infections caused by pathogens such as streptococci. In 1939, Domagk received the Nobel Prize in Physiology or Medicine for this discovery. Nonetheless, the dramatic decrease in deaths from infectious diseases that occurred prior to World War II was primarily the result of improved public health measures such as clean water and less crowded housing, and the impact of anti-infective drugs and vaccines became significant mainly after World War II.

In 1928, Alexander Fleming discovered the antibacterial effects of penicillin, but its exploitation for the treatment of human disease awaited the development of methods for its large scale production and purification. These were developed by a U.S. and British government-led consortium of pharmaceutical companies during the Second World War.

Early progress toward the development of vaccines occurred throughout this period, primarily in the form of academic and government-funded basic research directed toward the identification of the pathogens responsible for common communicable diseases. In 1885 Louis Pasteur and Pierre Paul Émile Roux created the first rabies vaccine. The first diphtheria vaccines were produced in 1914 from a mixture of diphtheria toxin and antitoxin (produced from the serum of an inoculated animal), but the safety of the inoculation was marginal and it was not widely used. The United States recorded 206,000 cases of diphtheria in 1921, resulting in 15,520 deaths. In 1923 parallel efforts by Gaston Ramon at the Pasteur Institute and Alexander Glenny at the Wellcome Research Laboratories (later part of GlaxoSmithKline) led to the discovery that a safer vaccine could be produced by treating diphtheria toxin with formaldehyde. In 1944, Maurice Hilleman of Squibb Pharmaceuticals developed the first vaccine against Japanese encephalitis. Hilleman would later move to Merck, where he would play a key role in the development of vaccines against measles, mumps, chickenpox, rubella, hepatitis A, hepatitis B, and meningitis.

Unsafe drugs and early industry regulation

In 1937 over 100 people died after ingesting a solution of the antibacterial sulfanilamide formulated in the toxic solvent diethylene glycol
 
Prior to the 20th century, drugs were generally produced by small-scale manufacturers with little regulatory control over manufacturing or claims of safety and efficacy. To the extent that such laws did exist, enforcement was lax. In the United States, increased regulation of vaccines and other biological drugs was spurred by tetanus outbreaks and deaths caused by the distribution of contaminated smallpox vaccine and diphtheria antitoxin. The Biologics Control Act of 1902 required that the federal government grant premarket approval for every biological drug and for the process and facility producing such drugs. This was followed in 1906 by the Pure Food and Drugs Act, which forbade the interstate distribution of adulterated or misbranded foods and drugs. A drug was considered misbranded if it contained alcohol, morphine, opium, cocaine, or any of several other potentially dangerous or addictive drugs and its label failed to indicate the quantity or proportion of such drugs. The government's attempts to use the law to prosecute manufacturers for making unsupported claims of efficacy were undercut by a Supreme Court ruling restricting the federal government's enforcement powers to cases of incorrect specification of the drug's ingredients.

In 1937 over 100 people died after ingesting "Elixir Sulfanilamide" manufactured by S.E. Massengill Company of Tennessee. The product was formulated in diethylene glycol, a highly toxic solvent that is now widely used as antifreeze. Under the laws extant at that time, prosecution of the manufacturer was possible only under the technicality that the product had been called an "elixir", which literally implied a solution in ethanol. In response to this episode, the U.S. Congress passed the Federal Food, Drug, and Cosmetic Act of 1938, which for the first time required pre-market demonstration of safety before a drug could be sold, and explicitly prohibited false therapeutic claims.

The post-war years, 1945–1970

Further advances in anti-infective research

The aftermath of World War II saw an explosion in the discovery of new classes of antibacterial drugs, including the cephalosporins (developed by Eli Lilly based on the seminal work of Giuseppe Brotzu and Edward Abraham), streptomycin (discovered during a Merck-funded research program in Selman Waksman's laboratory), the tetracyclines (discovered at Lederle Laboratories, now a part of Pfizer), and erythromycin (discovered at Eli Lilly and Co.), and their extension to an increasingly wide range of bacterial pathogens. Streptomycin, discovered in Waksman's laboratory at Rutgers in 1943, became the first effective treatment for tuberculosis. At the time of its discovery, sanatoriums for the isolation of tuberculosis-infected people were a ubiquitous feature of cities in developed countries, with 50% of those admitted dying within five years.

A Federal Trade Commission report issued in 1958 attempted to quantify the effect of antibiotic development on American public health. The report found that over the period 1946-1955, there was a 42% drop in the incidence of diseases for which antibiotics were effective and only a 20% drop in those for which antibiotics were not effective. The report concluded that "it appears that the use of antibiotics, early diagnosis, and other factors have limited the epidemic spread and thus the number of these diseases which have occurred". The study further examined mortality rates for eight common diseases for which antibiotics offered effective therapy, including syphilis, tuberculosis, dysentery, scarlet fever, whooping cough, meningococcal infections, and pneumonia, and found a 56% decline over the same period. Notable among these was a 75% decline in deaths due to tuberculosis.

Measles cases reported in the United States before and after introduction of the vaccine in 1963: annual cases followed a highly variable epidemic pattern of 150,000-850,000 per year from 1944 to 1964, fell to fewer than 25,000 by 1968, spiked to 75,000 and 57,000 during outbreaks around 1971 and 1977, held at a few thousand per year until an outbreak of 28,000 in 1990, and then declined from a few hundred per year in the early 1990s to a few dozen in the 2000s.
 
Percent of the United States population surviving by age in 1900, 1950, and 1997.
 
During the years 1940-1955, the rate of decline in the U.S. death rate accelerated from 2% per year to 8% per year before returning to the historical rate of 2% per year. The dramatic decline in the immediate post-war years has been attributed to the rapid development of new treatments and vaccines for infectious disease that occurred during these years. Vaccine development continued to accelerate, with the most notable achievement of the period being Jonas Salk's 1954 development of the polio vaccine, funded by the non-profit National Foundation for Infantile Paralysis. The vaccine process was never patented but was instead given to pharmaceutical companies to manufacture as a low-cost generic. In 1960 Maurice Hilleman of Merck Sharp & Dohme identified the SV40 virus, which was later shown to cause tumors in many mammalian species. It was later determined that SV40 was present as a contaminant in polio vaccine lots that had been administered to 90% of the children in the United States. The contamination appears to have originated both in the original cell stock and in monkey tissue used for production. In 2004 the United States National Cancer Institute announced that it had concluded that SV40 is not associated with cancer in people.

Other notable new vaccines of the period include those for measles (1962, John Franklin Enders of Children's Medical Center Boston, later refined by Maurice Hilleman at Merck), rubella (1969, Hilleman, Merck), and mumps (1967, Hilleman, Merck). The United States incidences of rubella, congenital rubella syndrome, measles, and mumps all fell by more than 95% in the immediate aftermath of widespread vaccination. The first 20 years of licensed measles vaccination in the U.S. prevented an estimated 52 million cases of the disease, 17,400 cases of mental retardation, and 5,200 deaths.

Development and marketing of antihypertensive drugs

Hypertension is a risk factor for atherosclerosis, heart failure, coronary artery disease, stroke, renal disease, and peripheral arterial disease, and is the most important risk factor for cardiovascular morbidity and mortality in industrialized countries. Prior to 1940 approximately 23% of all deaths among persons over age 50 were attributed to hypertension. Severe cases of hypertension were treated by surgery.

Early developments in the field of treating hypertension included quaternary ammonium ion sympathetic nervous system blocking agents, but these compounds were never widely used due to their severe side effects, because the long term health consequences of high blood pressure had not yet been established, and because they had to be administered by injection. 

In 1952 researchers at Ciba discovered the first orally available vasodilator, hydralazine. A major shortcoming of hydralazine monotherapy was that it lost its effectiveness over time (tachyphylaxis). In the mid-1950s Karl H. Beyer, James M. Sprague, John E. Baer, and Frederick C. Novello of Merck and Co. discovered and developed chlorothiazide, which remains the most widely used antihypertensive drug today. This development was associated with a substantial decline in the mortality rate among people with hypertension. The inventors were recognized by a Public Health Lasker Award in 1975 for "the saving of untold thousands of lives and the alleviation of the suffering of millions of victims of hypertension".

A 2009 Cochrane review concluded that thiazide antihypertensive drugs reduce the risk of death (RR 0.89), stroke (RR 0.63), coronary heart disease (RR 0.84), and cardiovascular events (RR 0.70) in people with high blood pressure. In the ensuing years other classes of antihypertensive drug were developed and found wide acceptance in combination therapy, including loop diuretics (Lasix/furosemide, Hoechst Pharmaceuticals, 1963), beta blockers (ICI Pharmaceuticals, 1964), ACE inhibitors, and angiotensin receptor blockers. ACE inhibitors reduce the risk of new-onset kidney disease (RR 0.71) and death (RR 0.84) in diabetic patients, irrespective of whether they have hypertension.

Oral contraceptives

Prior to the Second World War, birth control was prohibited in many countries, and in the United States even the discussion of contraceptive methods sometimes led to prosecution under Comstock laws. The history of the development of oral contraceptives is thus closely tied to the birth control movement and the efforts of activists Margaret Sanger, Mary Dennett, and Emma Goldman. Based on fundamental research performed by Gregory Pincus and synthetic methods for progesterone developed by Carl Djerassi at Syntex and by Frank Colton at G.D. Searle & Co., the first oral contraceptive, Enovid, was developed by G.D. Searle and Co. and approved by the FDA in 1960. The original formulation incorporated vastly excessive doses of hormones and caused severe side effects. Nonetheless, by 1962, 1.2 million American women were on the pill, and by 1965 the number had increased to 6.5 million. The availability of a convenient form of temporary contraceptive led to dramatic changes in social mores, including expanding the range of lifestyle options available to women, reducing the reliance of women on men for contraceptive practice, encouraging the delay of marriage, and increasing pre-marital co-habitation.

Thalidomide and the Kefauver-Harris Amendments

Malformation of a baby born to a mother who had taken thalidomide while pregnant.
 
In the U.S., a push for revisions of the FD&C Act emerged from Congressional hearings led by Senator Estes Kefauver of Tennessee in 1959. The hearings covered a wide range of policy issues, including advertising abuses, questionable efficacy of drugs, and the need for greater regulation of the industry. While momentum for new legislation temporarily flagged under extended debate, a new tragedy emerged that underscored the need for more comprehensive regulation and provided the driving force for the passage of new laws. 

On 12 September 1960, an American licensee, the William S. Merrell Company of Cincinnati, submitted a new drug application for Kevadon (thalidomide), a sedative that had been marketed in Europe since 1956. The FDA medical officer in charge of reviewing the compound, Frances Kelsey, believed that the data supporting the safety of thalidomide were incomplete. The firm continued to pressure Kelsey and the FDA to approve the application until November 1961, when the drug was pulled off the German market because of its association with grave congenital abnormalities. Several thousand newborns in Europe and elsewhere suffered the teratogenic effects of thalidomide. Without approval from the FDA, the firm distributed Kevadon to over 1,000 physicians in the United States under the guise of investigational use. Over 20,000 Americans received thalidomide in this "study", including 624 pregnant patients, and about 17 newborns are known to have suffered the effects of the drug.

The thalidomide tragedy resurrected Kefauver's bill to enhance drug regulation, which had stalled in Congress, and the Kefauver-Harris Amendment became law on 10 October 1962. Manufacturers henceforth had to prove to the FDA that their drugs were effective as well as safe before they could go on the US market. The FDA also received authority to regulate the advertising of prescription drugs and to establish good manufacturing practices. The law required that all drugs introduced between 1938 and 1962 be shown to be effective. A collaborative study by the FDA and the National Academy of Sciences showed that nearly 40 percent of these products were not effective. A similarly comprehensive study of over-the-counter products began ten years later.

1970–1980s

Statins

In 1971, Akira Endo, a Japanese biochemist working for the pharmaceutical company Sankyo, identified mevastatin (ML-236B), a molecule produced by the fungus Penicillium citrinum, as an inhibitor of HMG-CoA reductase, a critical enzyme used by the body to produce cholesterol. Animal studies showed very good inhibitory effects, as did early clinical trials; however, a long-term study in dogs found toxic effects at higher doses, and as a result mevastatin was believed to be too toxic for human use. It was never marketed because of its adverse effects in laboratory dogs, which included tumors, muscle deterioration, and sometimes death.

P. Roy Vagelos, chief scientist and later CEO of Merck & Co, was interested, and made several trips to Japan starting in 1975. By 1978, Merck had isolated lovastatin (mevinolin, MK803) from the fungus Aspergillus terreus, first marketed in 1987 as Mevacor.

In April 1994, the results of a Merck-sponsored study, the Scandinavian Simvastatin Survival Study, were announced. Researchers had tested simvastatin, later sold by Merck as Zocor, on 4,444 patients with high cholesterol and heart disease. After five years, the study concluded that the patients saw a 35% reduction in their cholesterol and that their chances of dying of a heart attack were reduced by 42%. In 1995, Zocor and Mevacor both made Merck over US$1 billion. Endo was awarded the 2006 Japan Prize and the 2008 Lasker-DeBakey Clinical Medical Research Award for his "pioneering research into a new class of molecules" for "lowering cholesterol".

Research and development

Drug discovery is the process by which potential drugs are discovered or designed. In the past most drugs have been discovered either by isolating the active ingredient from traditional remedies or by serendipitous discovery. Modern biotechnology often focuses on understanding the metabolic pathways related to a disease state or pathogen, and manipulating these pathways using molecular biology or biochemistry. A great deal of early-stage drug discovery has traditionally been carried out by universities and research institutions.

Drug development refers to activities undertaken after a compound is identified as a potential drug in order to establish its suitability as a medication. Objectives of drug development are to determine appropriate formulation and dosing, as well as to establish safety. Research in these areas generally includes a combination of in vitro studies, in vivo studies, and clinical trials. The cost of late stage development has meant it is usually done by the larger pharmaceutical companies.

Often, large multinational corporations exhibit vertical integration, participating in a broad range of drug discovery and development, manufacturing and quality control, marketing, sales, and distribution. Smaller organizations, on the other hand, often focus on a specific aspect such as discovering drug candidates or developing formulations. Often, collaborative agreements between research organizations and large pharmaceutical companies are formed to explore the potential of new drug substances. More recently, multi-nationals are increasingly relying on contract research organizations to manage drug development.

The cost of innovation

Drug discovery and development is very expensive; of all compounds investigated for use in humans, only a small fraction are eventually approved in most nations by government-appointed medical institutions or boards, which must approve new drugs before they can be marketed in those countries. In 2010 the FDA approved 18 NMEs (new molecular entities) and three biologics, 21 in total, down from 26 in 2009 and 24 in 2008. On the other hand, there were only 18 approvals in total in 2007 and 22 in 2006. Since 2001, the Center for Drug Evaluation and Research has averaged 22.9 approvals a year. Approval comes only after heavy investment in pre-clinical development and clinical trials, as well as a commitment to ongoing safety monitoring. Drugs which fail part-way through this process often incur large costs while generating no revenue in return. If the cost of these failed drugs is taken into account, the cost of developing a successful new drug (new chemical entity, or NCE) has been estimated at about US$1.3 billion (not including marketing expenses). Professors Light and Lexchin reported in 2012, however, that the rate of approval of new drugs has held at a relatively stable average of 15 to 25 per year for decades.

Industry-wide research and investment reached a record $65.3 billion in 2009. While the cost of research in the U.S. grew by about $34.2 billion between 1995 and 2010, revenues rose faster, growing by $200.4 billion in that time.

A study by the consulting firm Bain & Company reported that the cost of discovering, developing, and launching a new drug (which factored in marketing and other business expenses, along with the prospective drugs that fail) rose over a five-year period to nearly $1.7 billion in 2003. According to Forbes, by 2010 development costs were between $4 billion and $11 billion per drug.

Some of these estimates also take into account the opportunity cost of investing capital many years before revenues are realized. Because of the very long time needed for discovery, development, and approval of pharmaceuticals, these costs can accumulate to nearly half the total expense. A direct consequence within the pharmaceutical industry value chain is that major pharmaceutical multinationals tend to increasingly outsource risks related to fundamental research, which somewhat reshapes the industry ecosystem with biotechnology companies playing an increasingly important role, and overall strategies being redefined accordingly. Some approved drugs, such as those based on re-formulation of an existing active ingredient (also referred to as Line-extensions) are much less expensive to develop.

Controversies

Due to repeated accusations and findings that some clinical trials conducted or funded by pharmaceutical companies may report only positive results for the preferred medication, the industry has been looked at much more closely by independent groups and government agencies.

In response to specific cases in which unfavorable data from pharmaceutical company-sponsored research were not published, the Pharmaceutical Research and Manufacturers of America published new guidelines urging companies to report all findings and to limit researchers' financial involvement in drug companies. The U.S. Congress passed a bill, signed into law, which requires phase II and phase III clinical trials to be registered by the sponsor on the clinicaltrials.gov website run by the NIH.

Drug researchers not directly employed by pharmaceutical companies often look to companies for grants, and companies often look to researchers for studies that will make their products look favorable. Sponsored researchers are rewarded by drug companies, for example with support for their conference/symposium costs. Lecture scripts and even journal articles presented by academic researchers may actually be "ghost-written" by pharmaceutical companies.

An investigation by ProPublica found that at least 21 doctors have been paid more than $500,000 for speeches and consulting by drug manufacturers since 2009, with half of the top earners working in psychiatry, and about $2 billion in total paid to doctors for such services. AstraZeneca, Johnson & Johnson and Eli Lilly have paid billions of dollars in federal settlements over allegations that they paid doctors to promote drugs for unapproved uses. Some prominent medical schools have since tightened rules on faculty acceptance of such payments by drug companies.

In contrast to this viewpoint, an article and associated editorial in the New England Journal of Medicine in May 2015 emphasized the importance of pharmaceutical industry-physician interactions for the development of novel treatments, and argued that moral outrage over industry malfeasance had unjustifiably led many to overemphasize the problems created by financial conflicts of interest. The article noted that major healthcare organizations such as the National Center for Advancing Translational Sciences of the National Institutes of Health, the President's Council of Advisors on Science and Technology, the World Economic Forum, the Gates Foundation, the Wellcome Trust, and the Food and Drug Administration had encouraged greater interactions between physicians and industry in order to bring greater benefits to patients.

Product approval

In the United States, new pharmaceutical products must be approved by the Food and Drug Administration (FDA) as being both safe and effective. This process generally involves submission of an Investigational New Drug (IND) application with sufficient pre-clinical data to support proceeding with human trials. Following IND approval, three phases of progressively larger human clinical trials may be conducted. Phase I generally studies toxicity using healthy volunteers. Phase II can include pharmacokinetics and dosing in patients, and Phase III is a very large study of efficacy in the intended patient population. Following the successful completion of Phase III testing, a New Drug Application is submitted to the FDA. The FDA reviews the data and, if the product is seen as having a positive benefit-risk assessment, grants approval to market the product in the US.

A fourth phase of post-approval surveillance is also often required, because even the largest clinical trials cannot effectively predict the prevalence of rare side effects. Postmarketing surveillance ensures that the safety of a drug is monitored closely after it reaches the market. In certain instances, its indication may need to be limited to particular patient groups, and in others the substance is withdrawn from the market completely.

The FDA provides information about approved drugs at the Orange Book site.

In the UK, the Medicines and Healthcare products Regulatory Agency approves drugs for use, though the evaluation is done by the European Medicines Agency, an agency of the European Union based in London. Approval in the UK and other European countries normally comes later than in the USA. The National Institute for Health and Care Excellence (NICE), for England and Wales, then decides if and how the National Health Service (NHS) will allow (in the sense of paying for) their use. The British National Formulary is the core guide for pharmacists and clinicians.

In many non-US western countries, a 'fourth hurdle' of cost-effectiveness analysis has developed that must be cleared before new technologies can be provided. This analysis focuses on the efficiency (in terms of the cost per QALY, or quality-adjusted life year) of the technologies in question rather than on their efficacy alone. In England and Wales NICE decides whether and in what circumstances drugs and technologies will be made available by the NHS, whilst similar arrangements exist with the Scottish Medicines Consortium in Scotland and the Pharmaceutical Benefits Advisory Committee in Australia. A product must pass the cost-effectiveness threshold if it is to be approved. Treatments must represent 'value for money' and a net benefit to society.
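As a rough illustration of the cost-per-QALY comparison behind this 'fourth hurdle', here is a minimal Python sketch; the treatment costs, QALY figures, and function name are hypothetical, and the NICE threshold range is mentioned only as an indicative reference.

```python
def icer(cost_new, qaly_new, cost_old, qaly_old):
    """Incremental cost-effectiveness ratio: the extra cost per extra
    quality-adjusted life year (QALY) gained when a new treatment is
    compared with standard care."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)


# Hypothetical example: the new drug costs 15,000 more per patient and yields
# 0.5 additional QALYs, i.e. 30,000 per QALY gained. Whether that clears the
# 'fourth hurdle' depends on the agency's threshold (NICE has historically
# used roughly £20,000-£30,000 per QALY as a reference range).
print(icer(cost_new=40_000, qaly_new=2.0, cost_old=25_000, qaly_old=1.5))  # 30000.0
```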

Orphan drugs

There are special rules for certain rare diseases ("orphan diseases") in several major drug regulatory territories. For example, diseases involving fewer than 200,000 patients in the United States, or larger populations in certain circumstances, are subject to the Orphan Drug Act. Because medical research and development of drugs to treat such diseases is financially disadvantageous, companies that do so are rewarded with tax reductions, fee waivers, and market exclusivity on that drug for a limited time (seven years), regardless of whether the drug is protected by patents.

Global sales

Top 25 Drug Companies by Sales (2018)

Company | Country | Pharma sales ($ billion)
Pfizer | United States | 53,370
GlaxoSmithKline | United Kingdom | 38,30
Sanofi-Aventis | France | 40,14
Roche | Switzerland | 27,290
AstraZeneca | United Kingdom/Sweden | 21,45
Johnson & Johnson | United States | 81,38
Novartis | Switzerland | 52,67
Merck & Co | United States | 41,37
Wyeth | United States | 1,68
Lilly | United States | 24,28
Bristol-Myers Squibb | United States | 22,04
Boehringer Ingelheim | Germany | 13,860
Amgen | United States | 23,32
Abbott Laboratories | United States | 30,40
Bayer | Germany | 10,162
Takeda | Japan | 8,716
Schering-Plough | United States | 8,561
Teva | Israel | 19,75
Genentech | United States | 7,640
Astellas | Japan | 7,390
Novo Nordisk | Denmark | 7,087
Daiichi Sankyo | Japan | 6,790
Baxter International | United States | 6,461
Merck KGaA | Germany | 5,643
Eisai | Japan | 4,703

In 2011, global spending on prescription drugs topped $954 billion, even as growth slowed somewhat in Europe and North America. The United States accounts for more than a third of the global pharmaceutical market, with $340 billion in annual sales, followed by the EU and Japan. Emerging markets such as China, Russia, South Korea, and Mexico outpaced that market, growing 81 percent.

The top ten best-selling drugs of 2013 totaled $75.6 billion in sales, with the anti-inflammatory drug Humira the best-selling drug worldwide at $10.7 billion. The second and third best-selling were Enbrel and Remicade, respectively. The top three best-selling drugs in the United States in 2013 were Abilify ($6.3 billion), Nexium ($6 billion), and Humira ($5.4 billion). The best-selling drug ever, Lipitor, averaged $13 billion annually and netted $141 billion over its lifetime before Pfizer's patent expired in November 2011.

IMS Health published an analysis of trends expected in the pharmaceutical industry in 2007, including increasing profits in most sectors despite the loss of some patents, and new 'blockbuster' drugs on the horizon.

Patents and generics

Depending on a number of considerations, a company may apply for and be granted a patent for the drug, or the process of producing the drug, granting exclusivity rights typically for about 20 years. However, only after rigorous study and testing, which takes 10 to 15 years on average, will governmental authorities grant permission for the company to market and sell the drug. Patent protection enables the owner of the patent to recover the costs of research and development through high profit margins for the branded drug. When the patent protection for the drug expires, a generic drug is usually developed and sold by a competing company. The development and approval of generics is less expensive, allowing them to be sold at a lower price. Often the owner of the branded drug will introduce a generic version before the patent expires in order to get a head start in the generic market. Restructuring has therefore become routine, driven by the patent expiration of products launched during the industry's "golden era" in the 1990s and companies' failure to develop sufficient new blockbuster products to replace lost revenues.

Prescriptions

In the U.S., the number of prescriptions dispensed rose over the period 1995 to 2005 to 3.4 billion annually, a 61 percent increase. Retail sales of prescription drugs jumped 250 percent, from $72 billion to $250 billion, while the average price of a prescription more than doubled from $30 to $68.

Marketing

Advertising is common in healthcare journals as well as through more mainstream media routes. In some countries, notably the US, pharmaceutical companies are allowed to advertise directly to the general public. They generally employ salespeople (often called 'drug reps' or, an older term, 'detail men') to market directly and personally to physicians and other healthcare providers. In some countries, notably the US, pharmaceutical companies also employ lobbyists to influence politicians. Marketing of prescription drugs in the US is regulated by the federal Prescription Drug Marketing Act of 1987.

To healthcare professionals

The book Bad Pharma also discusses the influence of drug representatives, how ghostwriters are employed by the drug companies to write papers for academics to publish, how independent the academic journals really are, how the drug companies finance doctors' continuing education, and how patients' groups are often funded by industry.

Direct to consumer advertising

Since the 1980s, new methods of marketing prescription drugs directly to consumers have become important. Direct-to-consumer media advertising was legalised in the FDA Guidance for Industry on Consumer-Directed Broadcast Advertisements.

Controversy about drug marketing and lobbying

There has been increasing controversy surrounding pharmaceutical marketing and influence. There have been accusations and findings of influence on doctors and other health professionals through drug reps including the constant provision of marketing 'gifts' and biased information to health professionals; highly prevalent advertising in journals and conferences; funding independent healthcare organizations and health promotion campaigns; lobbying physicians and politicians (more than any other industry in the US); sponsorship of medical schools or nurse training; sponsorship of continuing educational events, with influence on the curriculum; and hiring physicians as paid consultants on medical advisory boards. 

Some advocacy groups, such as No Free Lunch and AllTrials, have criticized the effect of drug marketing to physicians because they say it biases physicians to prescribe the marketed drugs even when others might be cheaper or better for the patient.

There have been related accusations of disease mongering (over-medicalising) to expand the market for medications. An inaugural conference on that subject took place in Australia in 2006. In 2009, the Government-funded National Prescribing Service launched the "Finding Evidence – Recognizing Hype" program, aimed at educating GPs on methods for independent drug analysis.

A 2005 review by a special committee of the UK government came to all the above conclusions in a European Union context whilst also highlighting the contributions and needs of the industry.

Meta-analyses have shown that psychiatric studies sponsored by pharmaceutical companies are several times more likely to report positive results, and if a drug company employee is involved the effect is even larger. Influence has also extended to the training of doctors and nurses in medical schools, a practice that is being contested.

It has been argued that the design of the Diagnostic and Statistical Manual of Mental Disorders and the expansion of the criteria represents an increasing medicalization of human nature, or "disease mongering", driven by drug company influence on psychiatry. The potential for direct conflict of interest has been raised, partly because roughly half the authors who selected and defined the DSM-IV psychiatric disorders had or previously had financial relationships with the pharmaceutical industry.

In the US, starting in 2013, under the Physician Financial Transparency Reports (part of the Sunshine Act), the Centers for Medicare & Medicaid Services has to collect information from applicable manufacturers and group purchasing organizations in order to report information about their financial relationships with physicians and hospitals. Data are made public on the Centers for Medicare & Medicaid Services website. The expectation is that the relationship between doctors and the pharmaceutical industry will become fully transparent.

In a report by the Center for Responsive Politics, there were more than 1,100 lobbyists working in some capacity for the pharmaceutical business in 2017. In the first quarter of 2017, the health products and pharmaceutical industry spent $78 million on lobbying members of the United States Congress.

Regulatory issues

Ben Goldacre has argued that regulators – such as the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK, or the Food and Drug Administration (FDA) in the United States – advance the interests of the drug companies rather than the interests of the public, because of the revolving-door exchange of employees between regulators and companies and the friendships that develop between regulator and company employees. He argues that regulators do not require that new drugs offer an improvement over what is already available, or even that they be particularly effective.

Others have argued that excessive regulation suppresses therapeutic innovation, and that the current cost of regulator-required clinical trials prevents the full exploitation of new genetic and biological knowledge for the treatment of human disease. A 2012 report by the President's Council of Advisors on Science and Technology made several key recommendations to reduce regulatory burdens to new drug development, including 1) expanding the FDA's use of accelerated approval processes, 2) creating an expedited approval pathway for drugs intended for use in narrowly defined populations, and 3) undertaking pilot projects designed to evaluate the feasibility of a new, adaptive drug approval process.

Pharmaceutical fraud

Pharmaceutical fraud involves deceptions which bring financial gain to a pharmaceutical company. It affects individuals and public and private insurers. Several different schemes are used to defraud the health care system which are particular to the pharmaceutical industry. These include Good Manufacturing Practice (GMP) violations, off-label marketing, best price fraud, CME fraud, Medicaid price reporting fraud, and manufactured compound drugs. In FY 2010 alone, $2.5 billion was recovered through False Claims Act cases involving such schemes. Examples of fraud cases include the GlaxoSmithKline $3 billion settlement, the Pfizer $2.3 billion settlement, and the Merck & Co. $650 million settlement. Damages from fraud can be recovered by use of the False Claims Act, most commonly under the qui tam provisions, which reward an individual for being a "whistleblower", or relator.

Every major company selling antipsychotics (Bristol-Myers Squibb, Eli Lilly, Pfizer, AstraZeneca, and Johnson & Johnson) has either settled recent government cases, under the False Claims Act, for hundreds of millions of dollars or is currently under investigation for possible health care fraud. Following charges of illegal marketing, two of the settlements set records last year for the largest criminal fines ever imposed on corporations. One involved Eli Lilly's antipsychotic Zyprexa, and the other involved Bextra. In the Bextra case, the government also charged Pfizer with illegally marketing the antipsychotic Geodon; Pfizer settled that part of the claim for $301 million, without admitting any wrongdoing.

On 2 July 2012, GlaxoSmithKline pleaded guilty to criminal charges and agreed to a $3 billion settlement of the largest health-care fraud case in the U.S. and the largest payment by a drug company. The settlement is related to the company's illegal promotion of prescription drugs, its failure to report safety data, bribing doctors, and promoting medicines for uses for which they were not licensed. The drugs involved were Paxil, Wellbutrin, Advair, Lamictal, and Zofran for off-label, non-covered uses. Those and the drugs Imitrex, Lotronex, Flovent, and Valtrex were involved in the kickback scheme.

The following is a list of the four largest settlements reached with pharmaceutical companies from 1991 to 2012, rank ordered by the size of the total settlement. Legal claims against the pharmaceutical industry have varied widely over the past two decades, including Medicare and Medicaid fraud, off-label promotion, and inadequate manufacturing practices.

Company | Settlement | Violation(s) | Year | Product(s) | Laws allegedly violated (if applicable)
GlaxoSmithKline | $3 billion | Off-label promotion, failure to disclose safety data | 2012 | Avandia, Wellbutrin, Paxil | False Claims Act, FDCA
Pfizer | $2.3 billion | Off-label promotion, kickbacks | 2009 | Bextra, Geodon, Zyvox, Lyrica | False Claims Act, FDCA
Abbott Laboratories | $1.5 billion | Off-label promotion | 2012 | Depakote | False Claims Act, FDCA
Eli Lilly | $1.4 billion | Off-label promotion | 2009 | Zyprexa | False Claims Act, FDCA

Developing world

Patents

Patents have been criticized in the developing world, as they are thought to reduce access to existing medicines. Reconciling patents and universal access to medicine would require an efficient international policy of price discrimination. Moreover, under the TRIPS agreement of the World Trade Organization, countries must allow pharmaceutical products to be patented. In 2001, the WTO adopted the Doha Declaration, which indicates that the TRIPS agreement should be read with the goals of public health in mind, and allows some methods for circumventing pharmaceutical monopolies: via compulsory licensing or parallel imports, even before patent expiration.

In March 2001, 40 multinational pharmaceutical companies brought litigation against South Africa over its Medicines Act, which allowed the generic production of antiretroviral drugs (ARVs) for treating HIV, despite the fact that these drugs were on-patent. HIV was and is an epidemic in South Africa, and ARVs at the time cost between US$10,000 and US$15,000 per patient per year. This was unaffordable for most South African citizens, so the South African government committed to providing ARVs at prices closer to what people could afford. To do so, it would need to ignore the patents on the drugs and produce generics within the country (using a compulsory license), or import them from abroad. After international protest in favour of public health rights (including the collection of 250,000 signatures by MSF), the governments of several developed countries (including the Netherlands, Germany, France, and later the US) backed the South African government, and the case was dropped in April of that year.

In 2016, GlaxoSmithKline (the world's sixth-largest pharmaceutical company) announced that it would be dropping its patents in poor countries so as to allow independent companies to make and sell versions of its drugs in those areas, thereby widening public access to them. GlaxoSmithKline published a list of 50 countries in which it would no longer hold patents, affecting one billion people worldwide.

Charitable programs

In 2011 four of the top 20 corporate charitable donations and eight of the top 30 corporate charitable donations came from pharmaceutical manufacturers. The bulk of corporate charitable donations (69% as of 2012) comes by way of non-cash donations, the majority of which again were contributed by pharmaceutical companies. Some of the large pharmaceutical company foundations are "patient assistance" foundations, providing financial support to individuals purchasing prescription medicines, but pharmaceutical companies are also huge givers of in-kind products, i.e. presumably their own drugs. Non-cash donations of product can be "profit maximizing…as part of an inventory control issue when they have excess inventories" for some corporations, says Patrick Rooney of Giving USA in an interview with Nonprofit Quarterly.

Charitable programs and drug discovery & development efforts by pharmaceutical companies include:
  • "Merck's Gift", wherein billions of river blindness drugs were donated in Africa
  • Pfizer's gift of free/discounted fluconazole and other drugs for AIDS in South Africa
  • GSK's commitment to give free albendazole tablets to the WHO for, and until, the elimination of lymphatic filariasis worldwide.
  • In 2006, Novartis committed US$755 million in corporate citizenship initiatives around the world, particularly focusing on improving access to medicines in the developing world through its Access to Medicine projects, including donations of medicines to patients affected by leprosy, tuberculosis, and malaria; Glivec patient assistance programs; and relief to support major humanitarian organisations with emergency medical needs.

Gene expression programming

From Wikipedia, the free encyclopedia


In computer programming, gene expression programming (GEP) is an evolutionary algorithm that creates computer programs or models. These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and composition, much like a living organism. And like living organisms, the computer programs of GEP are also encoded in simple linear chromosomes of fixed length. Thus, GEP is a genotype–phenotype system, benefiting from a simple genome to keep and transmit the genetic information and a complex phenotype to explore the environment and adapt to it.

Background

Evolutionary algorithms use populations of individuals, select individuals according to fitness, and introduce genetic variation using one or more genetic operators. Their use in artificial computational systems dates back to the 1950s, when they were used to solve optimization problems (e.g. Box 1957 and Friedman 1959). But it was with the introduction of evolution strategies by Rechenberg in 1965 that evolutionary algorithms gained popularity. A good overview text on evolutionary algorithms is the book "An Introduction to Genetic Algorithms" by Mitchell (1996).

Gene expression programming belongs to the family of evolutionary algorithms and is closely related to genetic algorithms and genetic programming. From genetic algorithms it inherited the linear chromosomes of fixed length; and from genetic programming it inherited the expressive parse trees of varied sizes and shapes. 

In gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype, creating a genotype/phenotype system. This genotype/phenotype system is multigenic, thus encoding multiple parse trees in each chromosome. This means that the computer programs created by GEP are composed of multiple parse trees. Because these parse trees are the result of gene expression, in GEP they are called expression trees.

Encoding: the genotype

The genome of gene expression programming consists of a linear, symbolic string or chromosome of fixed length composed of one or more genes of equal size. These genes, despite their fixed length, code for expression trees of different sizes and shapes. An example of a chromosome with two genes, each of size 9, is the string (position zero indicates the start of each gene): 

012345678012345678


L+a-baccd**cLabacd


where “L” represents the natural logarithm function and “a”, “b”, “c”, and “d” represent the variables and constants used in a problem.

Expression trees: the phenotype

As shown above, the genes of gene expression programming all have the same size. However, these fixed-length strings code for expression trees of different sizes. This means that the size of the coding regions varies from gene to gene, allowing adaptation and evolution to occur smoothly.
For example, the mathematical expression sqrt((a - b) * (c + d)) can also be represented as an expression tree:

Expression tree encoded by the k-expression Q*-+abcd, where "Q" represents the square root function.

This kind of expression tree consists of the phenotypic expression of GEP genes, whereas the genes are linear strings encoding these complex structures. For this particular example, the linear string corresponds to: 

01234567


Q*-+abcd


which is the straightforward reading of the expression tree from top to bottom and from left to right. These linear strings are called k-expressions (from Karva notation). 

Going from k-expressions to expression trees is also very simple. For example, the following k-expression: 

01234567890


Q*b**+baQba


is composed of two different terminals (the variables “a” and “b”), two different functions of two arguments (“*” and “+”), and a function of one argument (“Q”). Its expression gives: 

Expression tree encoded by the k-expression Q*b**+baQba.
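The breadth-first reading just described is easy to automate. The following minimal Python sketch is not part of the original article; the arity table is an assumption chosen to match the symbols used in the examples above. It decodes a k-expression into a tree and prints it in infix form.

```python
from collections import deque

# Assumed arities for the example symbols: Q is the square root function
# (1 argument), the arithmetic operators take 2 arguments, and letters are
# terminals (variables) with arity 0.
ARITY = {'Q': 1, '+': 2, '-': 2, '*': 2, '/': 2}


def karva_to_tree(kexpr):
    """Decode a k-expression into a nested (symbol, children) tree by reading
    the string level by level (breadth-first)."""
    nodes = [(sym, []) for sym in kexpr]
    index = 1                      # next unread element of the string
    queue = deque([nodes[0]])      # nodes still waiting to receive children
    while queue:
        sym, children = queue.popleft()
        for _ in range(ARITY.get(sym, 0)):
            child = nodes[index]
            index += 1
            children.append(child)
            queue.append(child)
    return nodes[0]                # the root; unread elements stay noncoding


def to_infix(node):
    """Render a decoded tree as a readable expression string."""
    sym, children = node
    if not children:
        return sym
    if sym == 'Q':
        return "sqrt(" + to_infix(children[0]) + ")"
    return "(" + to_infix(children[0]) + " " + sym + " " + to_infix(children[1]) + ")"


print(to_infix(karva_to_tree("Q*-+abcd")))     # sqrt(((a - b) * (c + d)))
print(to_infix(karva_to_tree("Q*b**+baQba")))  # sqrt((b * ((b * a) * (sqrt(a) + b))))
```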

K-expressions and genes

The k-expressions of gene expression programming correspond to the region of genes that gets expressed. This means that there might be sequences in the genes that are not expressed, which is indeed true for most genes. The reason for these noncoding regions is to provide a buffer of terminals so that all k-expressions encoded in GEP genes always correspond to valid programs or expressions.
The genes of gene expression programming are therefore composed of two different domains – a head and a tail – each with different properties and functions. The head is used mainly to encode the functions and variables chosen to solve the problem at hand, whereas the tail, while also used to encode the variables, provides essentially a reservoir of terminals to ensure that all programs are error-free. 

For GEP genes the length of the tail is given by the formula:

t = h (nmax - 1) + 1

where h is the head's length and nmax is the maximum arity. For example, for a gene created using the set of functions F = {Q, +, -, *, /} and the set of terminals T = {a, b}, nmax = 2. And if we choose a head length of 15, then t = 15 (2 - 1) + 1 = 16, which gives a gene length g of 15 + 16 = 31. The randomly generated string below is an example of one such gene:

0123456789012345678901234567890


*b+a-aQab+//+b+babbabbbababbaaa


It encodes the expression tree:
Expression tree encoded by the k-expression *b+a-aQa.

This tree uses only 8 of the 31 elements that constitute the gene.

It's not hard to see that, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, with the simplest composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as there are elements in the gene (when all the elements in the head are functions with maximum arity).

It's also not hard to see that it is trivial to implement all kinds of genetic modification (mutation, inversion, insertion, recombination, and so on) with the guarantee that all resulting offspring encode correct, error-free programs.
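To make these two points concrete, here is a small Python sketch (not from the original article; the function set, head length, and mutation rate are illustrative assumptions matching the example above). It builds a random gene with the head/tail structure and applies a point mutation that respects the two domains, so every offspring still decodes to a valid expression tree.

```python
import random

FUNCTIONS = {'Q': 1, '+': 2, '-': 2, '*': 2, '/': 2}   # symbol -> arity
TERMINALS = ['a', 'b']
HEAD_LENGTH = 15
MAX_ARITY = max(FUNCTIONS.values())
TAIL_LENGTH = HEAD_LENGTH * (MAX_ARITY - 1) + 1        # t = h(nmax - 1) + 1 = 16
GENE_LENGTH = HEAD_LENGTH + TAIL_LENGTH                # 15 + 16 = 31


def random_gene():
    """Create a random gene: the head may hold functions or terminals, while
    the tail holds terminals only, so the gene always decodes to a valid tree."""
    head = [random.choice(list(FUNCTIONS) + TERMINALS) for _ in range(HEAD_LENGTH)]
    tail = [random.choice(TERMINALS) for _ in range(TAIL_LENGTH)]
    return ''.join(head + tail)


def point_mutate(gene, rate=0.05):
    """Point mutation that respects the head/tail domains, so every offspring
    is still a syntactically correct gene of the same length."""
    symbols = list(gene)
    for i in range(len(symbols)):
        if random.random() < rate:
            pool = list(FUNCTIONS) + TERMINALS if i < HEAD_LENGTH else TERMINALS
            symbols[i] = random.choice(pool)
    return ''.join(symbols)


gene = random_gene()
print(len(gene) == GENE_LENGTH, gene)   # True, and a 31-symbol gene
print(point_mutate(gene))               # mutated offspring, still a valid gene
```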

Multigenic chromosomes

The chromosomes of gene expression programming are usually composed of more than one gene of equal length. Each gene codes for a sub-expression tree (sub-ET) or sub-program. Then the sub-ETs can interact with one another in different ways, forming a more complex program. The figure shows an example of a program composed of three sub-ETs. 

Expression of GEP genes as sub-ETs. a) A three-genic chromosome with the tails shown in bold. b) The sub-ETs encoded by each gene.
In the final program the sub-ETs could be linked by addition or some other function, as there are no restrictions on the kind of linking function one might choose. Some examples of more complex linkers include taking the average, the median, the midrange, thresholding their sum to make a binomial classification, applying the sigmoid function to compute a probability, and so on. These linking functions are usually chosen a priori for each problem, but they can also be evolved elegantly and efficiently by the cellular system of gene expression programming.
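
A rough sketch of this expression step, reusing the karva_to_tree and evaluate helpers sketched earlier and assuming the sub-ETs are linked by addition:

def express_chromosome(chromosome, gene_length, env, link=sum):
    # Split the chromosome into its equal-length genes, express and evaluate
    # each gene as a sub-ET, and link the results (by addition by default).
    genes = [chromosome[i:i + gene_length]
             for i in range(0, len(chromosome), gene_length)]
    return link(evaluate(karva_to_tree(gene), env) for gene in genes)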

Cells and code reuse

In gene expression programming, homeotic genes control the interactions of the different sub-ETs or modules of the main program. The expression of such genes results in different main programs or cells, that is, they determine which genes are expressed in each cell and how the sub-ETs of each cell interact with one another. In other words, homeotic genes determine which sub-ETs are called upon in each main program or cell, how often they are called, and what kind of connections they establish with one another.

Homeotic genes and the cellular system

Homeotic genes have exactly the same kind of structural organization as normal genes and they are built using an identical process. They also contain a head domain and a tail domain, with the difference that the heads now contain linking functions and a special kind of terminal – genic terminals – that represent the normal genes. The expression of the normal genes results, as usual, in different sub-ETs, which in the cellular system are called ADFs (automatically defined functions). As for the tails, they contain only genic terminals, that is, derived features generated on the fly by the algorithm.

For example, the chromosome in the figure has three normal genes and one homeotic gene and encodes a main program that invokes three different functions a total of four times, linking them in a particular way. 

Expression of a unicellular system with three ADFs. a) The chromosome composed of three conventional genes and one homeotic gene (shown in bold). b) The ADFs encoded by each conventional gene. c) The main program or cell.
From this example it is clear that the cellular system not only allows the unconstrained evolution of linking functions but also code reuse. And it shouldn't be hard to implement recursion in this system.

Multiple main programs and multicellular systems

Multicellular systems are composed of more than one homeotic gene. Each homeotic gene in this system puts together a different combination of sub-expression trees or ADFs, creating multiple cells or main programs. 

For example, the program shown in the figure was created using a cellular system with two cells and three normal genes. 

Expression of a multicellular system with three ADFs and two main programs. a) The chromosome composed of three conventional genes and two homeotic genes (shown in bold). b) The ADFs encoded by each conventional gene. c) Two different main programs expressed in two different cells.
The applications of these multicellular systems are multiple and varied and, like the multigenic systems, they can be used both in problems with just one output and in problems with multiple outputs.

Other levels of complexity

The head/tail domain of GEP genes (both normal and homeotic) is the basic building block of all GEP algorithms. However, gene expression programming also explores other chromosomal organizations that are more complex than the head/tail structure. Essentially these complex structures consist of functional units or genes with a basic head/tail domain plus one or more extra domains. These extra domains usually encode random numerical constants that the algorithm relentlessly fine-tunes in order to find a good solution. For instance, these numerical constants may be the weights or factors in a function approximation problem (see the GEP-RNC algorithm below); they may be the weights and thresholds of a neural network (see the GEP-NN algorithm below); the numerical constants needed for the design of decision trees (see the GEP-DT algorithm below); the weights needed for polynomial induction; or the random numerical constants used to discover the parameter values in a parameter optimization task.

The basic gene expression algorithm

The fundamental steps of the basic gene expression algorithm are listed below in pseudocode:
1. Select function set;
2. Select terminal set;
3. Load dataset for fitness evaluation;
4. Create chromosomes of initial population randomly;
5. For each program in population:
a) Express chromosome;
b) Execute program;
c) Evaluate fitness;
6. Verify stop condition;
7. Select programs;
8. Replicate selected programs to form the next population;
9. Modify chromosomes using genetic operators;
10. Go to step 5.
The first four steps prepare all the ingredients needed for the iterative loop of the algorithm (steps 5 through 10). Of these preparative steps, the crucial one is the creation of the initial population, whose chromosomes are generated randomly from the elements of the function and terminal sets.
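
The loop can be summarized in a Python-like skeleton; everything below (the function names, the stop condition, and the elitism detail) is a sketch of the pseudocode above under the assumption that fitness is normalized to a maximum of 1.0, not the article's own code:

def run_gep(new_chromosome, express, fitness, select, modify,
            population_size, max_generations):
    population = [new_chromosome() for _ in range(population_size)]   # step 4
    best = population[0]
    for generation in range(max_generations):
        programs = [express(c) for c in population]                   # step 5a
        scores = [fitness(p) for p in programs]                       # steps 5b-5c
        best_score, best = max(zip(scores, population))               # kept for elitism
        if best_score >= 1.0:                                         # step 6: stop condition
            break
        parents = select(population, scores)                          # step 7
        population = [modify(p) for p in parents]                     # steps 8-9
        population[0] = best                                          # simple elitism
    return best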

Populations of programs

Like all evolutionary algorithms, gene expression programming works with populations of individuals, which in this case are computer programs. Therefore, some kind of initial population must be created to get things started. Subsequent populations are descendants, via selection and genetic modification, of the initial population. 

In the genotype/phenotype system of gene expression programming, it is only necessary to create the simple linear chromosomes of the individuals without worrying about the structural soundness of the programs they code for, as their expression always results in syntactically correct programs.

Fitness functions and the selection environment

Fitness functions and selection environments (called training datasets in machine learning) are the two facets of fitness and are therefore intricately connected. Indeed, the fitness of a program depends not only on the cost function used to measure its performance but also on the training data chosen to evaluate fitness.

The selection environment or training data

The selection environment consists of the set of training records, which are also called fitness cases. These fitness cases could be a set of observations or measurements concerning some problem, and they form what is called the training dataset. 

The quality of the training data is essential for the evolution of good solutions. A good training set should be representative of the problem at hand and also well-balanced, otherwise the algorithm might get stuck at some local optimum. In addition, it is also important to avoid using unnecessarily large datasets for training as this will slow things down unnecessarily. A good rule of thumb is to choose enough records for training to enable a good generalization in the validation data and leave the remaining records for validation and testing.

Fitness functions

Broadly speaking, there are essentially three different kinds of problems based on the kind of prediction being made:
  1. Problems involving numeric (continuous) predictions;
  2. Problems involving categorical or nominal predictions, both binomial and multinomial;
  3. Problems involving binary or Boolean predictions.
The first type of problem goes by the name of regression; the second is known as classification, with logistic regression as a special case where, besides the crisp classifications like "Yes" or "No", a probability is also attached to each outcome; and the last one is related to Boolean algebra and logic synthesis.

Fitness functions for regression

In regression, the response or dependent variable is numeric (usually continuous) and therefore the output of a regression model is also continuous. So it's quite straightforward to evaluate the fitness of the evolving models by comparing the output of the model to the value of the response in the training data.

There are several basic fitness functions for evaluating model performance, with the most common being based on the error or residual between the model output and the actual value. Such functions include the mean squared error, root mean squared error, mean absolute error, relative squared error, root relative squared error, relative absolute error, and others.

All these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications. But some problems might require a coarser evolution, such as determining if a prediction is within a certain interval, for instance less than 10% of the actual value. However, even if one is only interested in counting the hits (that is, a prediction that is within the chosen interval), making populations of models evolve based on just the number of hits each program scores is usually not very efficient due to the coarse granularity of the fitness landscape. Thus the solution usually involves combining these coarse measures with some kind of smooth function such as the standard error measures listed above. 

Fitness functions based on the correlation coefficient and R-square are also very smooth. For regression problems, these functions work best by combining them with other measures because, by themselves, they only tend to measure correlation, not caring for the range of values of the model output. So by combining them with functions that work at approximating the range of the target values, they form very efficient fitness functions for finding models with good correlation and good fit between predicted and actual values.
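
As one simple possibility (not the article's exact formula), a fitness based on the mean squared error can be bounded so that higher values always mean better models:

def regression_fitness(model, training_set):
    # training_set is a list of (inputs, target) pairs; model maps inputs to a number.
    errors = [(model(inputs) - target) ** 2 for inputs, target in training_set]
    mse = sum(errors) / len(errors)
    return 1.0 / (1.0 + mse)   # 1.0 for a perfect fit, approaching 0 for poor fits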

Fitness functions for classification and logistic regression

The design of fitness functions for classification and logistic regression takes advantage of three different characteristics of classification models. The most obvious is just counting the hits, that is, if a record is classified correctly it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The "00" representation means that a negative case (represented by "0”) was correctly classified, whereas the "11" means that a positive case (represented by "1”) was correctly classified. Classifications of the type "00" are called true negatives (TN) and "11" true positives (TP). 

There are also two types of incorrect classifications and they are represented by 01 and 10. They are called false positives (FP) when the actual value is 0 and the model predicts a 1, and false negatives (FN) when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept in a table known as the confusion matrix.

Confusion matrix for a binomial classification task.
So by counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include sensitivity/specificity, recall/precision, F-measure, Jaccard similarity, Matthews correlation coefficient, and cost/gain matrix which combines the costs and gains assigned to the 4 different types of classifications. 
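
For example, a sensitivity/specificity fitness (one of the combinations mentioned above) can be computed directly from the four counts; the sketch below assumes 0/1 predictions and targets:

def sensitivity_specificity_fitness(predictions, targets):
    tp = sum(1 for p, t in zip(predictions, targets) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(predictions, targets) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(predictions, targets) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predictions, targets) if p == 0 and t == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity   # 1.0 only for a perfect classifier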

These functions based on the confusion matrix are quite sophisticated and are adequate to solve most problems efficiently. But there is another dimension to classification models which is key to exploring the solution space more efficiently and therefore to discovering better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range but also the distribution of the model output and the classifier margin.
By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the mean squared error evaluated between the raw model outputs and the actual values. Or combine the F-measure with the R-square evaluated for the raw model output and the target; or the cost/gain matrix with the correlation coefficient, and so on. More exotic fitness functions that explore model granularity include the area under the ROC curve and rank measure.

Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in logistic regression. It is then possible to evaluate the mean squared error (or some other similar measure) between these probabilities and the actual values, and combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on the probabilities include maximum likelihood estimation and hinge loss.

Fitness functions for Boolean problems

In logic there is no model structure (as defined above for classification and logistic regression) to explore: the domain and range of logical functions comprise only 0’s and 1’s or false and true. So, the fitness functions available for Boolean algebra can only be based on the hits or on the confusion matrix as explained in the section above.

Selection and elitism

Roulette-wheel selection is perhaps the most popular selection scheme used in evolutionary computation. It involves assigning each program a slice of the roulette wheel proportional to its fitness. Then the roulette is spun as many times as there are programs in the population in order to keep the population size constant. So, with roulette-wheel selection programs are selected both according to fitness and the luck of the draw, which means that sometimes the best traits might be lost. However, by combining roulette-wheel selection with the cloning of the best program of each generation, one guarantees that at least the very best traits are not lost. This technique of cloning the best-of-generation program is known as simple elitism and is used by most stochastic selection schemes.
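
A minimal sketch of roulette-wheel selection with simple elitism, assuming non-negative fitness values that are not all zero:

import random

def roulette_wheel_with_elitism(population, fitnesses):
    best = max(zip(fitnesses, population))[1]   # clone of the best program
    drawn = random.choices(population, weights=fitnesses, k=len(population) - 1)
    return [best] + drawn                       # the population size is preserved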

Reproduction with modification

The reproduction of programs involves first the selection and then the reproduction of their genomes. Genome modification is not required for reproduction, but without it adaptation and evolution won't take place.

Replication and selection

The selection operator selects the programs for the replication operator to copy. Depending on the selection scheme, the number of copies one program originates may vary, with some programs getting copied more than once while others are copied just once or not at all. In addition, selection is usually set up so that the population size remains constant from one generation to another. 

The replication of genomes in nature is very complex and it took scientists a long time to discover the DNA double helix and propose a mechanism for its replication. But the replication of strings is trivial in artificial evolutionary systems, where only an instruction to copy strings is required to pass all the information in the genome from generation to generation. 

The replication of the selected programs is a fundamental piece of all artificial evolutionary systems, but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction, but rather with a few errors thrown in. Indeed, genetic diversity is created with genetic operators such as mutation, recombination, transposition, inversion, and many others.

Mutation

In gene expression programming mutation is the most important genetic operator. It changes genomes by replacing one element with another. The accumulation of many small changes over time can create great diversity.

In gene expression programming mutation is totally unconstrained, which means that in each gene domain any domain symbol can be replaced by another. For example, in the heads of genes any function can be replaced by a terminal or another function, regardless of the number of arguments in this new function; and a terminal can be replaced by a function or another terminal.
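
A sketch of such a point mutation for the basic head/tail genes (the mutation rate and the helper name are arbitrary choices for this illustration): head positions may receive any function or terminal, whereas tail positions may only receive terminals, so the mutated gene always remains valid:

import random

def mutate(gene, head_length, functions, terminals, rate=0.05):
    # functions and terminals are lists of symbols.
    symbols = list(gene)
    for i in range(len(symbols)):
        if random.random() < rate:
            pool = functions + terminals if i < head_length else terminals
            symbols[i] = random.choice(pool)
    return ''.join(symbols)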

Recombination

Recombination usually involves two parent chromosomes to create two new chromosomes by combining different parts from the parent chromosomes. And as long as the parent chromosomes are aligned and the exchanged fragments are homologous (that is, occupy the same position in the chromosome), the new chromosomes created by recombination will always encode syntactically correct programs. 

Different kinds of crossover are easily implemented either by changing the number of parents involved (there's no reason for choosing only two); the number of split points; or the way one chooses to exchange the fragments, for example, either randomly or in some orderly fashion. For example, gene recombination, which is a special case of recombination, can be done by exchanging homologous genes (genes that occupy the same position in the chromosome) or by exchanging genes chosen at random from any position in the chromosome.
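
For example, one-point recombination on two equal-length parent chromosomes can be sketched as follows:

import random

def one_point_recombination(parent1, parent2):
    # Because the parents have the same length and the exchanged fragments are
    # homologous, both offspring are guaranteed to encode valid programs.
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])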

Transposition

Transposition involves the introduction of an insertion sequence somewhere in a chromosome. In gene expression programming insertion sequences may originate anywhere in the chromosome, but they are only inserted in the heads of genes. This guarantees that even insertion sequences taken from the tails result in error-free programs.

For transposition to work properly, it must preserve chromosome length and gene structure. So, in gene expression programming transposition can be implemented using two different methods: the first creates a shift at the insertion site, followed by a deletion at the end of the head; the second overwrites the local sequence at the target site and therefore is easier to implement. Both methods can be implemented to operate between chromosomes or within a chromosome or even within a single gene.
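
A sketch of the first method (a shift at the insertion site followed by a deletion at the end of the head); skipping the first head position is an assumption made here, since transposition to the root is usually treated as a separate operator:

import random

def is_transposition(gene, head_length, max_fragment_length=3):
    start = random.randrange(len(gene))             # the fragment may come from anywhere
    fragment = gene[start:start + random.randint(1, max_fragment_length)]
    target = random.randint(1, head_length - 1)     # insertion site inside the head
    head, rest = gene[:head_length], gene[head_length:]
    new_head = (head[:target] + fragment + head[target:])[:head_length]
    return new_head + rest                          # gene length and tail are preserved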

Inversion

Inversion is an interesting operator, especially powerful for combinatorial optimization. It consists of inverting a small sequence within a chromosome. 

In gene expression programming it can be easily implemented in all gene domains and, in all cases, the offspring produced is always syntactically correct. For any gene domain, a sequence (ranging from at least two elements to as big as the domain itself) is chosen at random within that domain and then inverted.

Other genetic operators

Several other genetic operators exist and in gene expression programming, with its different genes and gene domains, the possibilities are endless. For example, genetic operators such as one-point recombination, two-point recombination, gene recombination, uniform recombination, gene transposition, root transposition, domain-specific mutation, domain-specific inversion, domain-specific transposition, and so on, are easily implemented and widely used.

The GEP-RNC algorithm

Numerical constants are essential elements of mathematical and statistical models and therefore it is important to allow their integration in the models designed by evolutionary algorithms. 

Gene expression programming solves this problem very elegantly through the use of an extra gene domain – the Dc – for handling random numerical constants (RNC). By combining this domain with a special terminal placeholder for the RNCs, a richly expressive system can be created. 

Structurally, the Dc comes after the tail, has a length equal to the size of the tail t, and is composed of the symbols used to represent the RNCs. 

For example, below is shown a simple chromosome composed of only one gene with a head size of 7 (the Dc stretches over positions 15–22):

01234567890123456789012


+?*+?**aaa??aaa68083295


where the terminal "?” represents the placeholder for the RNCs. This kind of chromosome is expressed exactly as shown above, giving: 

[Figure: the expression tree with the "?" placeholders for the RNCs]

Then the ?'s in the expression tree are replaced from left to right and from top to bottom by the symbols (for simplicity represented by numerals) in the Dc, giving: 

[Figure: the expression tree with the Dc symbols (numerals) in place of the placeholders]

The values corresponding to these symbols are kept in an array. (For simplicity, the number represented by the numeral indicates the order in the array.) For instance, for the following 10 element array of RNCs:
C = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}
the expression tree above gives:

[Figure: the expression tree with the RNCs in place]
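
Reusing the tree representation from the earlier sketch, the assignment of the RNCs can be written as a breadth-first walk over the expressed tree: each "?" consumes the next Dc symbol, which indexes the array of constants (the helper name is an invention for this illustration):

def assign_rncs(tree, dc_symbols, constants):
    # Walk the tree from top to bottom and from left to right, replacing each
    # '?' with the constant indexed by the next Dc symbol.
    dc = iter(dc_symbols)
    level = [tree]
    while level:
        next_level = []
        for node in level:
            if node[0] == '?':
                node[0] = constants[int(next(dc))]
            next_level.extend(node[1])
        level = next_level
    return tree

Evaluating such a tree then only requires that numeric nodes return their own value instead of being looked up as variables.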

This elegant structure for handling random numerical constants is at the heart of different GEP systems, such as GEP neural networks and GEP decision trees.

Like the basic gene expression algorithm, the GEP-RNC algorithm is also multigenic and its chromosomes are decoded as usual by expressing one gene after another and then linking them all together by the same kind of linking process.

The genetic operators used in the GEP-RNC system are an extension of the genetic operators of the basic GEP algorithm (see above), and they can all be straightforwardly implemented in these new chromosomes. The basic operators of mutation, inversion, transposition, and recombination are of course also used in the GEP-RNC algorithm, and special Dc-specific operators (mutation, inversion, and transposition) are used as well to aid in a more efficient circulation of the RNCs among individual programs. In addition, there is also a special mutation operator that allows the permanent introduction of variation in the set of RNCs. The initial set of RNCs is randomly created at the beginning of a run, which means that, for each gene in the initial population, a specified number of numerical constants, chosen from a certain range, are randomly generated. Their circulation and mutation are then enabled by the genetic operators.

Neural networks

An artificial neural network (ANN or NN) is a computational device that consists of many simple connected units or neurons. The connections between the units are usually weighted by real-valued weights. These weights are the primary means of learning in neural networks and a learning algorithm is usually used to adjust them. 

Structurally, a neural network has three different classes of units: input units, hidden units, and output units. An activation pattern is presented at the input units and then spreads in a forward direction from the input units through one or more layers of hidden units to the output units. The activation coming into one unit from other units is multiplied by the weights on the links over which it spreads. All incoming activation is then added together and the unit becomes activated only if the incoming result is above the unit’s threshold.

In summary, the basic components of a neural network are the units, the connections between the units, the weights, and the thresholds. So, in order to fully simulate an artificial neural network one must somehow encode these components in a linear chromosome and then be able to express them in a meaningful way. 

In GEP neural networks (GEP-NN or GEP nets), the network architecture is encoded in the usual structure of a head/tail domain. The head contains special functions/neurons that activate the hidden and output units (in the GEP context, all these units are more appropriately called functional units) and terminals that represent the input units. The tail, as usual, contains only terminals/input units. 

Besides the head and the tail, these neural network genes contain two additional domains, Dw and Dt, for encoding the weights and thresholds of the neural network. Structurally, the Dw comes after the tail and its length dw depends on the head size h and the maximum arity nmax, being given by the formula:

dw = h × nmax

The Dt comes after Dw and has a length dt equal to t. Both domains are composed of symbols representing the weights and thresholds of the neural network. 

For each NN-gene, the weights and thresholds are created at the beginning of each run, but their circulation and adaptation are guaranteed by the usual genetic operators of mutation, transposition, inversion, and recombination. In addition, special operators are also used to allow a constant flow of genetic variation in the set of weights and thresholds. 

For example, below is shown a neural network with two input units (i1 and i2), two hidden units (h1 and h2), and one output unit (o1). It has a total of six connections with six corresponding weights represented by the numerals 1–6 (for simplicity, the thresholds are all equal to 1 and are omitted): 

[Figure: a neural network with five units]

This representation is the canonical neural network representation, but neural networks can also be represented by a tree, which, in this case, corresponds to: 

[Figure: the same neural network represented as a GEP tree with seven nodes]

where "a” and "b” represent the two inputs i1 and i2 and "D” represents a function with connectivity two. This function adds all its weighted arguments and then thresholds this activation in order to determine the forwarded output. This output (zero or one in this simple case) depends on the threshold of each unit, that is, if the total incoming activation is equal to or greater than the threshold, then the output is one, zero otherwise.

The above NN-tree can be linearized as follows: 

0123456789012


DDDabab654321


where the structure in positions 7–12 (Dw) encodes the weights. The values of each weight are kept in an array and retrieved as necessary for expression. 

As a more concrete example, below is shown a neural net gene for the exclusive-or problem. It has a head size of 3 and Dw size of 6: 

0123456789012


DDDabab393257


Its expression results in the following neural network: 

[Figure: expression of a GEP neural network for the exclusive-or problem]

which, for the following set of weights:
W = {−1.978, 0.514, −0.465, 1.22, −1.686, −1.797, 0.197, 1.606, 0, 1.753}
gives:

[Figure: a GEP neural network solution for the exclusive-or problem]

which is a perfect solution to the exclusive-or function.

Besides simple Boolean functions with binary inputs and binary outputs, the GEP-nets algorithm can handle all kinds of functions or neurons (linear neuron, tanh neuron, atan neuron, logistic neuron, limit neuron, radial basis and triangular basis neurons, all kinds of step neurons, and so on). Also interesting is that the GEP-nets algorithm can use all these neurons together and let evolution decide which ones work best to solve the problem at hand. So, GEP-nets can be used not only in Boolean problems but also in logistic regression, classification, and regression. In all cases, GEP-nets can be implemented not only with multigenic systems but also cellular systems, both unicellular and multicellular. Furthermore, multinomial classification problems can also be tackled in one go by GEP-nets both with multigenic systems and multicellular systems.

Decision trees

Decision trees (DT) are classification models where a series of questions and answers are mapped using nodes and directed edges. 

Decision trees have three types of nodes: a root node, internal nodes, and leaf or terminal nodes. The root node and all internal nodes represent test conditions for different attributes or variables in a dataset. Leaf nodes specify the class label for all different paths in the tree.

Most decision tree induction algorithms involve selecting an attribute for the root node and then making the same kind of informed decision for all the other nodes in the tree.

Decision trees can also be created by gene expression programming, with the advantage that all the decisions concerning the growth of the tree are made by the algorithm itself without any kind of human input.

There are basically two different types of DT algorithms: one for inducing decision trees with only nominal attributes and another for inducing decision trees with both numeric and nominal attributes. This aspect of decision tree induction also carries over to gene expression programming, and there are two GEP algorithms for decision tree induction: the evolvable decision trees (EDT) algorithm for dealing exclusively with nominal attributes and the EDT-RNC algorithm (EDT with random numerical constants) for handling both nominal and numeric attributes.

In the decision trees induced by gene expression programming, the attributes behave as function nodes in the basic gene expression algorithm, whereas the class labels behave as terminals. This means that attribute nodes also have associated with them a specific arity, or number of branches, that will determine their growth and, ultimately, the growth of the tree. Class labels behave like terminals, which means that for a k-class classification task a terminal set with k terminals is used, representing the k different classes.

The rules for encoding a decision tree in a linear genome are very similar to the rules used to encode mathematical expressions. So, for decision tree induction the genes also have a head and a tail, with the head containing attributes and terminals and the tail containing only terminals. This again ensures that all decision trees designed by GEP are always valid programs. Furthermore, the size of the tail t is again dictated by the head size h and the number of branches of the attribute with the most branches, nmax, through the same equation:

t = h (nmax − 1) + 1

For example, consider the decision tree below to decide whether to play outside: 

[Figure: a decision tree for deciding whether to play outside]

It can be linearly encoded as: 

01234567


HOWbaaba


where “H” represents the attribute Humidity, “O” the attribute Outlook, “W” represents Windy, and “a” and “b” the class labels "Yes" and "No" respectively. Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore don’t have to be encoded.
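
Because the branching structure comes from the data, the Karva decoding sketched earlier applies unchanged once the arity table is replaced by the number of branches of each attribute (the counts below, Outlook 3, Humidity 2, Windy 2, are assumptions drawn from the classic play-outside data):

BRANCHES = {'O': 3, 'H': 2, 'W': 2}   # class labels 'a' and 'b' have 0 branches

def decision_tree(kexpr):
    nodes = [[symbol, []] for symbol in kexpr]
    i, level = 1, [nodes[0]]
    while level:
        next_level = []
        for node in level:
            n_branches = BRANCHES.get(node[0], 0)
            node[1], i = nodes[i:i + n_branches], i + n_branches
            next_level.extend(node[1])
        level = next_level
    return nodes[0]

# decision_tree('HOWbaaba'): H branches to O and W, O to b, a, a, and W to b, a,
# using all eight symbols of the k-expression.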

The process of decision tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes. Then the chromosomes are expressed as decision trees and their fitness evaluated against a training dataset. According to fitness they are then selected to reproduce with modification. The genetic operators are exactly the same that are used in a conventional unigenic system, for example, mutation, inversion, transposition, and recombination.

Decision trees with both nominal and numeric attributes are also easily induced with gene expression programming using the framework described above for dealing with random numerical constants. The chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node. For example, the gene below with a head size of 5 (the Dc starts at position 16): 

012345678901234567890


WOTHabababbbabba46336


encodes the decision tree shown below: 

[Figure: the decision tree encoded by the k-expression WOTHababab]

In this system, every node in the head, irrespective of its type (numeric attribute, nominal attribute, or terminal), has associated with it a random numerical constant, which for simplicity in the example above is represented by a numeral 0–9. These random numerical constants are encoded in the Dc domain and their expression follows a very simple scheme: from top to bottom and from left to right, the elements in Dc are assigned one-by-one to the elements in the decision tree. So, for the following array of RNCs:
C = {62, 51, 68, 83, 86, 41, 43, 44, 9, 67}
the decision tree above results in: 

[Figure: the decision tree with numeric and nominal attributes, k-expression WOTHababab, with the RNCs assigned]

which can also be represented more colorfully as a conventional decision tree: 

[Figure: the same tree drawn as a conventional decision tree]

Criticism

GEP has been criticized for not being a major improvement over other genetic programming techniques. In many experiments, it did not perform better than existing methods.

Software

Commercial applications

GeneXproTools
GeneXproTools is a predictive analytics suite developed by Gepsoft. GeneXproTools modeling frameworks include logistic regression, classification, regression, time series prediction, and logic synthesis. GeneXproTools implements the basic gene expression algorithm and the GEP-RNC algorithm, both used in all the modeling frameworks of GeneXproTools.

Open-source libraries

GEP4J – GEP for Java Project
Created by Jason Thomas, GEP4J is an open-source implementation of gene expression programming in Java. It implements different GEP algorithms, including evolving decision trees (with nominal, numeric, or mixed attributes) and automatically defined functions. GEP4J is hosted at Google Code.
PyGEP – Gene Expression Programming for Python
Created by Ryan O'Neil as a simple library suitable for the academic study of gene expression programming in Python, aiming for ease of use and rapid implementation. It implements standard multigenic chromosomes and the genetic operators mutation, crossover, and transposition. PyGEP is hosted at Google Code.
jGEP – Java GEP toolkit
Created by Matthew Sottile to rapidly build Java prototype codes that use GEP, which can then be rewritten in a language such as C or Fortran for speed. jGEP is hosted at SourceForge.

Further reading


  • Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7.
  • Ferreira, C. (2002). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Portugal: Angra do Heroismo. ISBN 972-95890-5-4.
