Search This Blog

Saturday, March 27, 2021

Search engine

From Wikipedia, the free encyclopedia

The results of a search for the term "lunar eclipse" in a web-based image search engine

A search engine is a software system that is designed to carry out web searches (Internet searches), which means to search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs) The information may be a mix of links to web pages, images, videos, infographics, articles, research papers, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. Internet content that is not capable of being searched by a web search engine is generally described as the deep web.

History

Timeline
Year Engine Current status
1993 W3Catalog Active
Aliweb Active
JumpStation Inactive
WWW Worm Inactive
1994 WebCrawler Active
Go.com Inactive, redirects to Disney
Lycos Active
Infoseek Inactive, redirects to Disney
1995 Yahoo! Search Active, initially a search function for Yahoo! Directory
Daum Active
Magellan Inactive
Excite Active
SAPO Active
MetaCrawler Active
AltaVista Inactive, acquired by Yahoo! in 2003, since 2013 redirects to Yahoo!
1996 RankDex Inactive, incorporated into Baidu in 2000
Dogpile Active, Aggregator
Inktomi Inactive, acquired by Yahoo!
HotBot Active
Ask Jeeves Active (rebranded ask.com)
1997 AOL NetFind Active (rebranded AOL Search since 1999)
Northern Light Inactive
Yandex Active
1998 Google Active
Ixquick Active as Startpage.com
MSN Search Active as Bing
empas Inactive (merged with NATE)
1999 AlltheWeb Inactive (URL redirected to Yahoo!)
GenieKnows Active, rebranded Yellowee (redirection to justlocalbusiness.com)
Naver Active
Teoma Active (© APN, LLC)
2000 Baidu Active
Exalead Inactive
Gigablast Active
2001 Kartoo Inactive
2003 Info.com Active
Scroogle Inactive
2004 A9.com Inactive
Clusty Active (as Yippy)
Mojeek Active
Sogou Active
2005 SearchMe Inactive
KidzSearch Active, Google Search
2006 Soso Inactive, merged with Sogou
Quaero Inactive
Search.com Active
ChaCha Inactive
Ask.com Active
Live Search Active as Bing, rebranded MSN Search
2007 wikiseek Inactive
Sproose Inactive
Wikia Search Inactive
Blackle.com Active, Google Search
2008 Powerset Inactive (redirects to Bing)
Picollator Inactive
Viewzi Inactive
Boogami Inactive
LeapFish Inactive
Forestle Inactive (redirects to Ecosia)
DuckDuckGo Active
2009 Bing Active, rebranded Live Search
Yebol Inactive
Mugurdy Inactive due to a lack of funding
Scout (Goby) Active
NATE Active
Ecosia Active
Startpage.com Active, sister engine of Ixquick
2010 Blekko Inactive, sold to IBM
Cuil Inactive
Yandex (English) Active
Parsijoo Active
2011 YaCy Active, P2P
2012 Volunia Inactive
2013 Qwant Active
2014 Egerin Active, Kurdish / Sorani
Swisscows Active
2015 Yooz Active
Cliqz Inactive
2016 Kiddle Active, Google Search

Pre-1990s

A system for locating published information intended to overcome the ever increasing difficulty of locating information in ever-growing centralized indices of scientific work was described in 1945 by Vannevar Bush, who wrote an article in The Atlantic Monthly titled "As We May Think" in which he envisioned libraries of research with connected annotations not unlike modern hyperlinks. Link analysis would eventually become a crucial component of search engines through algorithms such as Hyper Search and PageRank.

1990s: Birth of search engines

The first internet search engines predate the debut of the Web in December 1990: Who is user search dates back to 1982, and the Knowbot Information Service multi-network user search was first implemented in 1989. The first well documented search engine that searched content files, namely FTP files, was Archie, which debuted on 10 September 1990.

Prior to September 1993, the World Wide Web was entirely indexed by hand. There was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One snapshot of the list in 1992 remains, but as more and more web servers went online the central list could no longer keep up. On the NCSA site, new servers were announced under the title "What's New!"

The first tool used for searching content (as opposed to users) on the Internet was Archie. The name stands for "archive" without the "v"., It was created by Alan Emtage computer science student at McGill University in Montreal, Quebec, Canada. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie Search Engine did not index the contents of these sites since the amount of data was so limited it could be readily searched manually.

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "Archie Search Engine" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine existed for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them into a standard format. This formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called "Wandex". The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below. Because of the limited resources available on the platform it ran on, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered.

One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the search engine that was widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

The first popular search engine on the Web was Yahoo! Search. The first product from Yahoo!, founded by Jerry Yang and David Filo in January 1994, was a Web directory called Yahoo! Directory. In 1995, a search function was added, allowing users to search Yahoo! Directory! It became one of the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than its full-text copies of web pages.

Soon after, a number of search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Information seekers could also browse the directory instead of doing a keyword-based search.

In 1996, Robin Li developed the RankDex site-scoring algorithm for search engines results page ranking and received a US patent for the technology. It was the first search engine that used hyperlinks to measure the quality of websites it was indexing, predating the very similar algorithm patent filed by Google two years later in 1998. Larry Page referenced Li's work in some of his U.S. patents for PageRank. Li later used his Rankdex technology for the Baidu search engine, which was founded by Robin Li in China and launched in 2000.

In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses in the Internet.

Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1990 and ended in 2000.

2000's-Present: Post dot-com bubble

Around 2000, Google's search engine rose to prominence. The company achieved better results for many searches with an algorithm called PageRank, as was explained in the paper Anatomy of a Search Engine written by Sergey Brin and Larry Page, the later founders of Google. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Larry Page's patent for PageRank cites Robin Li's earlier RankDex patent as an influence. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal. In fact, the Google search engine became so popular that spoof engines emerged such as Mystery Seeker.

By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart, blended with results from Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.

As of 2019, active search engine crawlers include those of Google, Sogou, Baidu, Bing, Gigablast, Mojeek, DuckDuckGo and Yandex.

Approach

A search engine maintains the following processes in near real time:

  1. Web crawling
  2. Indexing
  3. Searching

Web search engines get their information by web crawling from site to site. The "spider" checks for the standard filename robots.txt, addressed to it. The robots.txt file contains directives for search spiders, telling it which pages to crawl. After checking for robots.txt and either finding it or not, the spider sends certain information back to be indexed depending on many factors, such as the titles, page content, JavaScript, Cascading Style Sheets (CSS), headings, or its metadata in HTML meta tags. After a certain number of pages crawled, amount of data indexed, or time spent on the website, the spider stops crawling and moves on. "[N]o web crawler may actually crawl the entire reachable web. Due to infinite websites, spider traps, spam, and other exigencies of the real web, crawlers instead apply a crawl policy to determine when the crawling of a site should be deemed sufficient. Some websites are crawled exhaustively, while others are crawled only partially".

Indexing means associating words and other definable tokens found on web pages to their domain names and HTML-based fields. The associations are made in a public database, made available for web search queries. A query from a user can be a single word, multiple words or a sentence. The index helps find information relating to the query as quickly as possible. Some of the techniques for indexing, and caching are trade secrets, whereas web crawling is a straightforward process of visiting all sites on a systematic basis.

Between visits by the spider, the cached version of page (some or all the content needed to render it) stored in the search engine working memory is quickly sent to an inquirer. If a visit is overdue, the search engine can just act as a web proxy instead. In this case the page may differ from the search terms indexed. The cached page holds the appearance of the version whose words were previously indexed, so a cached version of a page can be useful to the web site when the actual page has been lost, but this problem is also considered a mild form of linkrot.

High-level architecture of a standard Web crawler

Typically when a user enters a query into a search engine it is a few keywords. The index already has the names of the sites containing the keywords, and these are instantly obtained from the index. The real processing load is in generating the web pages that are the search results list: Every page in the entire list must be weighted according to information in the indexes. Then the top search result item requires the lookup, reconstruction, and markup of the snippets showing the context of the keywords matched. These are only part of the processing each search results web page requires, and further pages (next to the top) require more of this post processing.

Beyond simple keyword lookups, search engines offer their own GUI- or command-driven operators and search parameters to refine the search results. These provide the necessary controls for the user engaged in the feedback loop users create by filtering and weighting while refining the search results, given the initial pages of the first search results. For example, from 2007 the Google.com search engine has allowed one to filter by date by clicking "Show search tools" in the leftmost column of the initial search results page, and then selecting the desired date range. It's also possible to weight by date because each page has a modification time. Most search engines support the use of the boolean operators AND, OR and NOT to help end users refine the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching where the research involves using statistical analysis on pages containing the words or phrases you search for. As well, natural language queries allow the user to type a question in the same form one would ask it to a human. A site like this would be ask.com.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This first form relies much more heavily on the computer itself to do the bulk of the work.

Most Web search engines are commercial ventures supported by advertising revenue and thus some of them allow advertisers to have their listings ranked higher in search results for a fee. Search engines that do not accept money for their search results make money by running search related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.

Local search

Local search is the process that optimizes efforts of local businesses. They focus on change to make sure all searches are consistent. It's important because many people determine where they plan to go and what to buy based on their searches.

Market share

As of February 2021, Google is the world's most used search engine, with a market share of 92.04%, and the world's other most used search engines were:

Russia and East Asia

In Russia, Yandex has a market share of 61.9%, compared to Google's 28.3%. In China, Baidu is the most popular search engine. South Korea's homegrown search portal, Naver, is used for 70% of online searches in the country. Yahoo! Japan and Yahoo! Taiwan are the most popular avenues for Internet searches in Japan and Taiwan, respectively. China is one of few countries where Google is not in the top three web search engines for market share. Google was previously a top search engine in China, but had to withdraw after failing to follow China's laws.

Europe

Most countries' markets in Western Europe are dominated by Google, except for the Czech Republic, where Seznam is a strong competitor.

Search engine bias

Although search engines are programmed to rank websites based on some combination of their popularity and relevancy, empirical studies indicate various political, economic, and social biases in the information they provide and the underlying assumptions about the technology. These biases can be a direct result of economic and commercial processes (e.g., companies that advertise with a search engine can become also more popular in its organic search results), and political processes (e.g., the removal of search results to comply with local laws). For example, Google will not surface certain neo-Nazi websites in France and Germany, where Holocaust denial is illegal.

Biases can also be a result of social processes, as search engine algorithms are frequently designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing algorithms of major search engines skew towards coverage of U.S.-based sites, rather than websites from non-U.S. countries.

Google Bombing is one example of an attempt to manipulate search results for political, social or commercial reasons.

Several scholars have studied the cultural changes triggered by search engines, and the representation of certain controversial topics in their results, such as terrorism in Ireland, climate change denial, and conspiracy theories.

Customized results and filter bubbles

Many search engines such as Google and Bing provide customized results based on the user's activity history. This leads to an effect that has been called a filter bubble. The term describes a phenomenon in which websites use algorithms to selectively guess what information a user would like to see, based on information about the user (such as location, past click behaviour and search history). As a result, websites tend to show only information that agrees with the user's past viewpoint. This puts the user in a state of intellectual isolation without contrary information. Prime examples are Google's personalized search results and Facebook's personalized news stream. According to Eli Pariser, who coined the term, users get less exposure to conflicting viewpoints and are isolated intellectually in their own informational bubble. Pariser related an example in which one user searched Google for "BP" and got investment news about British Petroleum while another searcher got information about the Deepwater Horizon oil spill and that the two search results pages were "strikingly different". The bubble effect may have negative implications for civic discourse, according to Pariser. Since this problem has been identified, competing search engines have emerged that seek to avoid this problem by not tracking or "bubbling" users, such as DuckDuckGo. Other scholars do not share Pariser's view, finding the evidence in support of his thesis unconvincing.

Religious search engines

The global growth of the Internet and electronic media in the Arab and Muslim World during the last decade has encouraged Islamic adherents in the Middle East and Asian sub-continent, to attempt their own search engines, their own filtered search portals that would enable users to perform safe searches. More than usual safe search filters, these Islamic web portals categorizing websites into being either "halal" or "haram", based on interpretation of the "Law of Islam". ImHalal came online in September 2011. Halalgoogling came online in July 2013. These use haram filters on the collections from Google and Bing (and others).

While lack of investment and slow pace in technologies in the Muslim World has hindered progress and thwarted success of an Islamic search engine, targeting as the main consumers Islamic adherents, projects like Muxlim, a Muslim lifestyle site, did receive millions of dollars from investors like Rite Internet Ventures, and it also faltered. Other religion-oriented search engines are Jewogle, the Jewish version of Google, and SeekFind.org, which is Christian. SeekFind filters sites that attack or degrade their faith.

Search engine submission

Web search engine submission is a process in which a webmaster submits a website directly to a search engine. While search engine submission is sometimes presented as a way to promote a website, it generally is not necessary because the major search engines use web crawlers that will eventually find most web sites on the Internet without assistance. They can either submit one web page at a time, or they can submit the entire site using a sitemap, but it is normally only necessary to submit the home page of a web site as search engines are able to crawl a well designed website. There are two remaining reasons to submit a web site or web page to a search engine: to add an entirely new web site without waiting for a search engine to discover it, and to have a web site's record updated after a substantial redesign.

Some search engine submission software not only submits websites to multiple search engines, but also adds links to websites from their own pages. This could appear helpful in increasing a website's ranking, because external links are one of the most important factors determining a website's ranking. However, John Mueller of Google has stated that this "can lead to a tremendous number of unnatural links for your site" with a negative impact on site ranking.

API

From Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/API

In computing, an application programming interface (API) is an interface that defines interactions between multiple software applications or mixed hardware-software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees. An API can be entirely custom, specific to a component, or designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, allowing users to use the interface independently of the implementation.

Reference to Web APIs is currently the most common use of the term. There are also APIs for programming languages, software libraries, computer operating systems, and computer hardware. APIs originated in the 1940s, though the term API did not emerge until the 1960s and 70s.

Purpose

In building applications, an API (application programming interface) simplifies programming by abstracting the underlying implementation and only exposing objects or actions the developer needs. While a graphical interface for an email client might provide a user with a button that performs all the steps for fetching and highlighting new emails, an API for file input/output might give the developer a function that copies a file from one location to another without requiring that the developer understand the file system operations occurring behind the scenes.

History of the term

A diagram from 1978 proposing the expansion of the idea of the API to become a general programming interface, beyond application programs alone.

The meaning of the term API has expanded over its history. It first described an interface only for end-user-facing programs, known as application programs. This origin is still reflected in the name "application programming interface." Today, the term API is broader, including also utility software and even hardware interfaces.

The idea of the API is much older than the term. British computer scientists Wilkes and Wheeler worked on modular software libraries in the 1940s for the EDSAC computer. Their book The Preparation of Programs for an Electronic Digital Computer contains the first published API specification. Joshua Bloch claims that Wilkes and Wheeler "latently invented" the API, because it is more of a concept that is discovered than invented.

Although the people who coined the term API were implementing software on a Univac 1108, the goal of their API was to make hardware independent programs possible.

The term "application program interface" (without an -ing suffix) is first recorded in a paper called Data structures and techniques for remote computer graphics presented at an AFIPS conference in 1968. The authors of this paper use the term to describe the interaction of an application — a graphics program in this case — with the rest of the computer system. A consistent application interface (consisting of Fortran subroutine calls) was intended to free the programmer from dealing with idiosyncrasies of the graphics display device, and to provide hardware independence if the computer or the display were replaced.

The term was introduced to the field of databases by C. J. Date in a 1974 paper called The Relational and Network Approaches: Comparison of the Application Programming Interface. An API became a part of ANSI/SPARC framework for database management systems. This framework treated the application programming interface separately from other interfaces, such as the query interface. Database professionals in the 1970s observed these different interfaces could be combined; a sufficiently rich application interface could support the other interfaces as well.

This observation led to APIs that supported all types of programming, not just application programming. By 1990, the API was defined simply as "a set of services available to a programmer for performing certain tasks" by technologist Carl Malamud.

The conception of the API was expanded again with the dawn of web APIs. Roy Fielding's dissertation Architectural Styles and the Design of Network-based Software Architectures at UC Irvine in 2000 outlined Representational state transfer (REST) and described the idea of a "network-based Application Programming Interface" that Fielding contrasted with traditional "library-based" APIs. XML and JSON web APIs saw widespread commercial adoption beginning in 2000 and continuing as of 2021.

The web API is now the most common meaning of the term API. When used in this way, the term API has some overlap in meaning with the terms communication protocol and remote procedure call.

The Semantic Web proposed by Tim Berners-Lee in 2001 included "semantic APIs" that recast the API as an open, distributed data interface rather than a software behavior interface. Instead, proprietary interfaces and agents became more widespread.

Usage

Libraries and frameworks

The interface to a software library is one type of API. The API describes and prescribes the "expected behavior" (a specification) while the library is an "actual implementation" of this set of rules.

A single API can have multiple implementations (or none, being abstract) in the form of different libraries that share the same programming interface.

The separation of the API from its implementation can allow programs written in one language to use a library written in another. For example, because Scala and Java compile to compatible bytecode, Scala developers can take advantage of any Java API.

API use can vary depending on the type of programming language involved. An API for a procedural language such as Lua could consist primarily of basic routines to execute code, manipulate data or handle errors while an API for an object-oriented language, such as Java, would provide a specification of classes and its class methods.

Language bindings are also APIs. By mapping the features and capabilities of one language to an interface implemented in another language, a language binding allows a library or service written in one language to be used when developing in another language. Tools such as SWIG and F2PY, a Fortran-to-Python interface generator, facilitate the creation of such interfaces.

An API can also be related to a software framework: a framework can be based on several libraries implementing several APIs, but unlike the normal use of an API, the access to the behavior built into the framework is mediated by extending its content with new classes plugged into the framework itself.

Moreover, the overall program flow of control can be out of the control of the caller and in the framework's hands by inversion of control or a similar mechanism.

Operating systems

An API can specify the interface between an application and the operating system. POSIX, for example, specifies a set of common APIs that aim to enable an application written for a POSIX conformant operating system to be compiled for another POSIX conformant operating system.

Linux and Berkeley Software Distribution are examples of operating systems that implement the POSIX APIs.

Microsoft has shown a strong commitment to a backward-compatible API, particularly within its Windows API (Win32) library, so older applications may run on newer versions of Windows using an executable-specific setting called "Compatibility Mode".

An API differs from an application binary interface (ABI) in that an API is source code based while an ABI is binary based. For instance, POSIX provides APIs while the Linux Standard Base provides an ABI.

Remote APIs

Remote APIs allow developers to manipulate remote resources through protocols, specific standards for communication that allow different technologies to work together, regardless of language or platform. For example, the Java Database Connectivity API allows developers to query many different types of databases with the same set of functions, while the Java remote method invocation API uses the Java Remote Method Protocol to allow invocation of functions that operate remotely, but appear local to the developer.

Therefore, remote APIs are useful in maintaining the object abstraction in object-oriented programming; a method call, executed locally on a proxy object, invokes the corresponding method on the remote object, using the remoting protocol, and acquires the result to be used locally as a return value.

A modification of the proxy object will also result in a corresponding modification of the remote object.

Web APIs

Web APIs are the defined interfaces through which interactions happen between an enterprise and applications that use its assets, which also is a Service Level Agreement (SLA) to specify the functional provider and expose the service path or URL for its API users. An API approach is an architectural approach that revolves around providing a program interface to a set of services to different applications serving different types of consumers.

When used in the context of web development, an API is typically defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. An example might be a shipping company API that can be added to an eCommerce-focused website to facilitate ordering shipping services and automatically include current shipping rates, without the site developer having to enter the shipper's rate table into a web database. While "web API" historically has been virtually synonymous with web service, the recent trend (so-called Web 2.0) has been moving away from Simple Object Access Protocol (SOAP) based web services and service-oriented architecture (SOA) towards more direct representational state transfer (REST) style web resources and resource-oriented architecture (ROA). Part of this trend is related to the Semantic Web movement toward Resource Description Framework (RDF), a concept to promote web-based ontology engineering technologies. Web APIs allow the combination of multiple APIs into new applications known as mashups. In the social media space, web APIs have allowed web communities to facilitate sharing content and data between communities and applications. In this way, content that is created in one place dynamically can be posted and updated to multiple locations on the web. For example, Twitter's REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

Design

The design of an API has significant impact on its usage. The principle of information hiding describes the role of programming interfaces as enabling modular programming by hiding the implementation details of the modules so that users of modules need not understand the complexities inside the modules. Thus, the design of an API attempts to provide only the tools a user would expect. The design of programming interfaces represents an important part of software architecture, the organization of a complex piece of software.

Release policies

APIs are one of the more common ways technology companies integrate. Those that provide and use APIs are considered as being members of a business ecosystem.

The main policies for releasing an API are:

  • Private: The API is for internal company use only.
  • Partner: Only specific business partners can use the API. For example, vehicle for hire companies such as Uber and Lyft allow approved third-party developers to directly order rides from within their apps. This allows the companies to exercise quality control by curating which apps have access to the API, and provides them with an additional revenue stream. Public: The API is available for use by the public. For example, Microsoft makes the Windows API public, and Apple releases its API Cocoa, so that software can be written for their platforms. Not all public APIs are generally accessible by everybody. For example, Internet service providers like Cloudflare or Voxility, use RESTful APIs to allow customers and resellers access to their infrastructure information, DDoS stats, network performance or dashboard controls. Access to such APIs is granted either by “API tokens”, or customer status validations.

Public API implications

An important factor when an API becomes public is its "interface stability". Changes to the API—for example adding new parameters to a function call—could break compatibility with the clients that depend on that API.

When parts of a publicly presented API are subject to change and thus not stable, such parts of a particular API should be documented explicitly as "unstable". For example, in the Google Guava library, the parts that are considered unstable, and that might change soon, are marked with the Java annotation @Beta.

A public API can sometimes declare parts of itself as deprecated or rescinded. This usually means that part of the API should be considered a candidate for being removed, or modified in a backward incompatible way. Therefore, these changes allow developers to transition away from parts of the API that will be removed or not supported in the future.

Client code may contain innovative or opportunistic usages that were not intended by the API designers. In other words, for a library with a significant user base, when an element becomes part of the public API, it may be used in diverse ways. On February 19, 2020, Akamai published their annual “State of the Internet” report, showcasing the growing trend of cybercriminals targeting public API platforms at financial services worldwide. From December 2017 through November 2019, Akamai witnessed 85.42 billion credential violation attacks. About 20%, or 16.55 billion, were against hostnames defined as API endpoints. Of these, 473.5 million have targeted financial services sector organizations.

Documentation

API documentation describes what services an API offers and how to use those services, aiming to cover everything a client would need to know for practical purposes.

Documentation is crucial for the development and maintenance of applications using the API. API documentation is traditionally found in documentation files but can also be found in social media such as blogs, forums, and Q&A websites.

Traditional documentation files are often presented via a documentation system, such as Javadoc or Pydoc, that has a consistent appearance and structure. However, the types of content included in the documentation differs from API to API.

In the interest of clarity, API documentation may include a description of classes and methods in the API as well as "typical usage scenarios, code snippets, design rationales, performance discussions, and contracts", but implementation details of the API services themselves are usually omitted.

Restrictions and limitations on how the API can be used are also covered by the documentation. For instance, documentation for an API function could note that its parameters cannot be null, that the function itself is not thread safe, Because API documentation tends to be comprehensive, it is a challenge for writers to keep the documentation updated and for users to read it carefully, potentially yielding bugs.

API documentation can be enriched with metadata information like Java annotations. This metadata can be used by the compiler, tools, and by the run-time environment to implement custom behaviors or custom handling.

It is possible to generate API documentation in a data-driven manner. By observing many programs that use a given API, it is possible to infer the typical usages, as well the required contracts and directives. Then, templates can be used to generate natural language from the mined data.

Dispute over copyright protection for APIs

In 2010, Oracle Corporation sued Google for having distributed a new implementation of Java embedded in the Android operating system. Google had not acquired any permission to reproduce the Java API, although permission had been given to the similar OpenJDK project. Judge William Alsup ruled in the Oracle v. Google case that APIs cannot be copyrighted in the U.S and that a victory for Oracle would have widely expanded copyright protection to a "functional set of symbols" and allowed the copyrighting of simple software commands:

To accept Oracle's claim would be to allow anyone to copyright one version of code to carry out a system of commands and thereby bar all others from writing its different versions to carry out all or part of the same commands.

In 2014, however, Alsup's ruling was overturned on appeal to the Court of Appeals for the Federal Circuit, though the question of whether such use of APIs constitutes fair use was left unresolved. 

In 2016, following a two-week trial, a jury determined that Google's reimplementation of the Java API constituted fair use, but Oracle vowed to appeal the decision. Oracle won on its appeal, with the Court of Appeals for the Federal Circuit ruling that Google's use of the APIs did not qualify for fair use. In 2019, Google appealed to the Supreme Court of the United States over both the copyrightability and fair use rulings, and the Supreme Court granted review. Due to the COVID-19 pandemic, the oral hearings in the case were delayed until October 2020.

PubMed

From Wikipedia, the free encyclopedia

PubMed logo blue.svg
Contact
Research centerUnited States National Library of Medicine (NLM)
Release dateJanuary 1996
Access
Websitepubmed.ncbi.nlm.nih.gov

PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintain the database as part of the Entrez system of information retrieval.

From 1971 to 1997, online access to the MEDLINE database had been primarily through institutional facilities, such as university libraries. PubMed, first released in January 1996, ushered in the era of private, free, home- and office-based MEDLINE searching. The PubMed system was offered free to the public starting in June 1997.

Content

In addition to MEDLINE, PubMed provides access to:

  • older references from the print version of Index Medicus, back to 1951 and earlier
  • references to some journals before they were indexed in Index Medicus and MEDLINE, for instance Science, BMJ, and Annals of Surgery
  • very recent entries to records for an article before it is indexed with Medical Subject Headings (MeSH) and added to MEDLINE
  • a collection of books available full-text and other subsets of NLM records
  • PMC citations
  • NCBI Bookshelf

Many PubMed records contain links to full text articles, some of which are freely available, often in PubMed Central and local mirrors, such as Europe PubMed Central.

Information about the journals indexed in MEDLINE, and available through PubMed, is found in the NLM Catalog.

As of 27 January 2020, PubMed has more than 30 million citations and abstracts dating back to 1966, selectively to the year 1865, and very selectively to 1809. As of the same date, 20 million of PubMed's records are listed with their abstracts, and 21.5 million records have links to full-text versions (of which 7.5 million articles are available, full-text for free). Over the last 10 years (ending 31 December 2019), an average of nearly 1 million new records were added each year. Approximately 12% of the records in PubMed correspond to cancer-related entries, which have grown from 6% in the 1950s to 16% in 2016. Other significant proportion of records correspond to "chemistry" (8.69%), "therapy" (8.39%), and "infection" (5%).

In 2016, NLM changed the indexing system so that publishers are able to directly correct typos and errors in PubMed indexed articles.

PubMed has been reported to include some articles published in predatory journals. MEDLINE and PubMed policies for the selection of journals for database inclusion are slightly different. Weaknesses in the criteria and procedures for indexing journals in PubMed Central may allow publications from predatory journals to leak into PubMed.

Characteristics

Website design

A new PubMed interface was launched in October 2009 and encouraged the use of such quick, Google-like search formulations; they have also been described as 'telegram' searches. By default the results are sorted by Most Recent, but this can be changed to Best Match, Publication Date, First Author, Last Author, Journal, or Title.

The PubMed website design and domain was updated in January 2020 and became default on 15 May 2020, with the updated and new features. There was a critical reaction from many researchers who frequently use the site.

PubMed for handhelds/mobiles

PubMed/MEDLINE can be accessed via handheld devices, using for instance the "PICO" option (for focused clinical questions) created by the NLM. A "PubMed Mobile" option, providing access to a mobile friendly, simplified PubMed version, is also available.

Search

Standard search

Simple searches on PubMed can be carried out by entering key aspects of a subject into PubMed's search window.

PubMed translates this initial search formulation and automatically adds field names, relevant MeSH (Medical Subject Headings) terms, synonyms, Boolean operators, and 'nests' the resulting terms appropriately, enhancing the search formulation significantly, in particular by routinely combining (using the OR operator) textwords and MeSH terms.

The examples given in a PubMed tutorial demonstrate how this automatic process works:

Causes Sleep Walking is translated as ("etiology"[Subheading] OR "etiology"[All Fields] OR "causes"[All Fields] OR "causality"[MeSH Terms] OR "causality"[All Fields]) AND ("somnambulism"[MeSH Terms] OR "somnambulism"[All Fields] OR ("sleep"[All Fields] AND "walking"[All Fields]) OR "sleep walking"[All Fields])

Likewise,

soft Attack Aspirin Prevention is translated as ("myocardial infarction"[MeSH Terms] OR ("myocardial"[All Fields] AND "infarction"[All Fields]) OR "myocardial infarction"[All Fields] OR ("heart"[All Fields] AND "attack"[All Fields]) OR "heart attack"[All Fields]) AND ("aspirin"[MeSH Terms] OR "aspirin"[All Fields]) AND ("prevention and control"[Subheading] OR ("prevention"[All Fields] AND "control"[All Fields]) OR "prevention and control"[All Fields] OR "prevention"[All Fields])

Comprehensive search

For optimal searches in PubMed, it is necessary to understand its core component, MEDLINE, and especially of the MeSH (Medical Subject Headings) controlled vocabulary used to index MEDLINE articles. They may also require complex search strategies, use of field names (tags), proper use of limits and other features; reference librarians and search specialists offer search services.

The search into PubMed's search window is only recommended for the search of unequivocal topics or new interventions that do not yet have a MeSH heading created, as well as for the search for commercial brands of medicines and proper nouns. It is also useful when there is no suitable heading or the descriptor represents a partial aspect. The search using the thesaurus MeSH is more accurate and will give fewer irrelevant results. In addition, it saves the disadvantage of the free text search in which the spelling, singular/plural or abbreviated differences have to be taken into consideration. On the other side, articles more recently incorporated into the database to which descriptors have not yet been assigned will not be found. Therefore, to guarantee an exhaustive search, a combination of controlled language headings and free text terms must be used.

Journal article parameters

When a journal article is indexed, numerous article parameters are extracted and stored as structured information. Such parameters are: Article Type (MeSH terms, e.g., "Clinical Trial"), Secondary identifiers, (MeSH terms), Language, Country of the Journal or publication history (e-publication date, print journal publication date).

Publication Type: Clinical queries/systematic reviews

Publication type parameter allows searching by the type of publication, including reports of various kinds of clinical research.

Secondary ID

Since July 2005, the MEDLINE article indexing process extracts identifiers from the article abstract and puts those in a field called Secondary Identifier (SI). The secondary identifier field is to store accession numbers to various databases of molecular sequence data, gene expression or chemical compounds and clinical trial IDs. For clinical trials, PubMed extracts trial IDs for the two largest trial registries: ClinicalTrials.gov (NCT identifier) and the International Standard Randomized Controlled Trial Number Register (IRCTN identifier).

See also

A reference which is judged particularly relevant can be marked and "related articles" can be identified. If relevant, several studies can be selected and related articles to all of them can be generated (on PubMed or any of the other NCBI Entrez databases) using the 'Find related data' option. The related articles are then listed in order of "relatedness". To create these lists of related articles, PubMed compares words from the title and abstract of each citation, as well as the MeSH headings assigned, using a powerful word-weighted algorithm. The 'related articles' function has been judged to be so precise that the authors of a paper suggested it can be used instead of a full search.

Mapping to MeSH

PubMed automatically links to MeSH terms and subheadings. Examples would be: "bad breath" links to (and includes in the search) "halitosis", "heart attack" to "myocardial infarction", "breast cancer" to "breast neoplasms". Where appropriate, these MeSH terms are automatically "expanded", that is, include more specific terms. Terms like "nursing" are automatically linked to "Nursing [MeSH]" or "Nursing [Subheading]". This feature is called Auto Term Mapping and is enacted, by default, in free text searching but not exact phrase searching (i.e. enclosing the search query with double quotes). This feature makes PubMed searches more sensitive and avoids false-negative (missed) hits by compensating for the diversity of medical terminology.

PubMed does not apply automatic mapping of the term in the following circumstances: by writing the quoted phrase (e.g., "kidney allograft"), when truncated on the asterisk (e.g., kidney allograft*), and when looking with field labels (e.g., Cancer [ti]).

My NCBI

The PubMed optional facility "My NCBI" (with free registration) provides tools for

  • saving searches
  • filtering search results
  • setting up automatic updates sent by e-mail
  • saving sets of references retrieved as part of a PubMed search
  • configuring display formats or highlighting search terms

and a wide range of other options. The "My NCBI" area can be accessed from any computer with web-access. An earlier version of "My NCBI" was called "PubMed Cubby".

LinkOut

LinkOut is an NLM facility to link and make available full-text local journal holdings. Some 3,200 sites (mainly academic institutions) participate in this NLM facility (as of March 2010), from Aalborg University in Denmark to ZymoGenetics in Seattle. Users at these institutions see their institution's logo within the PubMed search result (if the journal is held at that institution) and can access the full-text. Link out is being consolidated with Outside Tool as of the major platform update coming in the Summer of 2019.

PubMed Commons

In 2016, PubMed allows authors of articles to comment on articles indexed by PubMed. This feature was initially tested in a pilot mode (since 2013) and was made permanent in 2016. In February 2018, PubMed Commons was discontinued due to the fact that "usage has remained minimal".

askMEDLINE

askMEDLINE, a free-text, natural language query tool for MEDLINE/PubMed, developed by the NLM, also suitable for handhelds.

PubMed identifier

A PMID (PubMed identifier or PubMed unique identifier) is a unique integer value, starting at 1, assigned to each PubMed record. A PMID is not the same as a PMCID (PubMed Central identifier) which is the identifier for all works published in the free-to-access PubMed Central.

The assignment of a PMID or PMCID to a publication tells the reader nothing about the type or quality of the content. PMIDs are assigned to letters to the editor, editorial opinions, op-ed columns, and any other piece that the editor chooses to include in the journal, as well as peer-reviewed papers. The existence of the identification number is also not proof that the papers have not been retracted for fraud, incompetence, or misconduct. The announcement about any corrections to original papers may be assigned a PMID.

Each number that is entered in the PubMed search window is treated by default as if it were a PMID. Therefore, any reference in PubMed can be located using the PMID.

Alternative interfaces

MEDLINE is one of the databases which are accessible via PubMed. Several companies provide access to MEDLINE through their platforms.

The National Library of Medicine leases the MEDLINE information to a number of private vendors such as Embase, Ovid, Dialog, EBSCO, Knowledge Finder and many other commercial, non-commercial, and academic providers. As of October 2008, more than 500 licenses had been issued, more than 200 of them to providers outside the United States. As licenses to use MEDLINE data are available for free, the NLM in effect provides a free testing ground for a wide range of alternative interfaces and 3rd party additions to PubMed, one of a very few large, professionally curated databases which offers this option.

Lu identifies a sample of 28 current and free Web-based PubMed versions, requiring no installation or registration, which are grouped into four categories:

  1. Ranking search results, for instance: eTBLAST; MedlineRanker; MiSearch;
  2. Clustering results by topics, authors, journals etc., for instance: Anne O'Tate; ClusterMed;
  3. Enhancing semantics and visualization, for instance: EBIMed; MedEvi.
  4. Improved search interface and retrieval experience, for instance, askMEDLINEBabelMeSH; and PubCrawler.

As most of these and other alternatives rely essentially on PubMed/MEDLINE data leased under license from the NLM/PubMed, the term "PubMed derivatives" has been suggested. Without the need to store about 90 GB of original PubMed Datasets, anybody can write PubMed applications using the eutils-application program interface as described in "The E-utilities In-Depth: Parameters, Syntax and More", by Eric Sayers, PhD. Various citation format generators, taking PMID numbers as input, are examples of web applications making use of the eutils-application program interface. Sample web pages include Citation Generator - Mick Schroeder, Pubmed Citation Generator - Ultrasound of the Week, PMID2cite, and Cite this for me.

Data mining of PubMed

Alternative methods to mine the data in PubMed use programming environments such as Matlab, Python or R. In these cases, queries of PubMed are written as lines of code and passed to PubMed and the response is then processed directly in the programming environment. Code can be automated to systematically queries with different keywords such as disease, year, organs, etc. A recent publication (2017) found that the proportion of cancer-related entries in PubMed has risen from 6% in the 1950s to 16% in 2016.

The data accessible by PubMed can be mirrored locally using an unofficial tool such as MEDOC.

Millions of PubMed records augment various open data datasets about open access, like Unpaywall. Data analysis tools like Unpaywall Journals are used by libraries to assist with big deal cancellations: libraries can avoid subscriptions for materials already served by instant open access via open archives like PubMed Central.

Butane

From Wikipedia, the free encyclopedia ...