
Saturday, March 27, 2021

Biomedical text mining

From Wikipedia, the free encyclopedia

Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Considerations

Applying text mining approaches to biomedical text requires specific considerations common to the domain.

Availability of annotated text data

This figure presents several properties of a biomedical literature corpus prepared by Westergaard et al., comprising 15 million English-language full-text articles: (a) number of publications per year, 1823–2016; (b) temporal development in the distribution of six topical categories, 1823–2016; (c) development in the number of pages per article, 1823–2016.

Large annotated corpora used in the development and training of general purpose text mining methods (e.g., sets of movie dialogue, product reviews, or Wikipedia article text) are not specific for biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora. Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges and biomedical informatics researchers. Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).

Machine learning-based methods often require very large data sets as training data to build useful models. Manually annotating text corpora at that scale is rarely feasible, so training data may instead be products of weak supervision or purely statistical methods.
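As a rough, hypothetical illustration of weak supervision, the sketch below combines a few hand-written labeling functions into noisy training labels for a gene-disease association task. The functions and example sentences are invented for this example and are not drawn from any published corpus or framework.

```python
import re

# Minimal weak-supervision sketch: each "labeling function" votes on whether
# a sentence expresses a gene-disease association. Real systems combine many
# such noisy votes with a probabilistic model; here we simply take a majority.

def lf_keyword_associated(sentence: str) -> int:
    return 1 if re.search(r"\bassociated with\b", sentence, re.I) else 0

def lf_keyword_mutation(sentence: str) -> int:
    return 1 if re.search(r"\bmutation(s)? in\b", sentence, re.I) else 0

def lf_negation(sentence: str) -> int:
    return -1 if re.search(r"\bno (association|evidence)\b", sentence, re.I) else 0

LABELING_FUNCTIONS = [lf_keyword_associated, lf_keyword_mutation, lf_negation]

def weak_label(sentence: str) -> int:
    votes = sum(lf(sentence) for lf in LABELING_FUNCTIONS)
    return 1 if votes > 0 else 0  # noisy positive/negative label

corpus = [
    "Mutations in BRCA1 are associated with breast cancer.",
    "No association was found between gene X and disease Y.",
]
print([weak_label(s) for s in corpus])  # [1, 0]
```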

Data structure variation

Like other text documents, biomedical documents contain unstructured data. Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.

Uncertainty

Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.

Supporting clinical needs

Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians. This is a concern in environments where clinical decision support is expected to be informative and accurate.

Interoperability with clinical systems

New text mining systems must work with existing standards, electronic medical records, and databases. Methods for interfacing with clinical systems such as LOINC have been developed but require extensive organizational effort to implement and maintain.

Patient privacy

Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.

Processes

Specific subtasks are of particular concern when processing biomedical text.

Named entity recognition

Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
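As a minimal sketch of vocabulary-supported entity recognition, the following dictionary (gazetteer) lookup tags tokens against a hand-built lexicon; the entries are invented for illustration, and real systems would instead draw on large resources such as UMLS or on trained statistical and deep-learning models.

```python
# Minimal gazetteer-based NER sketch. The vocabulary is a toy example only.
GAZETTEER = {
    "tp53": "GENE",
    "brca1": "GENE",
    "aspirin": "DRUG",
    "breast cancer": "DISEASE",   # multi-word entry
}

def tag_entities(text: str):
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    entities = []
    i = 0
    while i < len(tokens):
        # try the longest match first (here: two tokens, then one)
        for span in (2, 1):
            candidate = " ".join(tokens[i:i + span])
            if candidate in GAZETTEER:
                entities.append((candidate, GAZETTEER[candidate]))
                i += span
                break
        else:
            i += 1
    return entities

print(tag_entities("TP53 mutations are common in breast cancer."))
# [('tp53', 'GENE'), ('breast cancer', 'DISEASE')]
```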

Document classification and clustering

Biomedical documents may be classified or clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups. These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon k-means clustering.
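A minimal clustering sketch, assuming scikit-learn is installed and using invented placeholder abstracts, might group documents by topic with TF-IDF features and k-means:

```python
# Sketch of k-means document clustering over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "BRCA1 mutations and hereditary breast cancer risk",
    "TP53 pathway alterations in tumor suppression",
    "Deep learning for radiology image segmentation",
    "Convolutional networks applied to chest X-ray images",
]

X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., two topical groups such as [0, 0, 1, 1]
```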

Relationship discovery

Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.
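One simple, hedged way to surface candidate relationships is to count how often pairs of recognized entities co-occur within the same sentence. The sketch below assumes an upstream NER step has already produced per-sentence entity lists, hard-coded here for illustration.

```python
# Sentence-level co-occurrence as a crude signal of relatedness between
# entities found by an upstream NER step.
from collections import Counter
from itertools import combinations

sentences_with_entities = [
    ["BRCA1", "breast cancer"],
    ["BRCA1", "TP53"],
    ["BRCA1", "breast cancer"],
]

pair_counts = Counter()
for entities in sentences_with_entities:
    for a, b in combinations(sorted(set(entities)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts.most_common(2))
# [(('BRCA1', 'breast cancer'), 2), (('BRCA1', 'TP53'), 1)]
```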

Hedge cue detection

The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
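A minimal sketch of cue-based hedge detection is shown below; the cue list is a small, invented sample, whereas real systems are typically trained on annotated resources (e.g., the BioScope corpus) and also resolve the scope of each cue.

```python
# Minimal hedge-cue sketch: flag sentences containing common hedging terms.
HEDGE_CUES = {"may", "might", "suggests", "suggest", "possibly",
              "appears", "likely", "putative"}

def find_hedge_cues(sentence: str):
    words = sentence.lower().strip(".").split()
    return [w for w in words if w in HEDGE_CUES]

print(find_hedge_cues("These results suggest that gene X may regulate apoptosis."))
# ['suggest', 'may']
```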

Claim detection

Multiple researchers have developed methods to identify specific scientific claims from literature. In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document (a process known as argument mining, employing tools used in fields such as political science) and comparing claims to find potential contradictions between them.

Information extraction

Information extraction, or IE, is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as gene A inhibits gene B and gene C is involved in disease G. Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
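As a hedged illustration of this kind of extraction, the sketch below uses simple surface patterns to pull "gene A inhibits gene B"-style triples into a structured list. The patterns and sentences are invented, and practical IE systems rely on far richer linguistic analysis.

```python
import re

# Sketch of pattern-based information extraction: pull simple
# subject-relation-object triples into a structured list.
PATTERN = re.compile(r"(\w+)\s+(inhibits|activates|is involved in)\s+([\w\s]+?)[.,]")

def extract_triples(text: str):
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in PATTERN.finditer(text)]

text = "GeneA inhibits GeneB. GeneC is involved in DiseaseG."
print(extract_triples(text))
# [('GeneA', 'inhibits', 'GeneB'), ('GeneC', 'is involved in', 'DiseaseG')]
```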

Information retrieval and question answering

Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as PubMed search allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.
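For example, literature retrieval can be scripted against the public NCBI E-utilities interface. The sketch below assumes the requests library is installed and that the esearch endpoint behaves as documented; consult NCBI usage guidelines before running queries at scale.

```python
# Hedged sketch: querying PubMed through the NCBI E-utilities "esearch" endpoint.
import requests

def pubmed_search(query: str, retmax: int = 5):
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

print(pubmed_search("BRCA1 AND breast cancer"))  # list of PubMed IDs (PMIDs)
```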

On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.

Resources

Word embeddings

Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Pre-trained embeddings specific to biomedical vocabulary are available from a number of sources; the majority are products of the word2vec model developed by Mikolov et al. or variants of word2vec.
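A minimal sketch of training such embeddings, assuming gensim (version 4 or later) is installed and using an invented toy corpus, could look like this; real biomedical embeddings are trained on millions of abstracts or full-text articles.

```python
# Sketch of training word2vec embeddings on a toy tokenized corpus.
from gensim.models import Word2Vec

sentences = [
    ["brca1", "mutation", "increases", "breast", "cancer", "risk"],
    ["tp53", "mutation", "is", "common", "in", "many", "cancers"],
    ["aspirin", "reduces", "inflammation"],
    ["brca1", "and", "tp53", "are", "tumor", "suppressor", "genes"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50, seed=1)
print(model.wv["brca1"].shape)                 # (50,)
print(model.wv.most_similar("brca1", topn=2))  # nearest neighbours in the toy space
```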

Applications

An example of a text mining protocol, shown as a flowchart, used in a study of protein-protein complexes, or protein docking.

Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations.

Gene cluster identification

Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.

Protein interactions

Automatic extraction of protein interactions and associations of proteins to functional concepts (e.g. gene ontology terms) has been explored. The search engine PIE was developed to identify and return protein-protein interaction mentions from MEDLINE-indexed articles. The extraction of kinetic parameters from text, and of the subcellular location of proteins, has also been addressed by information extraction and text mining technology.

Gene-disease associations

Text mining can aid in gene prioritization, or identification of genes most likely to contribute to genetic disease. One group compared several vocabularies, representations and ranking algorithms to develop gene prioritization benchmarks.

Gene-trait associations

An agricultural genomics group identified genes related to bovine reproductive traits using text mining, among other approaches.

Protein-disease associations

Text mining enables an unbiased evaluation of protein-disease relationships within a vast quantity of unstructured textual data.

Applications of phrase mining to disease associations

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), then semantically scored all 709 proteins according to their Integrity, Popularity, and Distinctiveness using the CaseOLAP pipeline. The text mining study validated existing relationships and informed previously unrecognized biological processes in cardiovascular pathophysiology.

Software tools

Search engines

Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include PubMed search, Europe PubMed Central search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI.

Some search engines, such as Essie, OncoSearch, PubGene, and GoPubMed, were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.

Medical record analysis systems

Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text. Numerous complete systems and tools have been developed to analyse these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.

Frameworks

Computational frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g., UMLS semantic types). The SparkText framework uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.

APIs

Some biomedical text mining and natural language processing tools are available through application programming interfaces, or APIs. NOBLE Coder performs concept recognition through an API.

Text mining

From Wikipedia, the free encyclopedia

Text mining, also referred to as text data mining, similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005), we can distinguish three different perspectives of text mining: information extraction, data mining, and a KDD (Knowledge Discovery in Databases) process. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.
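As a toy illustration of this "text to data" step, the sketch below tokenizes two invented documents and builds term-frequency counts that downstream analysis could consume.

```python
# Tiny illustration of turning text into data: tokenize documents and
# build a term-frequency table.
from collections import Counter

documents = [
    "Text mining turns text into data.",
    "Data mining finds patterns in data.",
]

def term_frequencies(doc: str) -> Counter:
    tokens = [t.strip(".,").lower() for t in doc.split()]
    return Counter(tokens)

for doc in documents:
    print(term_frequencies(doc))
# Counter({'text': 2, 'mining': 1, 'turns': 1, 'into': 1, 'data': 1})
# Counter({'data': 2, 'mining': 1, 'finds': 1, 'patterns': 1, 'in': 1})
```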

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. The document is the basic unit when starting with text mining; here, we define a document as a unit of textual data, which normally exists in many types of collections.

Text analytics

The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of "text mining" in 2004 to describe "text analytics". The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

Text analysis processes

Subtasks—components of a larger text-analytics effort—typically include:

  • Dimensionality reduction is an important technique for pre-processing data. It is used to identify the root form of actual words and to reduce the size of the text data.
  • Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.
  • Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part of speech tagging, syntactic parsing, and other types of linguistic analysis.
  • Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.
  • Disambiguation—the use of contextual clues—may be required to decide where, for instance, "Ford" can refer to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.
  • Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expressions or other pattern matches (see the sketch after this list).
  • Document clustering: identification of sets of similar text documents.
  • Coreference: identification of noun phrases and other terms that refer to the same object.
  • Relationship, fact, and event extraction: identification of associations among entities and other information in text.
  • Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
  • Quantitative text analysis is a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose of psychological profiling etc.
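As referenced in the pattern-identified entities item above, a hedged sketch of regular-expression entity spotting might look like the following; the patterns are deliberately simple and will miss many real-world variants.

```python
import re

# Sketch of "pattern-identified entities": telephone numbers, e-mail
# addresses, and quantities with units found via regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "QUANTITY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|kg|ml|km|%)\b"),
}

def find_pattern_entities(text: str):
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "Contact jane.doe@example.org or +1 555-123-4567; the dose was 2.5 mg."
print(find_pattern_entities(sample))
# {'EMAIL': ['jane.doe@example.org'], 'PHONE': ['+1 555-123-4567'], 'QUANTITY': ['2.5 mg']}
```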

Applications

Text mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for e-discovery, for example. Governments and military groups use text mining for national security and intelligence purposes. Scientific researchers incorporate text mining approaches into efforts to organize large sets of text data (i.e., addressing the problem of unstructured data), to determine ideas communicated through text (e.g., sentiment analysis in social media) and to support scientific discovery in fields such as the life sciences and bioinformatics. In business, applications are used to support competitive intelligence and automated ad placement, among numerous other activities.

Security applications

Many text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes. It is also involved in the study of text encryption/decryption.

Biomedical applications

An example of a text mining protocol, shown as a flowchart, used in a study of protein-protein complexes, or protein docking.

A range of text mining applications in the biomedical literature has been described, including computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. In addition, with large patient textual datasets in the clinical field, datasets of demographic information in population studies and adverse event reports, text mining can facilitate clinical studies and precision medicine. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests. One online text mining application in the biomedical literature is PubGene, a publicly accessible search engine that combines biomedical text mining with network visualization. GoPubMed is a knowledge-based search engine for biomedical texts. Text mining techniques also enable us to extract unknown knowledge from unstructured documents in the clinical domain.

Software applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities. For study purposes, Weka software is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called NLTK for more general purposes. For more advanced programmers, there's also the Gensim library, which focuses on word embedding-based text representations.

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Business and marketing applications

Text analytics is being used in business, particularly, in marketing, such as in customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition). Text mining is also being applied in stock returns prediction.

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie. Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for affectivity of words and concepts have been made for WordNet and ConceptNet, respectively.
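As one hedged example of lexicon-based sentiment scoring, the sketch below uses NLTK's VADER analyzer, assuming NLTK is installed and the vader_lexicon resource has been downloaded; VADER is a general-purpose lexicon rather than one built specifically for movie reviews.

```python
# Hedged sketch: lexicon-based sentiment scoring with NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "A brilliant, moving film with superb performances.",
    "Dull plot, wooden acting, and a painfully slow second act.",
]
for review in reviews:
    print(sia.polarity_scores(review)["compound"], review)
# compound scores near +1 indicate favorable reviews, near -1 unfavorable ones
```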

Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.

Scientific literature mining and academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within the written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within the text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative.

Methods for scientific literature mining

Computational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching, determining novelty, and clarifying homonyms among technical reports.

Digital humanities and computational sociology

The automatic analysis of vast textual corpora has created the possibility for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning.

Narrative network of the 2012 US elections.

The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed by using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes. This automates the approach introduced by quantitative narrative analysis, whereby subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.
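A hedged sketch of the subject-verb-object extraction underlying this approach is shown below, using spaCy's dependency parser and assuming the en_core_web_sm model is installed; real quantitative narrative analysis pipelines are considerably more elaborate.

```python
# Hedged sketch of subject-verb-object triple extraction with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text: str):
    triples = []
    for token in nlp(text):
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(svo_triples("The senator criticized the bill. The governor praised the reform."))
# e.g., [('senator', 'criticize', 'bill'), ('governor', 'praise', 'reform')]
```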

Content analysis has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a "big data" revolution to take place in that field, with studies in social media and newspaper content that include millions of news items. Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents. The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al. showing how different topics have different gender biases and levels of readability; the possibility to detect mood patterns in a vast population by analyzing Twitter content was demonstrated as well.

Software

Text mining computer programs are available from many commercial and open source companies and sources.

Intellectual property law

Situation in Europe

Video by the Fix Copyright campaign explaining TDM and its copyright issues in the EU, 2016 (3:52).

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the Hargreaves review, the government amended copyright law to allow text mining as a limitation and exception. It was the second country in the world to do so, following Japan, which introduced a mining-specific exception in 2009. However, owing to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions.

The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licenses for Europe. The fact that the focus on the solution to this legal issue was licenses, and not limitations and exceptions to copyright law, led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.

Situation in the United States

US copyright law, and in particular its fair use provisions, means that text mining in America, as well as other fair use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As text mining is transformative, meaning that it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed—one such use being text and data mining.

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social networks analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.

Future

Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades. It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.

Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.

Speech translation

From Wikipedia, the free encyclopedia

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

How it works

A speech translation system would typically integrate the following three software technologies: automatic speech recognition (ASR), machine translation (MT) and speech synthesis (text-to-speech, TTS).

The speaker of language A speaks into a microphone and the speech recognition module recognizes the utterance. It compares the input with a phonological model, consisting of a large corpus of speech data from multiple speakers. The input is then converted into a string of words, using a dictionary and the grammar of language A, based on a massive corpus of text in language A.

The machine translation module then translates this string. Early systems replaced every word with a corresponding word in language B. Current systems do not use word-for-word translation, but rather take into account the entire context of the input to generate the appropriate translation. The generated translation utterance is sent to the speech synthesis module, which estimates the pronunciation and intonation matching the string of words based on a corpus of speech data in language B. Waveforms matching the text are selected from this database and the speech synthesis connects and outputs them.
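A minimal sketch of this cascade, with hypothetical placeholder components rather than any specific ASR, MT, or TTS library, is shown below.

```python
# Hedged sketch of the classic cascade architecture described above.
# The three component methods are hypothetical placeholders standing in
# for real ASR, MT, and TTS engines; no specific library is implied.
from dataclasses import dataclass

@dataclass
class SpeechTranslator:
    source_lang: str
    target_lang: str

    def recognize(self, audio: bytes) -> str:
        """ASR: convert source-language audio into a string of words."""
        raise NotImplementedError("plug in a real speech recognizer here")

    def translate(self, text: str) -> str:
        """MT: translate the recognized text, using sentence-level context."""
        raise NotImplementedError("plug in a real machine translation system here")

    def synthesize(self, text: str) -> bytes:
        """TTS: render the translated text as target-language audio."""
        raise NotImplementedError("plug in a real speech synthesizer here")

    def speak(self, audio: bytes) -> bytes:
        # The cascade: ASR output feeds MT, MT output feeds TTS.
        return self.synthesize(self.translate(self.recognize(audio)))
```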

History

In 1983, NEC Corporation demonstrated speech translation as a concept exhibit at the ITU Telecom World (Telecom '83).

In 1999, the C-Star-2 consortium demonstrated speech-to-speech translation of 5 languages including English, Japanese, Italian, Korean, and German.

Features

Apart from the problems involved in text translation, speech translation also has to deal with problems specific to speech-to-speech translation, including the incoherence of spoken language, fewer grammatical constraints in spoken language, unclear word boundaries in spoken language, the correction of speech recognition errors, and multiple optional inputs. Additionally, speech-to-speech translation has its advantages compared with text translation, including the less complex structure and smaller vocabulary of spoken language.

Research and development

Research and development has gradually progressed from relatively simple to more advanced translation. International evaluation workshops were established to support the development of speech-translation technology. They allow research institutes to cooperate and compete against each other at the same time. The concept of these workshops is a kind of contest: a common dataset is provided by the organizers, and the participating research institutes create systems that are evaluated. In this way, efficient research is promoted.

The International Workshop on Spoken Language Translation (IWSLT), organized by C-STAR, an international consortium for research on speech translation, has been held since 2004. "Every year, the number of participating institutes increases, and it has become a key event for speech translation research."

Standards

As more countries begin to research and develop speech translation, it will be necessary to standardize interfaces and data formats to ensure that the systems are mutually compatible. International joint research is being fostered by speech translation consortiums (e.g. the C-STAR international consortium for joint research on speech translation and A-STAR for the Asia-Pacific region). They were founded as "international joint-research organization[s] to design formats of bilingual corpora that are essential to advance the research and development of this technology ... and to standardize interfaces and data formats to connect speech translation module internationally".

Applications

Today, speech translation systems are being used throughout the world. Examples include medical facilities, schools, police, hotels, retail stores, and factories. These systems are applicable anywhere that spoken language is being used to communicate. A popular application is Jibbigo, which works offline.

Challenges and future prospects

Currently, speech translation technology is available as products that instantly translate free-form multilingual conversations. These systems instantly translate continuous speech. Challenges in accomplishing this include overcoming speaker-dependent variations in style of speaking or pronunciation in order to provide high-quality translation for all users. Moreover, speech recognition systems must be able to cope with external factors such as acoustic noise or speech by other speakers in real-world use of speech translation systems.

Because the user does not understand the target language when speech translation is used, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language". In order to achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora and text corpora for each of the estimated 6,000 languages said to exist on our planet today.

As the collection of corpora is extremely expensive, collecting data from the Web would be an alternative to conventional methods. "Secondary use of news or other media published in multiple languages would be an effective way to improve performance of speech translation." However, "current copyright law does not take secondary uses such as these types of corpora into account" and thus "it will be necessary to revise it so that it is more flexible."

Universal translator

From Wikipedia, the free encyclopedia

A universal translator is a device common to many science fiction works, especially on television. First described in Murray Leinster's 1945 novella "First Contact", the translator's purpose is to offer an instant translation of any language.

As a convention, it is used to remove the problem of translating between alien languages, unless that problem is essential to the plot. To translate a new language in every episode when a new species or culture is encountered would consume time (especially when most of these shows have a half-hour or one-hour format) normally allotted for plot development and would potentially, across many episodes, become repetitive to the point of annoyance. Occasionally, alien races are portrayed as being able to extrapolate the rules of English from little speech and then immediately be fluent in it, making the translator unnecessary.

While a universal translator seems unlikely, due to the apparent need for telepathy, scientists continue to work towards similar real-world technologies involving small numbers of known languages.

General

As a rule, a universal translator is instantaneous, but if that language has never been recorded, there is sometimes a time delay until the translator can properly work out a translation, as is true of Star Trek. The operation of these translators is often explained as using some form of telepathy by reading the brain patterns of the speaker(s) to determine what they are saying; some writers seek greater plausibility by instead having computer translation that requires collecting a database of the new language, often by listening to radio transmissions.

The existence of a universal translator tends to raise questions from a logical perspective, such as:

  • The continued functioning of the translator even when no device is evident;
  • Multiple speakers hear speech in one and only one language (so for example, for a Spanish speaker and a German speaker listening to an Italian speaker the Spanish speaker would only hear Spanish and neither the original Italian nor the translated German, while the German speaker would not hear any Spanish nor Italian but only German);
  • Characters' mouths move in sync with the translated words and not the original language;
  • The ability of the translator to function in real time even for languages with different word order (for example, the phrase the horse standing in front of the barn would end up in Japanese as 納屋の前に立っている馬, lit. barn-in-front-at-standing-horse, yet there is no delay for the Japanese listener even though the English speaker has yet to mention the barn).

Nonetheless, it removes the need for cumbersome and potentially extensive subtitles, and it eliminates the rather unlikely supposition that every other race in the galaxy has gone to the trouble of learning English.

Fictional depictions

Doctor Who

Using a telepathic field, the TARDIS automatically translates most comprehensible languages (written and spoken) into a language understood by its pilot and each of the crew members. The field also translates what they say into a language appropriate for that time and location (e.g., speaking the appropriate dialect of Latin when in ancient Rome). This system has frequently been featured as a main part of the show. The TARDIS, and by extension a number of its major systems, including the translator, are telepathically linked to its pilot, the Doctor. None of these systems appear able to function reliably when the Doctor is incapacitated. In "The Fires of Pompeii", when companion Donna Noble attempts to speak the local language directly, her words are humorously rendered into what sounds to a local like Welsh. One flaw of this translation process is that a word cannot be translated if the target language has no concept for it; for example, the Romans have no word for, or general understanding of, "volcano", as Mt. Vesuvius has not erupted yet.

Farscape

On the TV show Farscape, John Crichton is injected with bacteria called translator microbes which function as a sort of universal translator. The microbes colonize the host's brainstem and translate anything spoken to the host, passing along the translated information to the host's brain. This does not enable the injected person to speak other languages; they continue to speak their own language and are only understood by others as long as the listeners possess the microbes. The microbes sometimes fail to properly translate slang, translating it literally. Also, the translator microbes cannot translate the natural language of the alien Pilots or Diagnosans because every word in their language can contain thousands of meanings, far too many for the microbes to translate; thus Pilots must learn to speak in "simple sentences", while Diagnosans require interpreters. The implanted can learn to speak new languages if they want or to make communicating with non-injected individuals possible. The crew of Moya learned English from Crichton, thereby being able to communicate with the non-implanted populace when the crew visited Earth. Some species, such as the Kalish, cannot use translator microbes because their body rejects them, so they must learn a new language through their own efforts.

The Hitchhiker's Guide to the Galaxy

In the universe of The Hitchhiker's Guide to the Galaxy, universal translation is made possible by a small fish called a "babel fish". The fish is inserted into the auditory canal where it feeds off the mental frequencies of those speaking to its host. In turn it excretes a translation into the brain of its host.

The book remarks that, by allowing everyone to understand each other, the babel fish has caused more wars than anything else in the universe.

The book also explains that the babel fish could not possibly have developed naturally and therefore proves the existence of God as its creator, which in turn proves the non-existence of God. Since God needs faith to exist, and this proof dispels the need for faith, this therefore causes God to vanish "in a puff of logic".

Men in Black

The Men in Black franchise possesses a universal translator, which, as Agent K explains in the first film, Men in Black, they are not allowed to have because "human thought is so primitive, it's looked upon as an infectious disease in some of the better galaxies," prompting his remark, "That kinda makes you proud, doesn't it?"

Neuromancer

In William Gibson's novel Neuromancer, along with the other novels in his Sprawl trilogy, Count Zero and Mona Lisa Overdrive, devices known as "microsofts" are small chips plugged into "wetware" sockets installed behind the user's ear, giving them certain knowledge and/or skills as long as they are plugged in, such as the ability to speak another language. (The name is a combination of the words "micro" and "soft", and is not named after the software firm Microsoft.)

Star Control

In the Star Control computer game series, almost all races are implied to have universal translators; however, discrepancies between the ways aliens choose to translate themselves sometimes crop up and complicate communications. The VUX, for instance, are cited as having uniquely advanced skills in linguistics and can translate human language long before humans are capable of doing the same to the VUX. This created a problem during the first contact between the VUX and humans, in a starship commanded by Captain Rand. According to Star Control: Great Battles of the Ur-Quan Conflict, Captain Rand is referred to as saying "That is one ugly sucker" when the image of a VUX first came onto his viewscreen. However, in Star Control II, Captain Rand is referred to as saying "That is the ugliest freak-face I've ever seen" to his first officer, along with joking that the VUX name stands for Very Ugly Xenoform. It is debatable which source is canon. Whichever remark it was, it is implied that the VUX's advanced universal translator technologies conveyed the exact meaning of Captain Rand's words. The effete VUX used the insult as an excuse for hostility toward humans.

Also, a new race called the Orz was introduced in Star Control II. They presumably come from another dimension, and at first contact, the ship's computer says that there are many vocal anomalies in their language resulting from their referring to concepts or phenomena for which there are no equivalents in human language. The result is dialogue that is a patchwork of ordinary words and phrases marked with *asterisk pairs* indicating that they are loose translations of unique Orz concepts into human language, a full translation of which would probably require paragraph-long definitions. (For instance, the Orz refer to the human dimension as *heavy space* and their own as *Pretty Space*, to various categories of races as *happy campers* or *silly cows*, and so on.)

In the other direction, the Supox are a race portrayed as attempting to mimic as many aspects of other races' language and culture as possible when speaking to them, to the point of referring to their own planet as “Earth,” also leading to confusion.

In Star Control III, the K’tang are portrayed as an intellectually inferior species using advanced technology they do not fully understand to intimidate people, perhaps explaining why their translators’ output is littered with misspellings and nonstandard usages of words, like threatening to “crushify” the player. Along the same lines, the Daktaklakpak dialogue is highly stilted and contains many numbers and mathematical expressions, implying that, as a mechanical race, their thought processes are inherently too different from humans’ to be directly translated into human language.

Stargate

In the television shows Stargate SG-1 and Stargate Atlantis, there are no personal translation devices used, and most alien and Human cultures on other planets speak English. The makers of the show have themselves admitted this on the main SG-1 site, stating that this is to save spending ten minutes an episode on characters learning a new language (early episodes of SG-1 revealed the difficulties of attempting to write such processes into the plot). In the season 8 finale of SG-1, “Moebius (Part II),” the characters go back in time to 3000 B.C. and one of them teaches English to the people there.

A notable exception to this rule is the Goa’uld, who occasionally speak their own language amongst themselves or when giving orders to their Jaffa. This is never subtitled, but occasionally a translation is given by a third character (usually Teal’c or Daniel Jackson), ostensibly for the benefit of the human characters nearby who do not speak Goa’uld. The Asgard are also shown having their own language (apparently related to the Norse languages), although it is English played backwards.

In contrast a major plot element of the original Stargate film was that Daniel Jackson had to learn the language of the people of Abydos in the common way, which turned out to be derived from ancient Egyptian. The language had been extinct on Earth for many millennia, but Jackson eventually realized that it was merely errors in pronunciation that prevented effective communication.

Star Trek

In Star Trek, the universal translator was used by Ensign Hoshi Sato, the communications officer on the Enterprise in Star Trek: Enterprise, to invent the linguacode matrix. It was supposedly first used in the late 22nd century on Earth for the instant translation of well-known Earth languages. Gradually, with the removal of language barriers, Earth's disparate cultures came to terms of universal peace. Translations of previously unknown languages, such as those of aliens, posed greater difficulties.

Like most other common forms of Star Trek technology (warp drive, transporters, etc.), the universal translator was probably developed independently on several worlds as an inevitable requirement of space travel; certainly the Vulcans had no difficulty communicating with humans upon making "first contact" (although the Vulcans could have learned Standard English from monitoring Earth radio transmissions). The Vulcan ship that landed during First Contact was a survey vessel. The Vulcans had been surveying the humans for over a hundred years when first contact was actually made by T'Pol's great-grandmother, T'mir, in the episode "Carbon Creek"; however, in Star Trek: First Contact it is implied that they learned English by surveying the planets in the Solar System. Deanna Troi mentions the Vulcans have no interest in Earth as it is "too primitive", but the Prime Directive states not to interfere with pre-warp species. The Vulcans only noticed the warp trail and came to investigate.

Improbably, the universal translator has been successfully used to interpret non-biological lifeform communication (in the Original Series episode "Metamorphosis"). In the Star Trek: The Next Generation (TNG) episode "The Ensigns of Command", the translator proved ineffective with the language of the Sheliaks, so the Federation had to depend on the aliens' interpretation of Earth languages. It is speculated that the Sheliak communicate amongst themselves in extremely complex legalese. The TNG episode "Darmok" also illustrates another instance where the universal translator proves ineffective and unintelligible, because the Tamarian language is too deeply rooted in local metaphor.

Unlike virtually every other form of Federation technology, universal translators almost never break down. A notable exception is in the Star Trek: Discovery episode "An Obol for Charon", where alien interference causes the translator to malfunction and translate crew speech and computer text into multiple languages at random, requiring Commander Saru's fluency in nearly one hundred languages to repair the problem. Although universal translators were clearly in widespread use during this era and Captain Kirk's time (inasmuch as the crew regularly communicated with species who could not conceivably have knowledge of Standard English), it is unclear where they were carried on personnel of that era.

The episode "Metamorphosis" was the only time in which the device was actually seen. In the episode "Arena" the Metrons supply Captain Kirk and the Gorn commander with a Translator-Communicator, allowing conversation between them to be possible. During Kirk's era, they were also apparently less perfect in their translations into Klingon. In the sixth Star Trek film, the characters are seen relying on print books in order to communicate with a Klingon military ship, since Chekov said that the Klingons would recognize the use of a Translator. Actress Nichelle Nichols reportedly protested this scene, as she felt that Uhura, as communications officer during what was effectively a cold war, would be trained in fluent Klingon to aid in such situations. In that same movie during the trial scene of Kirk and McCoy before a Klingon judiciary, the Captain and the Doctor are holding communication devices while a Klingon (played by Todd Bryant) translates for them. The novelization of that movie provided a different reason for the use of books: sabotage by somebody working on the Starfleet side of the conspiracy uncovered by the crew in the story, but the novelization is not part of the Star Trek canon.

By the 24th century, universal translators are built into the communicator pins worn by Starfleet personnel, although there were instances when crew members (such as Riker in the Next Generation episode "First Contact") spoke to newly encountered aliens even when deprived of their communicators. In the Star Trek: Voyager episode "The 37's" the device apparently works among intra-species languages as well; after the Voyager crew discovers and revives eight humans abducted in 1937 (including Amelia Earhart and Fred Noonan) and held in stasis since then, a Japanese Army officer expresses surprise that an Ohio farmer is apparently speaking Japanese, while the farmer is equally surprised to hear the soldier speaking English (the audience hears them all speaking English only, however). Certain Starfleet programs, such as the Emergency Medical Hologram, have universal translators encoded into the programming.

The Star Trek: The Next Generation Technical Manual says that the universal translator is an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix", enabling instantaneous conversion of verbal utterances or written text between the alien language and American English / Federation Standard.

In some episodes of Star Trek: Deep Space Nine, we see a Cardassian universal translator at work. It takes some time to process an alien language, whose speakers are initially not understandable but as they continue speaking, the computer gradually learns their language and renders it into Standard English (also known as Federation Standard).

Ferengi customarily wear their universal translators as an implant in their ears. In the Star Trek: Deep Space Nine (DS9) episode "Little Green Men", in which the show's regular Ferengi accidentally become the three aliens in Roswell, the humans without translators are unable to understand the Ferengi (who likewise can not understand the English spoken by the human observers) until the Ferengi get their own translators working. Similarly, throughout all Trek series, a universal translator possessed by only one party can audibly broadcast the results within a limited range, enabling communication between two or more parties, all speaking different languages. The devices appear to be standard equipment on starships and space stations, where a communicator pin would therefore presumably not be strictly necessary.

Since the Universal Translator presumably does not physically affect the process by which the user's vocal cords (or alien equivalent) forms audible speech (i.e. the user is nonetheless speaking in his/her/its own language regardless of the listener's language), the listener apparently hears only the speaker's translated words and not the alien language that the speaker is actually, physically articulating; the unfamiliar oratory is therefore not only translated but somehow replaced. The universal translator is often used in cases of contact with pre-warp societies such as in the Star Trek: The Next Generation episode "Who Watches the Watchers", and its detection could conceivably lead to a violation of the Prime Directive. Therefore, logically there must be some mechanism by which the lips of the speaker are perceived to be in sync with the words spoken. No explanation of the mechanics of this function appears to have been provided; the viewer is required to suspend disbelief enough to overcome the apparent limitation.

Non-fictional translators

Microsoft is developing its own translation technology, for incorporation into many of their software products and services. Most notably this includes real-time translation of video calls with Skype Translator. As of July 2019, Microsoft Translator supports over 65 languages and can translate video calls between English, French, German, Chinese (Mandarin), Italian, and Spanish.

In 2010, Google announced that it was developing a translator. Using a voice recognition system and a database, a robotic voice will recite the translation in the desired language. Google's stated aim is to translate the entire world's information. Roya Soleimani, a spokesperson for Google, said during a 2013 interview demonstrating the translation app on a smartphone, "You can have access to the world's languages right in your pocket... The goal is to become that ultimate Star Trek computer." The United States Army has also developed a two-way translator for use in Iraq. TRANSTAC (Spoken Language Communication and Translation System for Tactical Use), though, only focuses on Arabic-English translation. The United States Army has scrapped the TRANSTAC program and is developing, in conjunction with DARPA, BOLT (Broad Operational Language Translation) in its place.

In February 2010, a communications software provider called VoxOx launched a two-way translator service for instant messaging, SMS, email, and social media titled the VoxOx Universal Translator. It enables two people to communicate instantly with each other while both type in their native languages.

Friday, March 26, 2021

Proto-language

From Wikipedia, the free encyclopedia
 

Tree model of historical linguistics. The proto-languages stand at the branch points, or nodes: 15, 6, 20 and 7. The leaf languages, or end points, are 2, 5, 9 and 31. The root language is 15. By convention, the proto-languages are named Proto-5-9, Proto-2-5-9 and Proto-31, or Common 5-9, etc. The overall Ursprache has a proto-name reflecting the ordinary name of the entire family, such as Germanic, Italic, etc. The links between nodes indicate genetic descent. All the languages in the tree are related. Nodes 6 and 20 are the daughters of 15, their parent. Nodes 6 and 20 are sister languages. The leaf languages must be attested by some sort of documentation, even a lexical list of a few words. All the proto-languages are hypothetical, or reconstructed, languages; however, documentation is sometimes found that supports their former existence.

In the tree model of historical linguistics, a proto-language is a postulated language from which a number of attested languages are believed to have descended by evolution, forming a language family. Proto-languages are usually unattested, hypothetical or reconstructed. In the family tree metaphor, a proto-language can be called a mother language. Occasionally, the German term Ursprache (from Ur- "primordial, original" and Sprache "language", pronounced [ˈuːɐ̯ʃpʁaːxə]) is used instead.

In the strict sense, a proto-language is the most recent common ancestor of a language family, immediately before the family started to diverge into the attested daughter languages. It is therefore equivalent to the ancestral or parental language of a language family.

Moreover, a group of language varieties (such as a dialect cluster) that are not considered separate languages (for whatever reason) may also be described as descending from a unitary proto-language.
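As a purely illustrative aid, the family tree in the figure caption above can be written down as a small data structure. The parent/child assignments below are one reading consistent with that caption (node 15 as the root, nodes 6 and 20 as its daughters, node 7 as the ancestor of leaves 5 and 9); the exact pairing is an assumption made for illustration, not something stated in the article.

# Illustrative sketch only: one family tree consistent with the figure caption above.
# Node 15 is the root proto-language; 6, 20 and 7 are intermediate proto-languages;
# 2, 5, 9 and 31 are the attested leaf languages. The exact placement of node 7 and
# leaf 2 under node 6 is an assumption made for illustration.
FAMILY_TREE = {
    15: [6, 20],   # root: the Ursprache of the whole family
    6: [2, 7],     # "Proto-2-5-9"
    7: [5, 9],     # "Proto-5-9"
    20: [31],      # "Proto-31"
}

def leaves(node, tree=FAMILY_TREE):
    """Return the attested (leaf) languages descended from a given node."""
    children = tree.get(node)
    if not children:           # a node with no children is an attested language
        return [node]
    descendants = []
    for child in children:
        descendants.extend(leaves(child, tree))
    return descendants

print(leaves(15))  # every attested language in the family: [2, 5, 9, 31]
print(leaves(6))   # the languages descended from "Proto-2-5-9": [2, 5, 9]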

Definition and verification

Typically, the proto-language is not known directly. It is by definition a linguistic reconstruction formulated by applying the comparative method to a group of languages featuring similar characteristics. The tree is a statement of similarity and a hypothesis that the similarity results from descent from a common language.

The comparative method, a process of deduction, begins from a set of characteristics, or characters, found in the attested languages. If the entire set can be accounted for by descent from the proto-language, which must contain the proto-forms of them all, the tree, or phylogeny, is regarded as a complete explanation and, by Occam's razor, is given credibility. More recently such a tree has been termed "perfect" and the characters labelled "compatible".

No trees except the smallest branches are ever found to be perfect, in part because languages also evolve through horizontal transfer with their neighbours. Typically, credibility is given to the hypotheses of highest compatibility, and the differences in compatibility must be explained by various applications of the wave model. The level of completeness of the reconstruction achieved varies, depending on how complete the evidence from the descendant languages is and on the formulation of the characters by the linguists working on it. Not all characters are suitable for the comparative method. For example, lexical items that are loans from a different language do not reflect the phylogeny to be tested and, if used, will detract from the compatibility. Getting the right dataset for the comparative method is a major task in historical linguistics.
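To make the notions of "compatible" characters and a "perfect" tree concrete, here is a minimal Python sketch of the classical four-gamete test for binary characters: for such characters, an (unrooted) perfect phylogeny exists exactly when every pair of characters is compatible, i.e. no pair shows all four state combinations. The toy character matrix is invented for illustration and is not drawn from the article.

# Illustrative sketch: pairwise compatibility of binary characters (four-gamete test).
# For binary characters, a perfect phylogeny exists iff no pair of characters exhibits
# all four state combinations 00, 01, 10, 11. The toy data below is invented.
from itertools import combinations

# Rows = attested languages, columns = characters (1 = innovation present, 0 = absent).
CHARACTERS = {
    "lang_A": [1, 1, 0],
    "lang_B": [1, 0, 0],
    "lang_C": [0, 0, 1],
    "lang_D": [0, 0, 1],
}

def compatible(i, j, matrix):
    """Two binary characters are compatible if they do not show all four combinations."""
    observed = {(row[i], row[j]) for row in matrix.values()}
    return len(observed) < 4

def perfect_phylogeny_possible(matrix):
    """True if every pair of characters passes the four-gamete test."""
    n_characters = len(next(iter(matrix.values())))
    return all(compatible(i, j, matrix) for i, j in combinations(range(n_characters), 2))

print(perfect_phylogeny_possible(CHARACTERS))  # True: a perfect tree exists for this toy data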

Some universally accepted proto-languages are Proto-Indo-European, Proto-Uralic, and Proto-Dravidian.

In a few fortuitous instances, which have been used to verify the method and the model (and probably ultimately inspired it), a literary history exists from as early as a few millennia ago, allowing the descent to be traced in detail. The early daughter languages, and even the proto-language itself, may be attested in surviving texts. For example, Latin is the proto-language of the Romance language family, which includes such modern languages as French, Italian, Portuguese, Romanian, Catalan and Spanish. Likewise, Proto-Norse, the ancestor of the modern Scandinavian languages, is attested, albeit in fragmentary form, in the Elder Futhark. Although there are no very early Indo-Aryan inscriptions, the Indo-Aryan languages of modern India all go back to Vedic Sanskrit (or dialects very closely related to it), which has been preserved in texts accurately handed down by parallel oral and written traditions for many centuries.

The first person to offer systematic reconstructions of an unattested proto-language was August Schleicher; he did so for Proto-Indo-European in 1861.

Proto-X vs. Pre-X

Normally, the term "Proto-X" refers to the last common ancestor of a group of languages, occasionally attested but most commonly reconstructed through the comparative method, as with Proto-Indo-European and Proto-Germanic. An earlier stage of a single language X, reconstructed through the method of internal reconstruction, is termed "Pre-X", as in Pre–Old Japanese. It is also possible to apply internal reconstruction to a proto-language, obtaining a pre-proto-language, such as Pre-Proto-Indo-European.

Both prefixes are sometimes used for an unattested stage of a language without reference to comparative or internal reconstruction. "Pre-X" is sometimes also used for a postulated substratum, as in the Pre-Indo-European languages believed to have been spoken in Europe and South Asia before the arrival there of Indo-European languages.

When multiple historical stages of a single language exist, the oldest attested stage is normally termed "Old X" (e.g. Old English and Old Japanese). In other cases, such as Old Irish and Old Norse, the term refers to the language of the oldest known significant texts. Each of these languages has an older stage (Primitive Irish and Proto-Norse respectively) that is attested only fragmentarily.

Accuracy

There are no objective criteria for the evaluation of different reconstruction systems yielding different proto-languages. Many researchers concerned with linguistic reconstruction agree that the traditional comparative method is an "intuitive undertaking."

The bias of researchers regarding their accumulated implicit knowledge can also lead to erroneous assumptions and excessive generalization. Kortlandt (1993) offers several examples where such general assumptions concerning "the nature of language" hindered research in the field of historical linguistics. Linguists make personal judgements about what they consider "natural" for a language to change, and

"[as] a result, our reconstructions tend to have a strong bias toward the average language type known to the investigator."

Such investigators find themselves blinkered by their own linguistic frame of reference.

The advent of the wave model raised new issues in the domain of linguistic reconstruction, causing the reevaluation of old reconstruction systems and depriving the proto-language of its "uniform character". This is evident in Karl Brugmann's skepticism that reconstruction systems could ever reflect a linguistic reality. Ferdinand de Saussure went even further, completely rejecting any positive specification of the sound values of reconstruction systems.

In general, the issue of the nature of the proto-language remains unresolved, with linguists generally taking either the realist or the abstractionist position. Even widely studied proto-languages, such as Proto-Indo-European, have drawn criticism for being typological outliers with respect to their reconstructed phonemic inventory. Alternatives such as the glottalic theory, despite representing a typologically less rare system, have not gained wider acceptance, and some researchers have even suggested using indexes to represent the disputed series of plosives. At the other end of the spectrum, Pulgram (1959:424) suggests that Proto-Indo-European reconstructions are just "a set of reconstructed formulae" and "not representative of any reality". In the same vein, Julius Pokorny, in his study of Indo-European, claims that the linguistic term "IE parent language" is merely an abstraction that does not exist in reality and should be understood as consisting of dialects, possibly dating back to the Paleolithic era, which together formed the linguistic structure of the IE language group. In his view, Indo-European is solely a system of isoglosses binding together dialects spoken by various tribes, from which the historically attested Indo-European languages emerged.

Inequality (mathematics)

From Wikipedia, the free encyclopedia