Traditionally, computational linguistics was performed by computer scientists who had specialized in applying computers to the processing of natural language. Today, computational linguists often work as members of interdisciplinary teams, which can include linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, mathematicians, logicians, philosophers, cognitive scientists, cognitive psychologists, psycholinguists, anthropologists and neuroscientists, among others.
Computational linguistics has theoretical and applied components. Theoretical computational linguistics focuses on issues in theoretical linguistics and cognitive science, and applied computational linguistics focuses on the practical outcome of modeling human language use.
The Association for Computational Linguistics defines computational linguistics as:
...the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.
Origins
Computational linguistics is often grouped within the field of artificial intelligence, but it was present before the development of artificial intelligence. Computational linguistics originated with efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English. Since computers can make arithmetic calculations much faster and more accurately than humans, it was thought to be only a short matter of time before they could also begin to process language.
Computational and quantitative methods have also been used in historical linguistics, both to attempt the reconstruction of earlier forms of modern languages and to subgroup modern languages into language families. Earlier methods such as lexicostatistics and glottochronology have proved premature and inaccurate. However, recent interdisciplinary studies that borrow concepts from biology, especially gene mapping, have produced more sophisticated analytical tools and more trustworthy results.
When machine translation (also known as mechanical translation) failed to yield accurate translations right away, automated processing of human languages was recognized as far more complex than had originally been assumed. Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data. The term "computational linguistics" itself was coined by David Hays, a founding member of both the Association for Computational Linguistics and the International Committee on Computational Linguistics.
When artificial intelligence came into existence in the 1960s, the field of computational linguistics became the subdivision of artificial intelligence dealing with human-level comprehension and production of natural languages.
In order to translate one language into another, it was observed that one had to understand the grammar of both languages, including both morphology (the grammar of word forms) and syntax (the grammar of sentence structure). In order to understand syntax, one also had to understand the semantics and the lexicon (or 'vocabulary'), and even something of the pragmatics of language use. Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process natural languages using computers.
Nowadays research within the scope of computational linguistics is done at computational linguistics departments, computational linguistics laboratories, computer science departments, and linguistics departments.
Some research in the field of computational linguistics aims to create working speech or text processing systems, while other research aims to create systems that allow human-machine interaction. Programs meant for human-machine communication are called conversational agents.
Approaches
Just as computational linguistics can be performed by experts in a variety of fields and through a wide assortment of departments, so too can the research fields broach a diverse range of topics. The following sections discuss some of the literature available across the entire field, broken into four main areas of discourse: developmental linguistics, structural linguistics, linguistic production, and linguistic comprehension.
Developmental approaches
Language is a cognitive skill which develops throughout the life of an individual. This developmental process has been examined using a number of techniques, and a computational approach is one of them. Human language development imposes some constraints which make it harder to apply a computational method to understanding it. For instance, during language acquisition, human children are largely exposed only to positive evidence. This means that during the linguistic development of an individual, evidence is provided only for what is a correct form, and not for what is incorrect. This is insufficient information for a simple hypothesis testing procedure applied to something as complex as language, and so it places certain limits on a computational approach to modeling language development and acquisition in an individual.
Attempts have been made to model the developmental process of language acquisition in children from a computational angle, leading to both statistical grammars and connectionist models. Work in this realm has also been proposed as a method to explain the evolution of language through history. Using models, it has been shown that languages can be learned with a combination of simple input presented incrementally as the child develops better memory and a longer attention span. This was simultaneously posed as a reason for the long developmental period of human children. Both conclusions were drawn from the performance of the artificial neural network which the project created.
The ability of infants to develop language has also been modeled using robots in order to test linguistic theories. So that the robots could learn as children might, a model was created based on an affordance model in which mappings between actions, perceptions, and effects were created and linked to spoken words. Crucially, these robots were able to acquire functioning word-to-meaning mappings without needing grammatical structure, vastly simplifying the learning process and shedding light on information which furthers the current understanding of linguistic development. A computational approach made it possible to test this information empirically.
As our understanding of the linguistic development of an individual within a lifetime is continually improved using neural networks and learning robotic systems, it is also important to keep in mind that languages themselves change and develop through time. Computational approaches have also been applied to this phenomenon. Using the Price equation and Pólya urn dynamics, researchers have created a system which not only predicts future linguistic evolution, but also gives insight into the evolutionary history of modern-day languages. This modeling effort achieved, through computational linguistics, what would otherwise have been very difficult.
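As a rough illustration of Pólya urn dynamics, the following Python sketch treats competing linguistic variants as balls in an urn: each time a variant is used, one more copy of it is added, so it becomes slightly more likely to be used again. The function name `polya_urn`, the variant labels, and the starting counts are assumptions made for illustration. The sketch does not reproduce the researchers' system and leaves out the Price equation entirely.

```python
import random


def polya_urn(counts, steps, seed=0):
    """Simulate a Polya urn: draw a variant in proportion to its count,
    then add one more copy of the drawn variant."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = dict(counts)      # copy, so the caller's dict is not modified
    for _ in range(steps):
        variants = list(counts)
        weights = [counts[v] for v in variants]
        chosen = rng.choices(variants, weights=weights)[0]
        counts[chosen] += 1    # reinforcement: frequent variants get more frequent
    return counts


# Two hypothetical competing variants, starting on equal footing.
final = polya_urn({"variant_a": 1, "variant_b": 1}, steps=1000)
total = sum(final.values())
for variant, count in final.items():
    print(variant, round(count / total, 3))
```

Running the simulation with different seeds tends to give very different final proportions, which is the property that makes such models a candidate for describing how one variant can come to dominate without any built-in advantage.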
The understanding of linguistic development in humans, as well as throughout evolutionary time, has been substantially improved by advances in computational linguistics. The ability to model and modify systems at will affords science an ethical method of testing hypotheses that would otherwise be intractable.
Structural approaches
In order to create better computational models of language, an understanding of language's structure is crucial. To this end, the English language has been meticulously studied using computational approaches to better understand how the language works on a structural level. One of the most important prerequisites for studying linguistic structure is the availability of large linguistic corpora, or samples. These grant computational linguists the raw data necessary to run their models and gain a better understanding of the underlying structures present in the vast amount of data contained in any single language. One of the most cited English linguistic corpora is the Penn Treebank. Derived from widely different sources, such as IBM computer manuals and transcribed telephone conversations, this corpus contains over 4.5 million words of American English. It has been annotated primarily using part-of-speech tagging and syntactic bracketing and has yielded substantial empirical observations related to language structure.
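To make the two annotation layers concrete, the sketch below reads one bracketed sentence written in the general style of treebank annotation and recovers its word and part-of-speech pairs. The example sentence and the helper names `parse_brackets` and `tagged_words` are assumptions for illustration. Real corpus files follow further conventions that this minimal reader does not handle.

```python
def parse_brackets(text):
    """Turn a bracketed string into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token            # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())     # read the label, then each child
        pos += 1                    # skip the closing ")"
        return node

    return read()


def tagged_words(tree):
    """Collect (word, tag) pairs from the leaves of a parsed tree."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[1], tree[0])]  # a leaf looks like [tag, word]
    return [pair for child in tree[1:] for pair in tagged_words(child)]


# Made-up example sentence with syntactic brackets and part-of-speech tags.
sentence = "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"
tree = parse_brackets(sentence)
print(tagged_words(tree))
```

The nested lists keep the syntactic bracketing, while the leaf pairs carry the part-of-speech layer, so both kinds of annotation can be queried from the same structure.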
Theoretical approaches to the structure of languages have also been developed. These works give computational linguistics a framework within which to work out hypotheses that will further the understanding of language in a myriad of ways. One of the original theoretical theses on the internalization of grammar and the structure of language proposed two types of models. In these models, rules or patterns learned increase in strength with the frequency of their encounter. The work also created a question for computational linguists to answer: how does an infant learn a specific and non-normal grammar (Chomsky normal form) without learning an overgeneralized version and getting stuck? Theoretical efforts like these set the direction for research early in the lifetime of a field of study, and are crucial to the growth of the field.
Structural information about languages allows for the discovery and implementation of similarity recognition between pairs of text utterances. For instance, it has been shown that, based on the structural information present in patterns of human discourse, conceptual recurrence plots can be used to model and visualize trends in data and create reliable measures of similarity between natural textual utterances. This technique is a strong tool for further probing the structure of human discourse. Without the computational approach to this question, the vastly complex information present in discourse data would have been much harder for scientists to access.
Information regarding the structural data of a language is available for English as well as other languages, such as Japanese. Using computational methods, Japanese sentence corpora were analyzed and a pattern of log-normality was found in relation to sentence length. Though the exact cause of this log-normality remains unknown, it is precisely this sort of information which computational linguistics is designed to uncover. This information could lead to further important discoveries regarding the underlying structure of Japanese, and could have any number of effects on the understanding of Japanese as a language. Computational linguistics allows additions to the scientific knowledge base to happen quickly and on a firm empirical footing.
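A quantity is log-normally distributed when its logarithm is normally distributed, so a first check on sentence lengths is to take their logarithms and estimate a mean and standard deviation. The Python sketch below does only that. The function name `fit_lognormal`, the unit of length, and the sample values are assumptions for illustration and are not data from the Japanese corpus study.

```python
import math
import statistics


def fit_lognormal(lengths):
    """Estimate (mu, sigma) of a log-normal model from positive lengths.

    mu and sigma are the mean and population standard deviation
    of the natural logarithms of the lengths.
    """
    logs = [math.log(n) for n in lengths]
    return statistics.mean(logs), statistics.pstdev(logs)


# Invented sentence lengths (for example, in characters), not real corpus data.
sample_lengths = [12, 18, 25, 31, 40, 58, 95]
mu, sigma = fit_lognormal(sample_lengths)
print(f"mu={mu:.3f}, sigma={sigma:.3f}")
```

A fuller analysis would compare the histogram of the logged lengths with the fitted normal curve, or apply a formal goodness-of-fit test, before claiming log-normality.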
Without a computational approach to the structure of linguistic data, much of the information that is available now would still be hidden under the vastness of data within any single language. Computational linguistics allows scientists to parse huge amounts of data reliably and efficiently, creating the possibility for discoveries that most other approaches could not reach.
Production approaches
The production of language is equally complex, both in the information it provides and in the skills which a fluent producer must have. That is to say, comprehension is only half the problem of communication. The other half is how a system produces language, and computational linguistics has made interesting discoveries in this area.
In a now famous paper published in 1950, Alan Turing proposed the possibility that machines might one day have the ability to "think". As a thought experiment for what might define the concept of thought in machines, he proposed an "imitation game" in which a human subject has two text-only conversations, one with a fellow human and another with a machine attempting to respond like a human. Turing proposed that if the subject cannot tell the difference between the human and the machine, it may be concluded that the machine is capable of thought. Today this test is known as the Turing test and it remains an influential idea in the area of artificial intelligence.
One of the earliest and best known examples of a computer program designed to converse naturally with humans is the ELIZA program developed by Joseph Weizenbaum at MIT in 1966. The program emulated a Rogerian psychotherapist when responding to written statements and questions posed by a user. It appeared capable of understanding what was said to it and responding intelligently, but in truth it simply followed a pattern matching routine that relied on understanding only a few keywords in each sentence. Its responses were generated by recombining the unknown parts of the sentence around properly translated versions of the known words. For example, in the phrase "It seems that you hate me" ELIZA understands "you" and "me", which match the general pattern "you [some words] me". This allows ELIZA to change the words "you" and "me" to "I" and "you" and reply "What makes you think I hate you?". In this example ELIZA has no understanding of the word "hate", but such understanding is not required for a logical response in the context of this type of psychotherapy.
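The "you [some words] me" rule can be sketched in a few lines of Python. This is a minimal illustration of the keyword-and-template idea, not Weizenbaum's original script: the regular expression, the `REFLECTIONS` table, the fallback reply, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical pronoun swaps, applied word by word to the captured text.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "you": "I", "your": "my"}

# One illustrative rule: "you [some words] me".
YOU_ME = re.compile(r"\byou\s+(.+?)\s+me\b", re.IGNORECASE)


def reflect(fragment: str) -> str:
    """Swap first- and second-person words in a fragment."""
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())


def respond(sentence: str) -> str:
    """Build a reply around the words the program does not understand."""
    match = YOU_ME.search(sentence)
    if match is None:
        return "Please go on."  # fallback when no keyword pattern matches
    return f"What makes you think I {reflect(match.group(1))} you?"


print(respond("It seems that you hate me"))
# prints: What makes you think I hate you?
```

The word "hate" is never looked up anywhere: it is copied from the input into the reply, which is why the response can seem apt without any understanding of its meaning.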
Some projects are still trying to solve the problem which first started computational linguistics off as its own field. However, the methods have become more refined, and consequently the results generated by computational linguists have become more enlightening. In an effort to improve computer translation, several models have been compared, including hidden Markov models, smoothing techniques, and specific refinements of those to apply them to verb translation. The model which was found to produce the most natural translations of German and French words was a refined alignment model with a first-order dependence and a fertility model. The authors also provide efficient training algorithms for the models presented, which can give other scientists the ability to improve further on their results. This type of work is specific to computational linguistics, and has applications which could vastly improve understanding of how language is produced and comprehended by computers.
Work has also been done in making computers produce language in a more naturalistic manner. Algorithms have been constructed which are able to modify a system's style of production based on a factor such as linguistic input from a human, or on more abstract factors like politeness or any of the five main dimensions of personality. This work takes a computational approach via parameter estimation models to categorize the vast array of linguistic styles seen across individuals and simplify it for a computer to work in the same way, making human-computer interaction much more natural.
Text-based interactive approach
Many of the earliest and simplest models of human-computer interaction, such as ELIZA, involve text-based input from the user to generate a response from the computer. By this method, words typed by a user trigger the computer to recognize specific patterns and reply accordingly, through a process known as keyword spotting.
Speech-based interactive approach
Recent technologies have placed more of an emphasis on speech-based interactive systems. These systems, such as Siri on the iOS operating system, operate on a pattern-recognizing technique similar to that of text-based systems, but the user input is obtained through speech recognition. This branch of linguistics involves processing the user's speech as sound waves and interpreting the acoustics and language patterns in order for the computer to recognize the input.
Comprehension approaches
Much of the focus of modern computational linguistics is on comprehension. With the proliferation of the internet and the abundance of easily accessible written human language, the ability to create a program capable of understanding human language would have many broad applications, including improved search engines, automated customer service, and online education.
Early work in comprehension included applying Bayesian statistics to the task of optical character recognition, as illustrated by Bledsoe and Browning in 1959. In their approach, a large dictionary of possible letters was generated by "learning" from example letters, and then the probabilities that each of those learned examples matched the new input were combined to make a final decision.
Other attempts at applying Bayesian statistics to language analysis included the work of Mosteller and Wallace (1963), in which an analysis of the words used in The Federalist Papers was used to attempt to determine their authorship (concluding that Madison most likely authored the majority of the disputed papers).
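The general idea behind word-based authorship attribution can be sketched as a small naive Bayes comparison: count how often each candidate author uses a few common words, then ask which author makes the words of an unattributed text most probable. The Python example below is a simplified stand-in for that idea, not the procedure of the 1963 study. The author labels, word list, and counts are invented, and equal prior probabilities and add-one smoothing are assumptions.

```python
import math
from collections import Counter

# Invented training counts of a few function words per author (not real data).
TRAINING = {
    "author_a": Counter({"upon": 30, "while": 20, "whilst": 1, "on": 49}),
    "author_b": Counter({"upon": 2, "while": 3, "whilst": 15, "on": 80}),
}
VOCAB = sorted({w for counts in TRAINING.values() for w in counts})


def log_likelihood(words, counts):
    """Sum of log P(word | author), with add-one smoothing over VOCAB."""
    total = sum(counts.values()) + len(VOCAB)
    return sum(math.log((counts[w] + 1) / total) for w in words if w in VOCAB)


def most_likely_author(text: str) -> str:
    """Pick the author that maximizes the likelihood, assuming equal priors."""
    words = text.lower().split()
    return max(TRAINING, key=lambda a: log_likelihood(words, TRAINING[a]))


print(most_likely_author("whilst on the road whilst"))
# prints: author_b
```

Words outside the small vocabulary are ignored, so the decision rests entirely on habits of function-word use rather than on the topic of the text.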
In 1971 Terry Winograd developed an early natural language processing engine capable of interpreting naturally written commands within a simple rule-governed environment. The primary language parsing program in this project was called SHRDLU, which was capable of carrying out a somewhat natural conversation with the user giving it commands, but only within the scope of the toy environment designed for the task. This environment consisted of blocks of different shapes and colors, and SHRDLU was capable of interpreting commands such as "Find a block which is taller than the one you are holding and put it into the box." and of requesting clarification with responses such as "I don't understand which pyramid you mean." in response to the user's input. While impressive, this kind of natural language processing has proven much more difficult outside the limited scope of the toy environment. Similarly, a project developed for NASA called LUNAR was designed to provide answers to naturally written questions about the geological analysis of lunar rocks returned by the Apollo missions. These kinds of problems are referred to as question answering.
Initial attempts at understanding spoken language were based on work done in the 1960s and 1970s in signal modeling, where an unknown signal is analyzed to look for patterns and to make predictions based on its history. An initial and somewhat successful approach to applying this kind of signal modeling to language was achieved with the use of hidden Markov models, as detailed by Rabiner in 1989. This approach attempts to determine probabilities for the arbitrary number of models that could be generating the speech, as well as modeling the probabilities for various words generated from each of these possible models. Similar approaches were employed in early speech recognition attempts starting in the late 1970s at IBM using word/part-of-speech pair probabilities.
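A hidden Markov model pairs hidden states with observed symbols, and the Viterbi algorithm is one standard way to recover the most probable state sequence. The Python sketch below applies it to word and part-of-speech pairs. It is a toy illustration under stated assumptions: the two tags, the vocabulary, and every probability are invented, and it is not a reconstruction of the systems mentioned above.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s].get(first, 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Invented toy parameters: two tags and a four-word vocabulary.
STATES = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"bark": 0.6, "chase": 0.4},
}

print(viterbi(["dogs", "bark"], STATES, START_P, TRANS_P, EMIT_P))
# prints: ['NOUN', 'VERB']
```

In the example, "bark" could be either tag, and the transition probability from NOUN to VERB is what tips the decision. A practical implementation would work with log probabilities to avoid numerical underflow on long sequences.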
More recently these kinds of statistical approaches have been
applied to more difficult tasks such as topic identification using
Bayesian parameter estimation to infer topic probabilities in text
documents.
Applications
Modern computational linguistics is often a combination of studies in computer science and programming, mathematics (particularly statistics), language structures, and natural language processing. Combined, these fields most often lead to the development of systems that can recognize speech and perform some task based on that speech. Examples include speech recognition software, such as Apple's Siri feature; spellcheck tools; speech synthesis programs, which are often used to demonstrate pronunciation or help the disabled; and machine translation programs and websites, such as Google Translate.
Computational linguistics can be especially helpful in situations involving social media and the Internet. For example, filters in chatrooms or on website searches require computational linguistics. Chat operators often use filters to identify certain words or phrases and deem them inappropriate so that users cannot submit them. Filters are also used on websites: schools use them so that websites containing certain keywords are blocked from view by children, and many programs let parents put content filters in place through parental controls. Computational linguists can also develop programs that group and organize content through social media mining. An example of this is Twitter, in which programs can group tweets by subject or keywords.
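A keyword filter of the kind described can be sketched as a word-level check against a blocklist. The Python example below is a minimal illustration. The blocked terms and the function name `is_allowed` are hypothetical, and deployed filters typically need to handle misspellings, phrases, and context, which this sketch does not.

```python
import re

# Hypothetical blocklist. A real deployment would load its own list.
BLOCKED_TERMS = {"spam", "scam"}


def is_allowed(message: str) -> bool:
    """Return False if any whole word of the message is on the blocklist."""
    words = re.findall(r"[a-z']+", message.lower())
    return not any(word in BLOCKED_TERMS for word in words)


print(is_allowed("Hello there"))        # prints: True
print(is_allowed("This is a SCAM!"))    # prints: False
print(is_allowed("Scampi for dinner"))  # prints: True
```

Matching whole words rather than substrings avoids rejecting harmless words that merely contain a blocked term, at the cost of missing deliberate variations.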
Computational linguistics is also used for document retrieval and clustering. When a user performs an online search, documents and websites are retrieved based on the frequency of unique labels related to what was typed into the search engine. For instance, if a user searches for "red, large, four-wheeled vehicle" with the intention of finding pictures of a red truck, the search engine can still find the desired information by matching words such as "four-wheeled" with "car".
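Frequency-based retrieval can be illustrated by scoring each document by how often the query terms occur in it. The Python sketch below is a deliberately simple version of that idea. The document names, their texts, and the functions `tokenize` and `rank` are assumptions for illustration, and the scoring only matches literal terms: relating "four-wheeled" to "car" would need extra resources such as a synonym list, which are not shown.

```python
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase the text and split it into words, keeping hyphenated terms."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def rank(query: str, documents: dict):
    """Order document names by total frequency of the query terms."""
    terms = set(tokenize(query))
    scores = {}
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        scores[name] = sum(counts[t] for t in terms)
    # Highest score first, with ties broken alphabetically by name.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Made-up miniature collection.
DOCUMENTS = {
    "truck_ad": "large red truck, a four-wheeled vehicle",
    "bike_ad": "red bike with two wheels",
    "recipe": "tomato soup",
}

print(rank("red, large, four-wheeled vehicle", DOCUMENTS))
# prints: ['truck_ad', 'bike_ad', 'recipe']
```

Search engines in practice weight terms by how rare they are across the collection and combine many other signals, so raw counts should be read only as the starting point of the technique.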
Subfields
Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual, and upon the task being performed, whether analyzing language (recognition) or synthesizing language (generation). Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers. Parsing and generation are subdivisions of computational linguistics dealing respectively with taking language apart and putting it together. Machine translation remains the subdivision of computational linguistics dealing with having computers translate between languages. Fully automatic, reliably accurate translation, however, has yet to be realized, and machine translation remains a notoriously hard branch of computational linguistics.
Some of the areas of research that are studied by computational linguistics include:
- Computational complexity of natural language, largely modeled on automata theory, with the application of context-sensitive grammar and linearly bounded Turing machines
- Computational semantics, which comprises defining suitable logics for linguistic meaning representation, automatically constructing them, and reasoning with them
- Computer-aided corpus linguistics, which has been used since the 1970s as a way to make detailed advances in the field of discourse analysis
- Design of parsers or chunkers for natural languages
- Design of taggers, such as part-of-speech (POS) taggers
- Machine translation, one of the earliest and most difficult applications of computational linguistics, which draws on many subfields
- Simulation and study of language evolution in historical linguistics/glottochronology
Legacy
The subject of computational linguistics has had a recurring impact on popular culture:
- The 1983 film WarGames features a young computer hacker who interacts with an artificially intelligent supercomputer.
- A 1997 film, Conceiving Ada, focuses on Ada Lovelace, considered one of the first computer scientists, as well as themes of computational linguistics.
- Her, a 2013 film, depicts a man's interactions with the "world's first artificially intelligent operating system."
- The 2014 film The Imitation Game follows the life of computer scientist Alan Turing, developer of the Turing test.
- The 2015 film Ex Machina centers on human interaction with artificial intelligence.