
Saturday, November 24, 2018

Standard Generalized Markup Language

From Wikipedia, the free encyclopedia

Filename extension: .sgml
Internet media type: application/sgml, text/sgml
Uniform Type Identifier (UTI): public.xml
Developed by: ISO
Type of format: Markup language
Extended from: GML
Extended to: HTML, XML
Standard: ISO 8879

The Standard Generalized Markup Language (SGML; ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 defines generalized markup.

Generalized markup is based on two postulates:
  • Markup should be declarative: it should describe a document's structure and other attributes, rather than specify the processing to be performed on it. Declarative markup is less likely to conflict with unforeseen future processing needs and techniques;
  • Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and databases can be used for processing documents as well.
HTML was theoretically an example of an SGML-based language until HTML 5, which browsers cannot parse as SGML for compatibility reasons.

DocBook SGML and LinuxDoc are examples which were used almost exclusively with actual SGML tools.

Standard versions

SGML is an ISO standard: "ISO 8879:1986 Information processing – Text and office systems – Standard Generalized Markup Language (SGML)", of which there are three versions:
  1. Original SGML, which was accepted in October 1986, followed by a minor Technical Corrigendum;
  2. SGML (ENR), in 1996, resulted from a Technical Corrigendum to add extended naming rules allowing arbitrary-language and -script markup;
  3. SGML (ENR+WWW or WebSGML), in 1998, resulted from a Technical Corrigendum to better support XML and WWW requirements.
SGML is part of a trio of enabling ISO standards for electronic documents developed by ISO/IEC JTC1/SC34 (ISO/IEC Joint Technical Committee 1, Subcommittee 34 – Document description and processing languages):
  • SGML (ISO 8879)—Generalized markup language
    • SGML was reworked in 1998 into XML, a successful profile of SGML. Full SGML is rarely found or used in new projects;
  • DSSSL (ISO/IEC 10179)—Document processing and styling language based on Scheme.
  • HyTime—Generalized hypertext and scheduling.
    • HyTime was partially reworked into W3C XLink. HyTime is rarely used in new projects.
SGML is supported by various technical reports, in particular:
  • ISO/IEC TR 9573 – Information processing – SGML support facilities – Techniques for using SGML
    • Part 13: Public entity sets for mathematics and science
      • In 2007, the W3C MathML working group agreed to assume the maintenance of these entity sets.

History

SGML descended from IBM's Generalized Markup Language (GML), which Charles Goldfarb, Edward Mosher, and Raymond Lorie developed in the 1960s. Goldfarb, editor of the international standard, coined the “GML” term using their surname initials. Goldfarb also wrote the definitive work on SGML syntax in "The SGML Handbook". The syntax of SGML, however, more closely resembles the COCOA format. As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many such documents must remain readable for several decades—a long time in the information technology field. SGML was also extensively applied by the military and by the aerospace, technical reference, and industrial publishing industries. The advent of the XML profile has made SGML suitable for widespread application for small-scale, general-purpose use.

A fragment of the Oxford English Dictionary (1985), showing SGML markup

Document validity

SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of ISO 8879 (from the public draft):
A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.
A type-valid SGML document is defined by the standard as:
An SGML document in which, for each document instance, there is an associated document type declaration (DTD) to whose DTD that instance conforms.
A tag-valid SGML document is defined by the standard as:
An SGML document, all of whose document instances are fully tagged. There need not be a document type declaration associated with any of the instances. Note: If there is a document type declaration, the instance can be parsed with or without reference to it.
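As a rough illustration (the note element and its DTD are invented here, not taken from the standard): a type-valid instance conforms to an associated DTD, while a tag-valid instance merely has every element explicitly tagged, with or without a document type declaration. The two minimal documents below sketch the distinction:

<!-- Type-valid: the instance conforms to the DTD in its document type declaration -->
<!DOCTYPE note [
  <!ELEMENT note - - (#PCDATA)>
]>
<note>Remember the milk.</note>

<!-- Tag-valid: every element is fully tagged; no document type declaration is needed (WebSGML) -->
<note>Remember the milk.</note>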

Terminology

Tag-validity was introduced in SGML (ENR+WWW) to support XML which allows documents with no DOCTYPE declaration but which can be parsed without a grammar or documents which have a DOCTYPE declaration that makes no XML Infoset contributions to the document. The standard calls this fully tagged. Integrally stored reflects the XML requirement that elements end in the same entity in which they started. Reference-free reflects the HTML requirement that entity references are for special characters and do not contain markup. SGML validity commentary, especially commentary that was made before 1997 or that is unaware of SGML (ENR+WWW), covers type-validity only.
The SGML emphasis on validity supports the requirement for generalized markup that markup should be rigorous. (ISO 8879 A.1)

Syntax

An SGML document may have three parts:
  1. the SGML Declaration;
  2. the Prologue, containing a DOCTYPE declaration with the various markup declarations that together make a Document Type Definition (DTD); and
  3. the instance itself, containing one top-most element and its contents.
An SGML document may be composed from many entities (discrete pieces of text). In SGML, the entities and element types used in the document may be specified with a DTD; the different character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the concrete syntax of the document.
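As a rough sketch (element names invented for illustration), parts 2 and 3 might look like the following; the SGML Declaration (part 1) is usually fixed by the system or application rather than written per document, so it is omitted here:

<!-- Prologue: document type declaration containing the DTD -->
<!DOCTYPE letter [
  <!ELEMENT letter   - - (greeting, body)>
  <!ELEMENT greeting - - (#PCDATA)>
  <!ELEMENT body     - - (#PCDATA)>
]>
<!-- Instance: one top-most element and its contents -->
<letter>
<greeting>Dear reader,</greeting>
<body>A document may be split across several entities.</body>
</letter>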

Although full SGML allows implicit markup and some other kinds of tags, the XML specification (s4.3.1) states:
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.
For introductory information on a basic, modern SGML syntax, see XML. The following material concentrates on features not in XML and is not a comprehensive summary of SGML syntax.

Optional features

SGML generalizes and supports a wide range of markup languages as found in the mid-1980s. These ranged from terse Wiki-like syntaxes to RTF-like bracketed languages to HTML-like matching-tag languages. SGML did this by a relatively simple default reference concrete syntax augmented with a large number of optional features that could be enabled in the SGML Declaration. Not every SGML parser can necessarily process every SGML document. Because each processor's System Declaration can be compared to the document's SGML Declaration, it is always possible to know whether a document is supported by a particular processor.

Many SGML features relate to markup minimization. Other features relate to concurrent (parallel) markup (CONCUR), to linking processing attributes (LINK), and to embedding SGML documents within SGML documents (SUBDOC).
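For orientation, here is a sketch of just the FEATURES portion of an SGML Declaration (all other parts elided with "..."), with settings resembling those used for HTML 4: OMITTAG and SHORTTAG enabled, CONCUR, SUBDOC, and the LINK features disabled. This is illustrative only, not a complete declaration:

<!SGML "ISO 8879:1986"
  ...
  FEATURES
    MINIMIZE
      DATATAG  NO
      OMITTAG  YES
      RANK     NO
      SHORTTAG YES
    LINK
      SIMPLE   NO
      IMPLICIT NO
      EXPLICIT NO
    OTHER
      CONCUR   NO
      SUBDOC   NO
      FORMAL   YES
  ...
>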

The notion of customizable features was not appropriate for Web use, so one goal of XML was to minimize optional features. However, XML's well-formedness rules cannot support Wiki-like languages, leaving them unstandardized and difficult to integrate with non-text information systems.

Concrete and abstract syntaxes

The usual (default) SGML concrete syntax resembles this example, which is the default HTML concrete syntax:

 TYPE="example">
  typically something like this

SGML provides an abstract syntax that can be implemented in many different types of concrete syntax. Although the norm is to use angle brackets as start- and end-tag delimiters in an SGML document (per the standard-defined reference concrete syntax), it is possible to use other characters—provided a suitable concrete syntax is defined in the document's SGML declaration. For example, an SGML interpreter might be programmed to parse GML, wherein the tags are delimited with a left colon and a right full stop, thus, an :e prefix denotes an end tag: :xmp.Hello, world:exmp.. According to the reference syntax, letter-case (upper- or lower-) is not distinguished in tag names, thus the three tags (i) <quote>, (ii) <QUOTE>, and (iii) <quOtE> are equivalent. (NOTE: A concrete syntax might change this rule via the NAMECASE NAMING declarations.)
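For example, the same content under the two concrete syntaxes discussed above might be written:

Reference concrete syntax:   <xmp>Hello, world</xmp>
GML-like concrete syntax:    :xmp.Hello, world:exmp.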

Markup minimization

SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML processors need not support every available feature, thus allowing applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid structures. XML is intolerant of syntax omissions, and does not require a DTD for checking well-formedness.

OMITTAG

Both start tags and end tags may be omitted from a document instance, provided:
  • The OMITTAG feature is enabled in the SGML Declaration;
  • The DTD indicates that the tags are permitted to be omitted;
  • (for start tags) the element has no associated required (#REQUIRED) attributes; and
  • The tag can be unambiguously inferred by context.
For example, if OMITTAG YES is specified in the SGML Declaration (enabling the OMITTAG feature), and the DTD includes the following declarations:

<!ELEMENT chapter - - (title, section+)>
<!ELEMENT title o o (#PCDATA)>
<!ELEMENT section - - (title, subsection+)>
then this excerpt:
<chapter>Introduction to SGML
<section>The SGML Declaration
...

which omits two <title> tags and two </title> tags, would represent valid markup.
Note also that omitting tags is optional – the same excerpt could be tagged like this:

<chapter><title>Introduction to SGML</title>
<section><title>The SGML Declaration</title>
...

and would still represent valid markup.

Note: The OMITTAG feature is unrelated to the tagging of elements whose declared content is EMPTY as defined in the DTD:

<!ELEMENT image - o EMPTY>

Elements defined like this have no end tag, and specifying one in the document instance would result in invalid markup. In this regard they differ syntactically from XML's empty elements.

SHORTREF

Tags can be replaced with delimiter strings, for a terser markup, via the SHORTREF feature. This markup style is now associated with wiki markup, e.g. wherein two equals-signs (==), at the start of a line, are the “heading start-tag”, and two equals signs (==) after that are the “heading end-tag”.
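A rough sketch of how such a mapping could be declared follows; the names are invented here, and the "==" string would also need to be admitted as a short-reference delimiter in the document's concrete syntax. It uses the same declaration shapes as the line-end example later in this section:

<!-- "==" is replaced by the entity h2-start, whose replacement text supplies the start tag -->
<!ENTITY   h2-start  "<h2>">
<!SHORTREF wikilike  "==" h2-start>
<!USEMAP   wikilike  chapter>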

SHORTTAG

SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks—either double " " (LIT) or single ' ' (LITA)—so that the previous markup example could be written:

<QUOTE TYPE=example>
  typically something like this</>


One feature of SGML markup languages is the "presumptuous empty tagging", such that the empty end tag </> in <QUOTE>this</> "inherits" its value from the nearest previous full start tag, which, in this example, is <QUOTE> (in other words, it closes the most recently opened item). The expression <QUOTE>this</> is thus equivalent to <QUOTE>this</QUOTE>.

NET

Another feature is the NET (Null End Tag) construction: <QUOTE/this/, which is structurally equivalent to <QUOTE>this</QUOTE>.

Other features

Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags:

<QUOTE></QUOTE>

can be written as

<QUOTE//

wherein the first slash ( / ) stands for the NET-enabling “start-tag close” (NESTC), and the second slash stands for the NET. NOTE: XML defines NESTC with a /, and NET with an > (angled bracket)—hence the corresponding construct in XML appears as <QUOTE/>.

The third feature is 'text on the same line', allowing a markup item to be ended with a line-end; especially useful for headings and such, it requires either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:

<!ELEMENT  lines      (line*)>
<!ELEMENT  line       O - (#PCDATA)>
<!ENTITY   line-tagc  "</line>">
<!SHORTREF one-line   "&#RE;&#RS;" line-tagc>
<!USEMAP   one-line   line>
(and "&#RE;&#RS;" is a short-reference delimiter in the concrete syntax), then:
 

first line
second line

is equivalent to:

<lines>
<line>first line</line>
<line>second line</line>
</lines>

Formal characterization

SGML has many features that defied convenient description with the popular formal automata theory and the contemporary parser technology of the 1980s and the 1990s. The standard warns in Annex H:
The SGML model group notation was deliberately designed to resemble the regular expression notation of automata theory, because automata theory provides a theoretical foundation for some aspects of the notion of conformance to a content model. No assumption should be made about the general applicability of automata to content models.
A report on an early implementation of a parser for basic SGML, the Amsterdam SGML Parser, notes
the DTD-grammar in SGML must conform to a notion of unambiguity which closely resembles the LL(1) conditions
and specifies various differences.

There appears to be no definitive classification of full SGML against a known class of formal grammar. Plausible classes may include tree-adjoining grammars and adaptive grammars.

XML is described as being generally parsable like a two-level grammar for non-validated XML and a Conway-style pipeline of coroutines (lexer, parser, validator) for valid XML. The SGML productions in the ISO standard are reported to be LL(3) or LL(4). XML-class subsets are reported to be expressible using a W-grammar. According to one paper, and probably considered at an information set or parse tree level rather than a character or delimiter level:
The class of documents that conform to a given SGML document grammar forms an LL(1) language. … The SGML document grammars by themselves are, however, not LL(1) grammars.
The SGML standard does not define SGML with formal data structures, such as parse trees; however, an SGML document is constructed from a rooted directed acyclic graph (RDAG) of physical storage units known as “entities”, which is parsed into an RDAG of structural units known as “elements”. The physical graph is loosely characterized as an entity tree, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an element tree, but the ID/IDREF markup allows arbitrary arcs.

The results of parsing can also be understood as a data tree in different notations; where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.

The SGML standard describes it in terms of maps and recognition modes (s9.6.1). Each entity, and each element, can have an associated notation or declared content type, which determines the kinds of references and tags which will be recognized in that entity and element. Also, each element can have an associated delimiter map (and short reference map), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.

Parsing involves traversing the dynamically-retrieved entity graph, finding/implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively — to recognize lexical structures, and actively — to generate missing structures and tags that the DTD has declared optional. End- and start- tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single, possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.

SGML uses the term validation for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML without a DTD (e.g. simple XML), is a grammar or a language; SGML with a DTD is a metalanguage. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism is a metalanguage.

SGML has an abstract syntax implemented by many possible concrete syntaxes, however, this is not the same usage as in an abstract syntax tree and as in a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.

Derivatives

XML

The W3C XML (Extensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the World Wide Web. In addition to disabling many SGML options present in the reference syntax (such as omitting tags and nested subdocuments) XML adds a number of additional restrictions on the kinds of SGML syntax. For example, despite enabling SGML shortened tag forms, XML does not allow unclosed start or end tags. It also relied on many of the additions made by the WebSGML Annex. XML currently is more widely used than full SGML. XML has lightweight internationalization based on Unicode. Applications of XML include XHTML, XQuery, XSLT, XForms, XPointer, JSP, SVG, RSS, Atom, XML-RPC, RDF/XML, and SOAP.
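For example, the QUOTE fragment used in the syntax section above, rewritten as well-formed XML, needs a quoted attribute value and an explicit end tag (a sketch, not a complete application):

<?xml version="1.0"?>
<!-- XML forbids the SGML minimizations shown earlier (the unquoted TYPE=example, the empty end tag </>) -->
<QUOTE TYPE="example">typically something like this</QUOTE>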

HTML

While HTML was developed partially independently and in parallel with SGML, its creator, Tim Berners-Lee, intended it to be an application of SGML. The design of HTML (Hyper Text Markup Language) was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application, however, the HTML markup language has many legacy- and exception- handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 – SGML.

The charter for the 2006 revival of the World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'". Although HTML syntax closely resembles SGML syntax with the default reference concrete syntax, HTML5 abandons any attempt to define HTML as an SGML application, explicitly defining its own parsing rules, which more closely match existing implementations and documents. It does, however, define an alternative XHTML serialization, which conforms to XML and therefore to SGML as well.
 

OED

The second edition of the Oxford English Dictionary (OED) is entirely marked up with an SGML-based markup language.

The third edition is marked up as XML.

Others

Other document markup languages are partly related to SGML and XML, but — because they cannot be parsed or validated or otherwise processed using standard SGML and XML tools — they are not considered either SGML or XML languages; the Z Format markup language for typesetting and documentation is an example.

Several modern programming languages support tags as primitive token types, or now support Unicode and regular expression pattern-matching. An example is the Scala programming language.

Applications

Document markup languages defined using SGML are called "applications" by the standard; many pre-XML SGML applications were the proprietary property of the organizations which developed them, and thus unavailable on the World Wide Web. The following list is of pre-XML SGML applications.
  • TEI (Text Encoding Initiative) is an academic consortium that designs, maintains, and develops technical standards for digital-format textual representation applications.
  • DocBook is a markup language originally created as an SGML application, designed for authoring technical documentation; DocBook currently is an XML application.
  • CALS (Continuous Acquisition and Life-cycle Support) is a US Department of Defense (DoD) initiative for electronically capturing military documents and for linking related data and information.
  • HyTime defines a set of hypertext-oriented element types that allow SGML document authors to build hypertext and multimedia presentations.
  • EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) system effects automated collection, validation, indexing, acceptance, and forwarding of submissions, by companies and others, who are legally required to file data and information forms with the US Securities and Exchange Commission (SEC).
  • LinuxDoc. Documentation for Linux packages has used the LinuxDoc SGML DTD and Docbook XML DTD.
  • AAP DTD is a document type definition for scientific documents, defined by the Association of American Publishers.
  • SGMLguid was an early SGML document type definition created, developed and used at CERN.

Open-source implementations

Significant open-source implementations of SGML have included:
  • ASP-SGML
  • ARC-SGML, by the SGML Users' Group, 1991, C language
  • SGMLS, by James Clark, 1993, C language
  • Project YAO, by Yuan-ze Institute of Technology, Taiwan, with Charles Goldfarb, 1994, object
  • SP by James Clark, C++ language
SP and Jade, the associated DSSSL processors, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class, in Sun Microsystems' implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.

Markup language

From Wikipedia, the free encyclopedia

Example of RecipeBook, a simple language based on XML for creating recipes. The markup can be converted to HTML, PDF and Rich Text Format using a programming language or XSL.

In computer text processing, a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text. The idea and terminology evolved from the "marking up" of paper manuscripts, i.e., the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts. In digital media, this "blue pencil instruction text" was replaced by tags; that is, instructions are expressed directly by tags or by "instruction text encapsulated by tags". The whole idea of a markup language is to spare the author explicit formatting work: the tags identify the relevant parts of the text (such as a heading or the beginning of a paragraph) so that each tagged part can be formatted appropriately.

Examples include typesetting instructions such as those found in troff, TeX and LaTeX, or structural markers such as XML tags. Markup instructs the software that displays the text to carry out appropriate actions, but is omitted from the version of the text that users see.

Some markup languages, such as the widely used HTML, have pre-defined presentation semantics—meaning that their specification prescribes how to present the structured data. Others, such as XML, do not have them and are general purpose.
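For instance (the XML element below is invented for illustration): HTML's h1 comes with presentation semantics defined by its specification, while a general-purpose XML element has none until a stylesheet or application supplies them.

<!-- HTML: the specification prescribes how a first-level heading is normally presented -->
<h1>Anatidae</h1>

<!-- XML: purely structural; presentation is defined elsewhere, e.g. by CSS or XSLT -->
<reading unit="celsius">21.5</reading>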

HyperText Markup Language (HTML), one of the document formats of the World Wide Web, is an instance of Standard Generalized Markup Language or SGML, and follows many of the markup conventions used in the publishing industry in the communication of printed work between authors, editors, and printers.

Etymology

The term markup is derived from the traditional publishing practice of "marking up" a manuscript, which involves adding handwritten annotations in the form of conventional symbolic printer's instructions in the margins and text of a paper manuscript or printed proof. For centuries, this task was done primarily by skilled typographers known as "markup men" or "copy markers", who marked up text to indicate what typeface, style, and size should be applied to each part, and then passed the manuscript to others for typesetting by hand. Markup was also commonly applied by editors, proofreaders, publishers, and graphic designers, and indeed by document authors.

Types of markup language

There are three main general categories of electronic markup:
Presentational markup
The kind of markup used by traditional word-processing systems: binary codes embedded within document text that produce the WYSIWYG ("what you see is what you get") effect. Such markup is usually hidden from human users, even authors or editors.
Procedural markup
Markup is embedded in text and provides instructions for programs that are to process the text. Well-known examples include troff, TeX, and PostScript. It is expected that the processor will run through the text from beginning to end, following the instructions as encountered. Text with such markup is often edited with the markup visible and directly manipulated by the author. Popular procedural-markup systems usually include programming constructs, so macros or subroutines can be defined and invoked by name.
Descriptive markup
Markup is used to label parts of the document rather than to provide specific instructions as to how they should be processed. Well-known examples include LaTeX, HTML, and XML. The objective is to decouple the inherent structure of the document from any particular treatment or rendition of it. Such markup is often described as "semantic". An example of descriptive markup would be HTML's <cite> tag, which is used to label a citation. Descriptive markup—sometimes called logical markup or conceptual markup—encourages authors to write in a way that describes the material conceptually, rather than visually.
There is considerable blurring of the lines between the types of markup. In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations. The programming in procedural-markup systems such as TeX may be used to create higher-level markup systems that are more descriptive, such as LaTeX.

In recent years, a number of small and largely unstandardized markup languages have been developed to allow authors to create formatted text via web browsers, for use in wikis and web forums. These are sometimes called lightweight markup languages. Markdown or the markup language used by Wikipedia are examples of such wiki markup.

History of markup languages

GenCode

The first well-known public presentation of markup languages in computer text processing was made by William W. Tunnicliffe at a conference in 1967, although he preferred to call it generic coding. It can be seen as a response to the emergence of programs such as RUNOFF that each used their own control notations, often specific to the target typesetting device. In the 1970s, Tunnicliffe led the development of a standard called GenCode for the publishing industry and later was the first chair of the International Organization for Standardization committee that created SGML, the first standard descriptive markup language. Book designer Stanley Rice published speculation along similar lines in 1970. Brian Reid, in his 1980 dissertation at Carnegie Mellon University, developed the theory and a working implementation of descriptive markup in actual use.

However, IBM researcher Charles Goldfarb is more commonly seen today as the "father" of markup languages. Goldfarb hit upon the basic idea while working on a primitive document management system intended for law firms in 1969, and helped invent IBM GML later that same year. GML was first publicly disclosed in 1973.

In 1975, Goldfarb moved from Cambridge, Massachusetts to Silicon Valley and became a product planner at the IBM Almaden Research Center. There, he convinced IBM's executives to deploy GML commercially in 1978 as part of IBM's Document Composition Facility product, and it was widely used in business within a few years.

SGML, which was based on both GML and GenCode, was developed by Goldfarb in 1974. Goldfarb eventually became chair of the SGML committee. SGML was first released by ISO as the ISO 8879 standard in October 1986.

troff and nroff

Some early examples of computer markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. It was a trial and error iterative process to get a document printed correctly. Availability of WYSIWYG ("what you see is what you get") publishing software supplanted much use of these languages among casual users, though serious publishing work still uses markup to specify the non-visual structure of texts, and WYSIWYG editors now usually save documents in a markup-language-based format.

TeX

Another major publishing standard is TeX, created and refined by Donald Knuth in the 1970s and '80s. TeX concentrated on detailed layout of text and font descriptions to typeset mathematical books. This required Knuth to spend considerable time investigating the art of typesetting. TeX is mainly used in academia, where it is a de facto standard in many scientific disciplines. A TeX macro package known as LaTeX provides a descriptive markup system on top of TeX, and is widely used.

Scribe, GML and SGML

The first language to make a clean distinction between structure and presentation was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980. Scribe was revolutionary in a number of ways, not least that it introduced the idea of styles separated from the marked up document, and of a grammar controlling the usage of descriptive elements. Scribe influenced the development of Generalized Markup Language (later SGML) and is a direct ancestor to HTML and LaTeX.

In the early 1980s, the idea that markup should be focused on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.

SGML specified a syntax for including the markup in documents, as well as one for separately describing what tags were allowed, and where (the Document Type Definition (DTD) or schema). This allowed authors to create and use any markup they wished, selecting tags that made the most sense to them and were named in their own natural languages. Thus, SGML is properly a meta-language, and many particular markup languages are derived from it. From the late '80s on, most substantial new markup languages have been based on the SGML system, including for example TEI and DocBook. SGML was promulgated as an International Standard by the International Organization for Standardization, ISO 8879, in 1986.

SGML found wide acceptance and use in fields with very large-scale documentation requirements. However, many found it cumbersome and difficult to learn—a side effect of its design attempting to do too much and be too flexible. For example, SGML made end tags (or start-tags, or even both) optional in certain contexts, because its developers thought markup would be done manually by overworked support staff who would appreciate saving keystrokes.

HTML

In 1989, computer scientist Sir Tim Berners-Lee wrote a memo proposing an Internet-based hypertext system, then specified HTML and wrote the browser and server software in the last part of 1990. The first publicly available description of HTML was a document called "HTML Tags", first mentioned on the Internet by Berners-Lee in late 1991. It describes 18 elements comprising the initial, relatively simple design of HTML. Except for the hyperlink tag, these were strongly influenced by SGMLguid, an in-house SGML-based documentation format at CERN. Eleven of these elements still exist in HTML 4.

Berners-Lee considered HTML an SGML application. The Internet Engineering Task Force (IETF) formally defined it as such with the mid-1993 publication of the first proposal for an HTML specification: "Hypertext Markup Language (HTML)" Internet-Draft by Berners-Lee and Dan Connolly, which included an SGML Document Type Definition to define the grammar. Many of the HTML text elements are found in the 1988 ISO technical report TR 9537 Techniques for using SGML, which in turn covers the features of early text formatting languages such as that used by the RUNOFF command developed in the early 1960s for the CTSS (Compatible Time-Sharing System) operating system. These formatting commands were derived from those used by typesetters to manually format documents. Steven DeRose argues that HTML's use of descriptive markup (and influence of SGML in particular) was a major factor in the success of the Web, because of the flexibility and extensibility that it enabled. HTML became the main markup language for creating web pages and other information that can be displayed in a web browser, and is quite likely the most used markup language in the world today.

XML

XML (Extensible Markup Language) is a meta markup language that is now widely used. XML was developed by the World Wide Web Consortium, in a committee created and chaired by Jon Bosak. The main purpose of XML was to simplify SGML by focusing on a particular problem—documents on the Internet. XML remains a meta-language like SGML, allowing users to create any tags needed (hence "extensible") and then describing those tags and their permitted uses.

XML adoption was helped because every XML document can be written in such a way that it is also an SGML document, and existing SGML users and software could switch to XML fairly easily. However, XML eliminated many of the more complex and human-oriented features of SGML to simplify implementation environments such as documents and publications. It appeared to strike a happy medium between simplicity and flexibility, and was rapidly adopted for many other uses. XML is now widely used for communicating data between applications.

XHTML

Since January 2000, all W3C Recommendations for HTML have been based on XML rather than SGML, using the abbreviation XHTML (Extensible HyperText Markup Language). The language specification requires that XHTML Web documents must be well-formed XML documents. This allows for more rigorous and robust documents while using tags familiar from HTML.

One of the most noticeable differences between HTML and XHTML is the rule that all tags must be closed: empty HTML tags such as <br> must either be closed with a regular end-tag, or replaced by a special form: <br /> (the space before the '/' on the end tag is optional, but frequently used because it enables some pre-XML Web browsers, and SGML parsers, to accept the tag). Another is that all attribute values in tags must be quoted. Finally, all tag and attribute names within the XHTML namespace must be lowercase to be valid. HTML, on the other hand, was case-insensitive.
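A short sketch of these differences (a fragment, not a complete document; the class attribute is included only to show attribute quoting):

<!-- Accepted in legacy HTML: uppercase names, unquoted attribute, unclosed tags -->
<P CLASS=verse>Line one<BR>
Line two

<!-- Equivalent well-formed XHTML: lowercase names, quoted attribute, every tag closed -->
<p class="verse">Line one<br />
Line two</p>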

Other XML-based applications

Many XML-based applications now exist, including the Resource Description Framework as RDF/XML, XForms, DocBook, SOAP, and the Web Ontology Language (OWL).

Features of markup languages

A common feature of many markup languages is that they intermix the text of a document with markup instructions in the same data stream or file. This is not necessary; it is possible to isolate markup from text content, using pointers, offsets, IDs, or other methods to co-ordinate the two. Such "standoff markup" is typical for the internal representations that programs use to work with marked-up documents. However, embedded or "inline" markup is much more common elsewhere. Here, for example, is a small section of text marked up in HTML:

<h1>Anatidae</h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely related screamers.
</p>

The codes enclosed in angle-brackets are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes h1, p, and em are examples of semantic markup, in that they describe the intended purpose or meaning of the text they include. Specifically, h1 means "this is a first-level heading", p means "this is a paragraph", and em means "this is an emphasized word or phrase". A program interpreting such structural markup may apply its own rules or styles for presenting the various pieces of text, using different typefaces, boldness, font size, indentation, colour, or other styles, as desired. A tag such as "h1" (header level 1) might be presented in a large bold sans-serif typeface, for example, or in a monospaced (typewriter-style) document it might be underscored – or it might not change the presentation at all.

In contrast, the i tag in HTML is an example of presentational markup; it is generally used to specify a particular characteristic of the text (in this case, the use of an italic typeface) without specifying the reason for that appearance.
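For example:

<!-- Presentational: the i element requests italic type without saying why -->
<p>The word <i>Anatidae</i> simply appears in italics.</p>

<!-- Descriptive: the em element marks emphasis; a stylesheet decides how emphasis is rendered -->
<p>Screamers are <em>not</em> members of this family.</p>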

The Text Encoding Initiative (TEI) has published extensive guidelines[19] for how to encode texts of interest in the humanities and social sciences, developed through years of international cooperative work. These guidelines are used by projects encoding historical documents, the works of particular scholars, periods, or genres, and so on.

Alternative usages

While the idea of markup language originated with text documents, there is increasing use of markup languages in the presentation of other types of information, including playlists, vector graphics, web services, content syndication, and user interfaces. Most of these are XML applications, because XML is a well-defined and extensible language.

The use of XML has also led to the possibility of combining multiple markup languages into a single profile, like XHTML+SMIL and XHTML+MathML+SVG.

Because markup languages, and more generally data description languages (not necessarily textual markup), are not programming languages (they are data without instructions), they are more easily manipulated than programming languages—for example, web pages are presented as HTML documents, not C code, and thus can be embedded within other web pages, displayed when only partially received, and so forth. This leads to the web design principle of the rule of least power, which advocates using the least (computationally) powerful language that satisfies a task to facilitate such manipulation and reuse.

Universal grammar

From Wikipedia, the free encyclopedia

Universal grammar (UG), in linguistics, is the theory of the genetic component of the language faculty, usually credited to Noam Chomsky. The basic postulate of UG is that a certain set of structural rules are innate to humans, independent of sensory experience. With more linguistic stimuli received in the course of psychological development, children then adopt specific syntactic rules that conform to UG. It is sometimes known as "mental grammar", and stands contrasted with other "grammars", e.g. prescriptive, descriptive and pedagogical. The advocates of this theory emphasize and partially rely on the poverty of the stimulus (POS) argument and the existence of some universal properties of natural human languages. However, the latter has not been firmly established, as some linguists have argued languages are so diverse that such universality is rare. It is a matter of empirical investigation to determine precisely what properties are universal and what linguistic capacities are innate.

Argument

The theory of universal grammar proposes that if human beings are brought up under normal conditions (not those of extreme sensory deprivation), then they will always develop language with certain properties (e.g., distinguishing nouns from verbs, or distinguishing function words from content words). The theory proposes that there is an innate, genetically determined language faculty that knows these rules, making it easier and faster for children to learn to speak than it otherwise would be. This faculty does not know the vocabulary of any particular language (so words and their meanings must be learned), and there remain several parameters which can vary freely among languages (such as whether adjectives come before or after nouns) which must also be learned.

As Chomsky puts it, "Evidently, development of language in the individual must involve three factors: (1) genetic endowment, which sets limits on the attainable languages, thereby making language acquisition possible; (2) external data, converted to the experience that selects one or another language within a narrow range; (3) principles not specific to the Faculty of Language."

Occasionally, aspects of universal grammar seem describable in terms of general details regarding cognition. For example, if a predisposition to categorize events and objects as different classes of things is part of human cognition, and directly results in nouns and verbs showing up in all languages, then it could be assumed that rather than this aspect of universal grammar being specific to language, it is more generally a part of human cognition. To distinguish properties of languages that can be traced to other facts regarding cognition from properties of languages that cannot, the abbreviation UG* can be used. UG is the term often used by Chomsky for those aspects of the human brain which cause language to be the way that it is (i.e. are universal grammar in the sense used here) but here for discussion, it is used for those aspects which are furthermore specific to language (thus UG, as Chomsky uses it, is just an abbreviation for universal grammar, but UG* as used here is a subset of universal grammar).

In the same article, Chomsky casts the theme of a larger research program in terms of the following question: "How little can be attributed to UG while still accounting for the variety of 'I-languages' attained, relying on third factor principles?" (I-languages meaning internal languages, the brain states that correspond to knowing how to speak and understand a particular language, and third factor principles meaning (3) in the previous quote).

Chomsky has speculated that UG might be extremely simple and abstract, for example only a mechanism for combining symbols in a particular way, which he calls "merge". The following quote shows that Chomsky does not use the term "UG" in the narrow sense UG* suggested above:
"The conclusion that merge falls within UG holds whether such recursive generation is unique to FL (faculty of language) or is appropriated from other systems."

In other words, merge is seen as part of UG because it causes language to be the way it is, universal, and is not part of the environment or general properties independent of genetics and environment. Merge is part of universal grammar whether it is specific to language, or whether, as Chomsky suggests, it is also used for an example in mathematical thinking.

The distinction is important because there is a long history of argument about UG*, whereas most people working on language agree that there is universal grammar. Many people assume that Chomsky means UG* when he writes UG (and in some cases he might actually mean UG* [though not in the passage quoted above]).

Some students of universal grammar study a variety of grammars to extract generalizations called linguistic universals, often in the form of "If X holds true, then Y occurs." These have been extended to a variety of traits, such as the phonemes found in languages, the word orders which languages choose, and the reasons why children exhibit certain linguistic behaviors.

Later linguists who have influenced this theory include Chomsky and Richard Montague, developing their version of this theory as they considered issues of the argument from poverty of the stimulus to arise from the constructivist approach to linguistic theory. The application of the idea of universal grammar to the study of second language acquisition (SLA) is represented mainly in the work of McGill linguist Lydia White.

Syntacticians generally hold that there are parametric points of variation between languages, although heated debate occurs over whether UG constraints are essentially universal due to being "hard-wired" (Chomsky's principles and parameters approach), a logical consequence of a specific syntactic architecture (the generalized phrase structure approach) or the result of functional constraints on communication (the functionalist approach).

Relation to the evolution of language

In an article titled, "The Faculty of Language: What Is It, Who Has It, and How Did It Evolve?" Hauser, Chomsky, and Fitch present the three leading hypotheses for how language evolved and brought humans to the point where they have a universal grammar.

The first hypothesis states that the faculty of language in the broad sense (FLb) is strictly homologous to animal communication. This means that homologous aspects of the faculty of language exist in non-human animals.

The second hypothesis states that the FLb is a derived, uniquely human, adaptation for language. This hypothesis holds that individual traits were subject to natural selection and came to be specialized for humans.

The third hypothesis states that only the faculty of language in the narrow sense (FLn) is unique to humans. It holds that while mechanisms of the FLb are present in both human and non-human animals, the computational mechanism of recursion is recently evolved solely in humans. This is the hypothesis which most closely aligns to the typical theory of universal grammar championed by Chomsky.

History

The idea of a universal grammar can be traced back to Roger Bacon's observations in his c. 1245 Overview of Grammar and c. 1268 Greek Grammar that all languages are built upon a common grammar, even though it may undergo incidental variations; and the 13th century speculative grammarians who, following Bacon, postulated universal rules underlying all grammars. The concept of a universal grammar or language was at the core of the 17th century projects for philosophical languages. There is a Scottish school of universal grammarians from the 18th century, as distinguished from the philosophical language project, which included authors such as James Beattie, Hugh Blair, James Burnett, James Harris, and Adam Smith. The article on grammar in the first edition of the Encyclopædia Britannica (1771) contains an extensive section titled "Of Universal Grammar".

The idea rose to prominence and influence in modern linguistics with theories from Chomsky and Montague in the 1950s–1970s, as part of the "linguistics wars".

During the early 20th century, in contrast, language was usually understood from a behaviourist perspective, suggesting that language acquisition, like any other kind of learning, could be explained by a succession of trials, errors, and rewards for success. In other words, children learned their mother tongue by simple imitation, through listening and repeating what adults said. For example, when a child says "milk" and the mother smiles and gives the child some milk as a result, the child finds this outcome rewarding, thus enhancing the child's language development.

Chomsky's theory

Chomsky argued that the human brain contains a limited set of constraints for organizing language. This implies in turn that all languages have a common structural basis: the set of rules known as "universal grammar".
Speakers proficient in a language know which expressions are acceptable in their language and which are unacceptable. The key puzzle is how speakers come to know these restrictions of their language, since expressions that violate those restrictions are not present in the input, nor are they indicated as ungrammatical. Chomsky argued that this poverty of stimulus means that Skinner's behaviourist perspective cannot explain language acquisition. The absence of negative evidence—evidence that an expression is part of a class of ungrammatical sentences in a given language—is the core of his argument. For example, in English, an interrogative pronoun like what cannot be related to a predicate within a relative clause:
*"What did John meet a man who sold?"
Such expressions are not available to language learners: they are, by hypothesis, ungrammatical. Speakers of the local language do not use them, nor do they point them out as unacceptable to language learners. Universal grammar offers an explanation for the presence of the poverty of the stimulus, by making certain restrictions into universal characteristics of human languages. Language learners are consequently never tempted to generalize in an illicit fashion.

Presence of creole languages

The presence of creole languages is sometimes cited as further support for this theory, especially by Bickerton's controversial language bioprogram theory. Creoles are languages that develop and form when disparate societies come together and are forced to devise a new system of communication. The system used by the original speakers is typically an inconsistent mix of vocabulary items, known as a pidgin. As these speakers' children begin to acquire their first language, they use the pidgin input to effectively create their own original language, known as a creole. Unlike pidgins, creoles have native speakers (those with acquisition from early childhood) and make use of a full, systematic grammar.

According to Bickerton, the idea of universal grammar is supported by creole languages because certain features are shared by virtually all in the category. For example, their default point of reference in time (expressed by bare verb stems) is not the present moment, but the past. Using pre-verbal auxiliaries, they uniformly express tense, aspect, and mood. Negative concord occurs, but it affects the verbal subject (as opposed to the object, as it does in languages like Spanish). Another similarity among creoles can be seen in the fact that questions are created simply by changing the intonation of a declarative sentence, not its word order or content.

However, extensive work by Carla Hudson-Kam and Elissa Newport suggests that creole languages may not support a universal grammar at all. In a series of experiments, Hudson-Kam and Newport looked at how children and adults learn artificial grammars. They found that children tend to ignore minor variations in the input when those variations are infrequent, and reproduce only the most frequent forms. In doing so, they tend to standardize the language that they hear around them. Hudson-Kam and Newport hypothesize that in a pidgin-development situation (and in the real-life situation of a deaf child whose parents are or were disfluent signers), children systematize the language they hear, based on the probability and frequency of forms, and not that which has been suggested on the basis of a universal grammar. Further, it seems to follow that creoles would share features with the languages from which they are derived, and thus look similar in terms of grammar.
Many researchers of universal grammar argue against a concept of relexification, which says that a language replaces its lexicon almost entirely with that of another. This goes against universalist ideas of an innate universal grammar.

Criticisms

Geoffrey Sampson maintains that universal grammar theories are not falsifiable and are therefore pseudoscientific. He argues that the grammatical "rules" linguists posit are simply post-hoc observations about existing languages, rather than predictions about what is possible in a language. Similarly, Jeffrey Elman argues that the unlearnability of languages assumed by universal grammar is based on a too-strict, "worst-case" model of grammar, that is not in keeping with any actual grammar. In keeping with these points, James Hurford argues that the postulate of a language acquisition device (LAD) essentially amounts to the trivial claim that languages are learnt by humans, and thus, that the LAD is less a theory than an explanandum looking for theories.

Morten H. Christiansen and Nick Chater have argued that the relatively fast-changing nature of language would prevent the slower-changing genetic structures from ever catching up, undermining the possibility of a genetically hard-wired universal grammar. Instead of an innate universal grammar, they claim, "apparently arbitrary aspects of linguistic structure may result from general learning and processing biases deriving from the structure of thought processes, perceptuo-motor factors, cognitive limitations, and pragmatics".

Hinzen summarizes the most common criticisms of universal grammar:
  • Universal grammar has no coherent formulation and is indeed unnecessary.
  • Universal grammar is in conflict with biology: it cannot have evolved by standardly accepted neo-Darwinian evolutionary principles.
  • There are no linguistic universals: universal grammar is refuted by abundant variation at all levels of linguistic organization, which lies at the heart of the human faculty of language.
In addition, it has been suggested that people learn about probabilistic patterns of word distributions in their language, rather than hard and fast rules. For example, children overgeneralize the past-tense marker "ed" and conjugate irregular verbs incorrectly, producing forms like goed and eated, and correct these errors over time. It has also been proposed that the poverty of the stimulus problem can be largely avoided, if it is assumed that children employ similarity-based generalization strategies in language learning, generalizing about the usage of new words from similar words that they already know how to use.

Language acquisition researcher Michael Ramscar has suggested that when children erroneously expect an ungrammatical form that then never occurs, the repeated failure of expectation serves as a form of implicit negative feedback that allows them to correct their errors over time, as when children correct overgeneralizations like goed to went through repeated failure. This implies that word learning is a probabilistic, error-driven process, rather than a process of fast mapping, as many nativists assume.

In the domain of field research, the Pirahã language is claimed to be a counterexample to the basic tenets of universal grammar. This research has been led by Daniel Everett. Among other things, this language is alleged to lack all evidence for recursion, including embedded clauses, as well as quantifiers and colour terms. According to the writings of Everett, the Pirahã showed these linguistic shortcomings not because they were simple-minded, but because their culture—which emphasized concrete matters in the present and also lacked creation myths and traditions of art making—did not necessitate it. Some other linguists have argued, however, that some of these properties have been misanalyzed, and that others are actually expected under current theories of universal grammar. Other linguists have attempted to reassess Pirahã to see if it did indeed use recursion. In a corpus analysis of the Pirahã language, linguists failed to disprove Everett's arguments against universal grammar and the lack of recursion in Pirahã. However, they also stated that there was "no strong evidence for the lack of recursion either" and they provided "suggestive evidence that Pirahã may have sentences with recursive structures".

Daniel Everett has gone as far as claiming that universal grammar does not exist. In his words, "universal grammar doesn't seem to work, there doesn't seem to be much evidence for [it]. And what can we put in its place? A complex interplay of factors, of which culture, the values human beings share, plays a major role in structuring the way that we talk and the things that we talk about." Michael Tomasello, a developmental psychologist, also supports this claim, arguing that "although many aspects of human linguistic competence have indeed evolved biologically, specific grammatical principles and constructions have not. And universals in the grammatical structure of different languages have come from more general processes and constraints of human cognition, communication, and vocal-auditory processing, operating during the conventionalization and transmission of the particular grammatical constructions of particular linguistic communities."

Introduction to entropy

From Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Introduct...