The term data model is used in two distinct but closely
related senses. Sometimes it refers to an abstract formalization of the
objects and relationships found in a particular application domain, for
example the customers, products, and orders found in a manufacturing
organization. At other times it refers to a set of concepts used in
defining such formalizations: for example concepts such as entities,
attributes, relations, or tables. So the "data model" of a banking
application may be defined using the entity-relationship "data model".
This article uses the term in both senses.
A data model explicitly determines the structure of data. Data models are specified in a data modeling notation, which is often graphical in form.
A data model can sometimes be referred to as a data structure, especially in the context of programming languages. Data models are often complemented by function models, especially in the context of enterprise models.
Overview
Managing large quantities of structured and unstructured data is a primary function of information systems.
Data models describe the structure, manipulation and integrity aspects
of the data stored in data management systems such as relational
databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video.
The role of data models
The main aim of data models is to support the development of information systems
by providing the definition and format of data. According to West and
Fowler (1999) "if this is done consistently across systems then
compatibility of data can be achieved. If the same data structures are
used to store and access data then different applications can share
data. The results of this are indicated above. However, systems and
interfaces often cost more than they should, to build, operate, and
maintain. They may also constrain the business rather than support it. A
major cause is that the quality of the data models implemented in
systems and interfaces is poor".
- "Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces".
- "Entity types are often not identified, or incorrectly identified. This can lead to replication of data, data structure, and functionality, together with the attendant costs of that duplication in development and maintenance".
- "Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25-70% of the cost of current systems".
- "Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardized. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper".
The reason for these problems is a lack of standards that will ensure
that data models will both meet business needs and be consistent.
A data model explicitly determines the structure of data. Typical
applications of data models include database models, design of
information systems, and enabling exchange of data. Usually data models
are specified in a data modeling language.
Three perspectives
A data model instance may be one of three kinds according to ANSI in 1975:
- Conceptual data model : describes the semantics of a domain, being the scope of the model. For example, it may be a model of the interest area of an organization or industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial 'language' with a scope that is limited by the scope of the model.
- Logical data model : describes the semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object-oriented classes, and XML tags, among other things.
- Physical data model : describes the physical means by which data are stored. This is concerned with partitions, CPUs, tablespaces, and the like.
The significance of this approach, according to ANSI, is that it
allows the three perspectives to be relatively independent of each
other. Storage technology can change without affecting either the
logical or the conceptual model. The table/column structure can change
without (necessarily) affecting the conceptual model. In each case, of
course, the structures must remain consistent with the other model. The
table/column structure may be different from a direct translation of
the entity classes and attributes, but it must ultimately carry out the
objectives of the conceptual entity class structure. Early phases of
many software development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into a physical data model. However, it is also possible to implement a conceptual model directly.
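The independence of the three perspectives can be illustrated with a small sketch. It assumes Python with the standard-library sqlite3 module; the entity classes (Customer, Order), the table names, and the helper orders_for are hypothetical and chosen only for illustration. The conceptual level is written as plain classes, the logical level as tables and columns, and a physical change (adding an index) leaves query results unchanged.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual level: entity classes and one relationship, no storage detail.
@dataclass(frozen=True)
class Customer:
    name: str

@dataclass(frozen=True)
class Order:
    customer: Customer  # relationship: each Order is placed by one Customer
    product: str

# Logical level: the same semantics expressed as tables and columns.
LOGICAL_SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer_order (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    product TEXT NOT NULL
);
"""

def orders_for(conn, name):
    """Return the products ordered by the named customer, sorted by name."""
    rows = conn.execute(
        "SELECT o.product FROM customer_order AS o "
        "JOIN customer AS c ON c.id = o.customer_id "
        "WHERE c.name = ? ORDER BY o.product",
        (name,),
    )
    return [product for (product,) in rows]

acme = Customer("Acme")
orders = [Order(acme, "valve"), Order(acme, "pump")]

conn = sqlite3.connect(":memory:")
conn.executescript(LOGICAL_SCHEMA)
conn.execute("INSERT INTO customer (id, name) VALUES (1, ?)", (acme.name,))
conn.executemany(
    "INSERT INTO customer_order (customer_id, product) VALUES (1, ?)",
    [(order.product,) for order in orders],
)

before = orders_for(conn, "Acme")

# Physical level: an index changes how rows are located,
# not the tables, the columns, or the query results.
conn.execute("CREATE INDEX idx_order_customer ON customer_order (customer_id)")
after = orders_for(conn, "Acme")
```

Here before and after hold the same list, which is the point of the separation: the storage-level change is invisible to the logical and conceptual levels.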
History
One of the earliest pioneering works in modelling information systems was done by Young and Kent (1958), who argued for "a precise and abstract way of specifying the informational and time characteristics of a data processing problem". They wanted to create "a notation that should enable the analyst to organize the problem around any piece of hardware".
Their work was a first effort to create an abstract specification and
invariant basis for designing different alternative implementations
using different hardware components. A next step in IS modelling was
taken by CODASYL,
an IT industry consortium formed in 1959, who essentially aimed at the
same thing as Young and Kent: the development of "a proper structure for
machine independent problem definition language, at the system level of
data processing". This led to the development of a specific IS information algebra.
In the 1960s data modeling gained more significance with the initiation of the management information system
(MIS) concept. According to Leondes (2002), "during that time, the
information system provided the data and information for management
purposes. The first generation database system, called Integrated Data Store (IDS), was designed by Charles Bachman at General Electric. Two famous database models, the network data model and the hierarchical data model, were proposed during this period of time". Towards the end of the 1960s, Edgar F. Codd worked out his theories of data arrangement, and proposed the relational model for database management based on first-order predicate logic.
In the 1970s entity relationship modeling emerged as a new type of conceptual data modeling, originally proposed in 1976 by Peter Chen. Entity relationship models were being used in the first stage of information system design during the requirements analysis to describe information needs or the type of information that is to be stored in a database. This technique can describe any ontology, i.e., an overview and classification of concepts and their relationships, for a certain area of interest.
In the 1970s G.M. Nijssen developed the "Natural Language Information Analysis Method" (NIAM), and extended it in the 1980s in cooperation with Terry Halpin into Object-Role Modeling
(ORM). However, it was Terry Halpin's 1989 PhD thesis that created the
formal foundation on which Object-Role Modeling is based.
Bill Kent, in his 1978 book Data and Reality,
compared a data model to a map of a territory, emphasizing that in the
real world, "highways are not painted red, rivers don't have county
lines running down the middle, and you can't see contour lines on a
mountain". In contrast to other researchers who tried to create models
that were mathematically clean and elegant, Kent emphasized the
essential messiness of the real world, and the task of the data modeller
to create order out of chaos without excessively distorting the truth.
In the 1980s, according to Jan L. Harrington (2000), "the development of the object-oriented
paradigm brought about a fundamental change in the way we look at data
and the procedures that operate on data. Traditionally, data and
procedures have been stored separately: the data and their relationship
in a database, the procedures in an application program. Object
orientation, however, combined an entity's procedure with its data."
Types of data models
Database model
A database model is a specification describing how a database is structured and used.
Several such models have been suggested. Common models include:
- Flat model
- This may not strictly qualify as a data model. The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another.
- Hierarchical model
- The hierarchical model is similar to the network model except that links in the hierarchical model form a tree structure, while the network model allows an arbitrary graph.
- Network model
- This model organizes data using two fundamental constructs, called records and sets. Records contain fields, and sets define one-to-many relationships between records: one owner, many members. The network data model is an abstraction of the design concept used in the implementation of databases.
- Relational model
- This is a database model based on first-order predicate logic. Its core idea is to describe a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values. The power of the relational data model lies in its mathematical foundations and a simple user-level paradigm.
- Object-relational model
- Similar to a relational database model, but objects, classes and inheritance are directly supported in database schemas and in the query language.
- Object-role modeling
- A method of data modeling that has been defined as "attribute free", and "fact based". The result is a verifiably correct system, from which other common artifacts, such as ERD, UML, and semantic models may be derived. Associations between data objects are described during the database design procedure, such that normalization is an inevitable result of the process.
- Star schema
- The simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.
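As an illustration of the last entry, a star schema can be sketched with one fact table that references two dimension tables. This is a minimal example assuming Python's standard-library sqlite3 module; the table names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the things being measured.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

-- The fact table holds the measurements and points at the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount INTEGER
);

INSERT INTO dim_product VALUES (1, 'pump'), (2, 'valve');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 100), (1, 11, 150), (2, 11, 40);
""")

# A typical query joins the fact table to its dimensions and aggregates.
totals = conn.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
```

Every join in the query radiates from the central fact table to a dimension table, which is the shape that gives the schema its name.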
Data structure diagram
A data structure diagram (DSD) is a diagram and data model used to describe conceptual data models by providing graphical notations which document entities and their relationships, and the constraints that bind them. The basic graphic elements of DSDs are boxes, representing entities, and arrows, representing relationships. Data structure diagrams are most useful for documenting complex data entities.
Data structure diagrams are an extension of the entity-relationship model (ER model). In DSDs, attributes
are specified inside the entity boxes rather than outside of them,
while relationships are drawn as boxes composed of attributes which
specify the constraints that bind entities together. DSDs differ from
the ER model in that the ER model focuses on the relationships between
different entities, whereas DSDs focus on the relationships of the
elements within an entity and enable users to fully see the links and
relationships between each entity.
There are several styles for representing data structure diagrams, with the notable difference in the manner of defining cardinality. The choices are between arrow heads, inverted arrow heads (crow's feet), or numerical representation of the cardinality.
Entity-relationship model
An entity-relationship model (ERM), sometimes referred to as an
entity-relationship diagram (ERD), can be used to represent an
abstract conceptual data model (or semantic data model or physical data model) used in software engineering to represent structured data. There are several notations used for ERMs. As in DSDs, attributes
are specified inside the entity boxes rather than outside of them,
while relationships are drawn as lines, with the relationship
constraints as descriptions on the line. The E-R model, while robust,
can become visually cumbersome when representing entities with several
attributes.
Geographic data model
A data model in geographic information systems is a mathematical construct for representing geographic objects or surfaces as data. For example,
- the vector data model represents geography as collections of points, lines, and polygons;
- the raster data model represents geography as cell matrices that store numeric values;
- and the triangulated irregular network (TIN) data model represents geography as sets of contiguous, nonoverlapping triangles.
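The three representations can be sketched in plain Python. The coordinates, the elevation values, and the helper names (polygon_area, triangle_area_2d) are assumptions made for this illustration and do not correspond to any particular GIS package.

```python
# Vector model: geometry stored as coordinates.
point = (2.0, 3.0)
line = [(0.0, 0.0), (4.0, 0.0)]
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]

def polygon_area(vertices):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Raster model: a matrix of cells, each storing a numeric value
# (here, a hypothetical elevation).
raster = [
    [10, 12, 15],
    [11, 14, 18],
]
highest_cell = max(max(row) for row in raster)

# TIN model: shared (x, y, z) vertices plus triangles given as index triples.
tin_vertices = [(0.0, 0.0, 10.0), (4.0, 0.0, 12.0), (4.0, 3.0, 18.0), (0.0, 3.0, 11.0)]
tin_triangles = [(0, 1, 2), (0, 2, 3)]

def triangle_area_2d(a, b, c):
    """Planimetric (x, y) area of a triangle, ignoring elevation."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

tin_area = sum(
    triangle_area_2d(tin_vertices[i], tin_vertices[j], tin_vertices[k])
    for i, j, k in tin_triangles
)
```

In this sketch the two TIN triangles tile the same rectangle as the vector polygon, so their combined planimetric area equals the polygon's area.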
Generic data model
Generic data models are generalizations of conventional data models.
They define standardized general relation types, together with the kinds
of things that may be related by such a relation type. Generic data
models are developed as an approach to solve some shortcomings of
conventional data models. For example, different modelers usually
produce different conventional data models of the same domain. This can
lead to difficulty in bringing the models of different people together
and is an obstacle for data exchange and data integration. Invariably,
however, this difference is attributable to different levels of
abstraction in the models and differences in the kinds of facts that can
be instantiated (the semantic expression capabilities of the models).
The modelers need to communicate and agree on certain elements which are
to be rendered more concretely, in order to make the differences less
significant.
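One way to picture a generic data model is as a small set of standardized relation types, each declaring the kinds of things it may relate, with all facts stored in one uniform shape. The sketch below is only an illustration under that reading; the class names, the relation types "is part of" and "is owned by", and the kinds used are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationType:
    name: str
    left_kind: str   # kind of thing allowed on the left-hand side
    right_kind: str  # kind of thing allowed on the right-hand side

@dataclass(frozen=True)
class Thing:
    name: str
    kind: str

@dataclass(frozen=True)
class Fact:
    left: Thing
    relation: RelationType
    right: Thing

# Standardized relation types shared by every model built on this scheme.
IS_PART_OF = RelationType("is part of", "physical object", "physical object")
IS_OWNED_BY = RelationType("is owned by", "physical object", "party")

def make_fact(left, relation, right):
    """Create a fact only if both things are of the kinds the relation allows."""
    if left.kind != relation.left_kind or right.kind != relation.right_kind:
        raise ValueError(
            f"'{relation.name}' cannot relate {left.kind} to {right.kind}"
        )
    return Fact(left, relation, right)

pump = Thing("pump P-101", "physical object")
plant = Thing("plant A", "physical object")
owner = Thing("Acme Ltd", "party")

facts = [
    make_fact(pump, IS_PART_OF, plant),
    make_fact(plant, IS_OWNED_BY, owner),
]
```

Because every fact has the same shape, two modelers who agree on the relation types produce data that can be merged without a bespoke interface.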
Semantic data model
A semantic data model in software engineering is a technique to
define the meaning of data within the context of its interrelationships
with other data. A semantic data model is an abstraction which defines
how the stored symbols relate to the real world. A semantic data model is sometimes called a conceptual data model.
The logical data structure of a database management system (DBMS), whether hierarchical, network, or relational, cannot totally satisfy the requirements
for a conceptual definition of data because it is limited in scope and
biased toward the implementation strategy employed by the DBMS.
Therefore, the need to define data from a conceptual view
has led to the development of semantic data modeling techniques. That
is, techniques to define the meaning of data within the context of its
interrelationships with other data. As illustrated in the figure, the
real world, in terms of resources, ideas, events, etc., is symbolically
defined within physical data stores. A semantic data model is an
abstraction which defines how the stored symbols relate to the real
world. Thus, the model must be a true representation of the real world.
Data model topics
Data architecture
Data architecture is the design of data for use in defining the
target state and the subsequent planning needed to hit the target state.
It is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.
A data architecture describes the data structures used by a
business and/or its applications. There are descriptions of data in
storage and data in motion; descriptions of data stores, data groups and
data items; and mappings of those data artifacts to data qualities,
applications, locations, etc.
Essential to realizing the target state, data architecture
describes how data is processed, stored, and utilized in a given system.
It provides criteria for data processing operations that make it
possible to design data flows and also control the flow of data in the
system.
Data modeling
Data modeling in software engineering
is the process of creating a data model by applying formal data model
descriptions using data modeling techniques. Data modeling is a
technique for defining business requirements for a database. It is sometimes called database modeling because a data model is eventually implemented in a database.
The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model.
The data model will normally consist of entity types, attributes,
relationships, integrity rules, and the definitions of those objects.
This is then used as the start point for interface or database design.
Data properties
Some important properties of data for which requirements need to be met are:
- definition-related properties
- relevance: the usefulness of the data in the context of your business.
- clarity: the availability of a clear and shared definition for the data.
- consistency: the compatibility of the same type of data from different sources.
- content-related properties
- timeliness: the availability of data at the time required and how up to date that data is.
- accuracy: how close to the truth the data is.
- properties related to both definition and content
- completeness: how much of the required data is available.
- accessibility: where, how, and to whom the data is available or not available (e.g. security).
- cost: the cost incurred in obtaining the data, and making it available for use.
Data organization
Another kind of data model describes how to organize data using a database management system
or other data management technology. It describes, for example,
relational tables and columns or object-oriented classes and attributes.
Such a data model is sometimes referred to as the physical data model,
but in the original ANSI three-schema architecture, it is called
"logical". In that architecture, the physical model describes the
storage media (cylinders, tracks, and tablespaces). Ideally, this model
is derived from the more conceptual data model described above. It may
differ, however, to account for constraints like processing capacity and
usage patterns.
While data analysis is a common term for data modeling, the activity actually has more in common with the ideas and methods of synthesis (inferring general concepts from particular instances) than it does with analysis (identifying component concepts from more general ones). (Presumably we call ourselves systems analysts because no one can say systems synthesists.)
Data modeling strives to bring the data structures of interest
together into a cohesive, inseparable whole by eliminating unnecessary
data redundancies and by relating data structures with relationships.
A different approach is to use adaptive systems such as artificial neural networks that can autonomously create implicit models of data.
Data structure
A data structure is a way of storing data in a computer so that it
can be used efficiently. It is an organization of mathematical and
logical concepts of data. Often a carefully chosen data structure will
allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type.
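As a small illustration of starting from an abstract data type, the sketch below defines a first-in, first-out queue by its operations and only then picks a concrete structure for it. The class name Queue and its method names are assumptions for this example; collections.deque is part of Python's standard library.

```python
from collections import deque

class Queue:
    """FIFO queue abstract data type: items leave in the order they arrived."""

    def __init__(self):
        # deque supports constant-time appends and pops at both ends;
        # a plain list would need O(n) work for every pop from the front.
        self._items = deque()

    def enqueue(self, item):
        self._items.append(item)

    def dequeue(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)

jobs = Queue()
for name in ["parse", "validate", "store"]:
    jobs.enqueue(name)

first = jobs.dequeue()
remaining = len(jobs)
```

Callers depend only on enqueue and dequeue, so the underlying structure can be replaced later without changing their code.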
A data model describes the structure of the data within a given
domain and, by implication, the underlying structure of that domain
itself. This means that a data model in fact specifies a dedicated grammar
for a dedicated artificial language for that domain. A data model
represents classes of entities (kinds of things) about which a company
wishes to hold information, the attributes of that information, and
relationships among those entities and (often implicit) relationships
among those attributes. The model describes the organization of the data
to some extent irrespective of how data might be represented in a
computer system.
The entities represented by a data model can be tangible
entities, but models that include such concrete entity classes tend to
change over time. Robust data models often identify abstractions
of such entities. For example, a data model might include an entity
class called "Person", representing all the people who interact with an
organization. Such an abstract entity
class is typically more appropriate than ones called "Vendor" or
"Employee", which identify specific roles played by those people.
Data model theory
The term data model can have two meanings:
- A data model theory, i.e. a formal description of how data may be structured and accessed.
- A data model instance, i.e. applying a data model theory to create a practical data model instance for some particular application.
A data model theory has three main components:
- The structural part: a collection of data structures which are used to create databases representing the entities or objects modeled by the database.
- The integrity part: a collection of rules governing the constraints placed on these data structures to ensure structural integrity.
- The manipulation part: a collection of operators which can be applied to the data structures, to update and query the data contained in the database.
For example, in the relational model, the structural part is based on a modified concept of the mathematical relation; the integrity part is expressed in first-order logic and the manipulation part is expressed using the relational algebra, tuple calculus and domain calculus.
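The three components can be seen side by side in a small relational example. The sketch assumes Python's standard-library sqlite3 module and uses SQL as a practical stand-in for the relational algebra and calculi named above; the account table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structural part: a relation (table) with typed attributes (columns).
# Integrity part: constraints declared on that structure.
conn.execute("""
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        owner TEXT NOT NULL,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")

# Manipulation part: operators that update and query the data.
conn.execute("INSERT INTO account (id, owner, balance) VALUES (1, 'Ada', 50)")
conn.execute("UPDATE account SET balance = balance + 25 WHERE id = 1")
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

# An update that would violate the integrity rule is rejected by the DBMS.
try:
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

balance_after = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```

The rejected update leaves the stored balance untouched, showing that the integrity rules are enforced by the model rather than by each application.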
A data model instance is created by applying a data model theory.
This is typically done to solve some business enterprise requirement.
Business requirements are normally captured by a semantic logical data model.
This is transformed into a physical data model instance from which is
generated a physical database. For example, a data modeler may use a
data modeling tool to create an entity-relationship model of the corporate data repository of some business enterprise. This model is transformed into a relational model, which in turn generates a relational database.
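The transformation described here can be sketched in miniature. The example below is a simplified illustration, not how any particular modeling tool works: the ENTITIES and RELATIONSHIPS structures and the to_ddl function are hypothetical, and Python's standard-library sqlite3 module plays the role of the target relational database.

```python
import sqlite3

# A minimal entity-relationship description: entity types with attributes...
ENTITIES = {
    "customer": {"name": "TEXT"},
    "purchase": {"product": "TEXT"},
}
# ...and one-to-many relationships written as (child entity, parent entity).
RELATIONSHIPS = [("purchase", "customer")]

def to_ddl(entities, relationships):
    """Translate the ER description into CREATE TABLE statements."""
    statements = []
    for entity, attributes in entities.items():
        columns = ["id INTEGER PRIMARY KEY"]
        columns += [
            f"{name} {sql_type} NOT NULL" for name, sql_type in attributes.items()
        ]
        # Each relationship becomes a foreign key column on the child table.
        for child, parent in relationships:
            if child == entity:
                columns.append(
                    f"{parent}_id INTEGER NOT NULL REFERENCES {parent}(id)"
                )
        statements.append(f"CREATE TABLE {entity} ({', '.join(columns)});")
    return statements

ddl = to_ddl(ENTITIES, RELATIONSHIPS)

# Generating the database is then a matter of executing the statements.
conn = sqlite3.connect(":memory:")
conn.executescript("\n".join(ddl))
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Real tools also handle many-to-many relationships, naming rules, and data type mapping, which this sketch deliberately leaves out.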
Patterns
Patterns are common data modeling structures that occur in many data models.
Related models
Data flow diagram
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. It differs from the flowchart as it shows the data flow instead of the control flow of the program. A data flow diagram can also be used for the visualization of data processing (structured design). Data flow diagrams were invented by Larry Constantine, the original developer of structured design, based on Martin and Estrin's "data flow graph" model of computation.
It is common practice to draw a context-level data flow diagram first, which shows the interaction between the system and outside entities. The DFD
is designed to show how a system is divided into smaller portions and
to highlight the flow of data between those parts. This context-level
data flow diagram is then "exploded" to show more detail of the system
being modeled.
Information model
An information model is not a type of data model, but more or less an
alternative model. Within the field of software engineering both a data
model and an information model can be abstract, formal representations
of entity types that include their properties, relationships and the
operations that can be performed on them. The entity types in the model
may be kinds of real-world objects, such as devices in a network, or
they may themselves be abstract, such as for the entities used in a
billing system. Typically, they are used to model a constrained domain
that can be described by a closed set of entity types, properties,
relationships and operations.
According to Lee (1999) an information model is a representation of concepts, relationships, constraints, rules, and operations to specify data semantics
for a chosen domain of discourse. It can provide sharable, stable, and
organized structure of information requirements for the domain context. More generally, the term information model
is used for models of individual things, such as facilities, buildings,
process plants, etc. In those cases the concept is specialised to Facility Information Model, Building Information Model,
Plant Information Model, etc. Such an information model is an
integration of a model of the facility with the data and documents about
the facility.
An information model provides formalism to the description of a
problem domain without constraining how that description is mapped to an
actual implementation in software. There may be many mappings of the
information model. Such mappings are called data models, irrespective of
whether they are object models (e.g. using UML), entity relationship models or XML schemas.
Object model
An object model in computer science is a collection of objects or
classes through which a program can examine and manipulate some specific
parts of its world. In other words, the object-oriented interface to
some service or system. Such an interface is said to be the object model of the represented service or system. For example, the Document Object Model (DOM) is a collection of objects that represent a page in a web browser, used by script programs to examine and dynamically change the page. There is a Microsoft Excel object model for controlling Microsoft Excel from another program, and the ASCOM Telescope Driver is an object model for controlling an astronomical telescope.
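The DOM example can be tried outside a browser. The sketch below uses xml.dom.minidom from Python's standard library, which implements a subset of the DOM interfaces; the markup and the text being changed are made up for illustration.

```python
from xml.dom.minidom import parseString

# Parse a tiny page into a tree of DOM objects.
doc = parseString("<html><body><p id='greeting'>Hello</p></body></html>")

# Examine the page through the object model.
paragraph = doc.getElementsByTagName("p")[0]
original_text = paragraph.firstChild.data

# Change the page through the same objects.
paragraph.firstChild.data = "Hello, world"
paragraph.setAttribute("class", "updated")

# Add a new element next to the existing one.
note = doc.createElement("p")
note.appendChild(doc.createTextNode("Added by script"))
paragraph.parentNode.appendChild(note)

paragraph_count = len(doc.getElementsByTagName("p"))
```

The script never touches the markup text directly: every inspection and change goes through objects such as the document, its elements, and their text nodes.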
In computing the term object model has a distinct second meaning of the general properties of objects in a specific computer programming language, technology, notation or methodology that uses them. For example, the Java object model, the COM object model, or the object model of OMT. Such object models are usually defined using concepts such as class, message, inheritance, polymorphism, and encapsulation. There is an extensive literature on formalized object models as a subset of the formal semantics of programming languages.
Object-Role Model
Object-Role Modeling (ORM) is a method for conceptual modeling, and can be used as a tool for information and rules analysis.
Object-Role Modeling is a fact-oriented method for performing systems analysis
at the conceptual level. The quality of a database application depends
critically on its design. To help ensure correctness, clarity,
adaptability and productivity, information systems are best specified
first at the conceptual level, using concepts and language that people
can readily understand.
The conceptual design may include data, process and behavioral
perspectives, and the actual DBMS used to implement the design might be
based on one of many logical data models (relational, hierarchic,
network, object-oriented etc.).
Unified Modeling Language models
The Unified Modeling Language (UML) is a standardized general-purpose modeling language in the field of software engineering. It is a graphical language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. The Unified Modeling Language offers a standard way to write a system's blueprints, including:
- Conceptual things such as business processes and system functions
- Concrete things such as programming language statements, database schemas, and reusable software components.
UML offers a mix of functional models, data models, and database models.