Databases 1970

A Relational Model of Data for Large Shared Data Banks

Edgar F. Codd (IBM)

Store data as plain tables of relations — and free every program from how the bytes are arranged.

In depth · the introduction

Before 1970, asking a computer for data meant knowing exactly where it sat. Codd's idea was radical in its simplicity: store everything as plain tables, and just describe what you want.

The big idea

A database is a place to keep huge amounts of organised information — customers, orders, flights, accounts. In the 1960s, getting an answer out of one meant following a rigid trail of pointers the designer had laid down in advance; programs had to know the physical path to every piece of data, so any change to how the data was stored could break them.

Edgar Codd, a mathematician at IBM, proposed something far simpler. Keep all the data as ordinary tables (he called them relations), where each row is one record and each column is one kind of fact. Then let people ask for data by describing it ("all suppliers in Paris") instead of telling the machine how to go and fetch it. The computer figures out the how. That separation, between what you want and where it lives, is called data independence, and it is the problem the paper sets out to solve.
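A minimal sketch of asking by description, using Python's built-in sqlite3 module. The supplier table, its columns, and the sample rows are hypothetical, invented here for illustration, and SQL itself came after Codd's paper. The point is only that the query states a condition and says nothing about where the rows are stored.

```python
import sqlite3

# Hypothetical sample data: table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO supplier (id, name, city) VALUES (?, ?, ?)",
    [(1, "Jones", "Paris"), (2, "Smith", "London"), (3, "Blake", "Paris")],
)

def suppliers_in(city):
    """Describe the rows wanted; the engine decides how to find them."""
    rows = conn.execute("SELECT name FROM supplier WHERE city = ?", (city,))
    return {name for (name,) in rows}

paris_suppliers = suppliers_in("Paris")
print(sorted(paris_suppliers))
```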

How it came about

Codd published the idea in 1970 in a research journal, and at first it met resistance — even inside IBM, which was heavily invested in an older system. Many engineers simply did not believe a database built on mathematical tables could ever be fast enough to use. The disagreement came to a head in a famous 1974 public debate between Codd and Charles Bachman, the champion of the older "network" approach.

What settled it was not debate but working software. Two teams turned the theory into real systems: IBM's own System R project in San Jose, which produced the query language SQL, and the Ingres project at Berkeley. They proved a relational database could be both elegant and fast. By the 1980s the relational model had won; Codd received computing's highest honour, the Turing Award, in 1981.

Why it mattered

Almost every database you have ever indirectly touched — your bank, your airline booking, an online shop, a hospital's records — is a relational one, descended from this paper. The language for talking to them, SQL, became one of the most widely used in the world. By letting people ask for information by its meaning rather than its location, Codd made data something you could reason about, combine, and trust without being a storage expert.

A way to picture it

Think of an old library with no catalogue: to find a book you'd have to know its exact shelf, and if the librarian rearranged the shelves, your directions would be useless. Codd's model is the catalogue. You describe the book you want — author, subject — and the system finds it, no matter where it has been shelved. Rearranging the shelves for efficiency no longer breaks anything, because you never relied on the location in the first place.

Where it sits

Codd took a tool from pure mathematics — the idea of a relation as a set of tuples, the same set theory behind much of logic — and pointed it at a grubby practical problem: how to store a company's records. The Library holds the threads on either side: Shannon and Turing built the theory of information and computation this rests on, and the descendants of Codd's tables now hold the data that today's AI, from the Transformer onward, learns from.

The original document

Original source text

Abstract & Introduction

E. F. Codd · Communications of the ACM 13, no. 6 (June 1970): 377–387 · IBM Research Laboratory, San Jose, California

Abstract

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).

The abstract argues that activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed, and even when some aspects of the external representation are changed — because such changes are inevitable as query, update, and report traffic and the stored data both grow.

A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced.

1.1. Introduction

Codd notes that existing formatted-data systems give users tree-structured files or slightly more general network models, and then names the target of the paper precisely:

… the problems treated here are those of data independence — the independence of application programs and terminal activities from growth in data types and changes in data representation …

1.2 Data dependence today

Section 1.2 catalogues the ways present systems force a program to depend on storage decisions it should not have to know about. Codd separates three: ordering dependence (the program assumes records sit in a particular order), indexing dependence (the program must know which indices exist), and access path dependence (the program must navigate a fixed hierarchy or network of pointers to reach the data).

Each dependence means that a change made for performance — adding an index, re-ordering a file, regrouping records — can break working application logic. The relational model is offered as the cure: a level of description above all of these choices.
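The following sketch illustrates the indexing case with Python's built-in sqlite3 module. The part table and its rows are hypothetical. The same query is run before and after an index is added, and because the query never names an index, a file order, or a pointer path, its answer should not change.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (part_no INTEGER, name TEXT, colour TEXT)")
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [(3, "bolt", "blue"), (1, "nut", "red"), (2, "screw", "red")],
)

def red_parts():
    # The query names no index, no file order, and no pointer path.
    rows = conn.execute("SELECT part_no, name FROM part WHERE colour = 'red'")
    return set(rows)

before = red_parts()
# A storage-level change made purely for performance.
conn.execute("CREATE INDEX part_by_colour ON part (colour)")
after = red_parts()
print(before == after)
```

Comparing the results as sets also sidesteps ordering dependence: the program does not assume the rows come back in any particular order.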

[ … ]

1.3 A relational view of data

The term relation is used here in its accepted mathematical sense.

Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on; equivalently, R is a subset of the Cartesian product S1 × S2 × … × Sn. Each Sj is called the jth domain of R.

R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.
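The definition can be made concrete with Python sets. This is only a sketch: the three domains and the tuples in R are hypothetical values chosen for illustration, and the helper names are not from the paper.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
S1 = {"s1", "s2"}        # supplier ids
S2 = {"nut", "bolt"}     # part names
S3 = {100, 200, 300}     # quantities

# A relation is a set of n-tuples; this one has degree 3 (ternary).
R = {("s1", "nut", 100), ("s1", "bolt", 200), ("s2", "nut", 300)}

def is_relation_on(relation, *domains):
    """True if relation is a subset of the Cartesian product of the domains."""
    return relation <= set(product(*domains))

def degree(relation):
    """Number of domains, read off any tuple (assumes a non-empty relation)."""
    return len(next(iter(relation)))

print(is_relation_on(R, S1, S2, S3), degree(R))
```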

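An array that represents an n-ary relation has these properties: (1) each row is an n-tuple; (2) the ordering of rows is immaterial; (3) all rows are distinct; (4) the ordering of columns is significant and matches the ordering of the domains; (5) each column is labelled with the name of its domain, which conveys part of its meaning. The second and third properties in particular separate a relation from an ordinary file, where records sit in a fixed order and may repeat.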
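A short sketch of the row and column properties, modelling a file as a Python list and a relation as a set of tuples. The sample rows are hypothetical.

```python
# A file: records sit in a fixed order and may repeat (hypothetical sample rows).
rows_as_listed = [("s1", "nut"), ("s2", "bolt"), ("s1", "nut")]

# A relation: a set of tuples.
relation = set(rows_as_listed)

# Same rows, listed in the opposite order.
reordered = set(reversed(rows_as_listed))

# Same rows, with the two columns exchanged.
swapped = {(b, a) for (a, b) in relation}

print(len(relation))          # duplicate rows collapse into one
print(relation == reordered)  # row order is immaterial
print(relation == swapped)    # column order is significant
```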

The totality of data in a data bank may be viewed as a collection of time-varying relations.

1.4 Normal form

A relation all of whose domains are simple (atomic) can be stored as a flat, two-dimensional, column-homogeneous array. A relation with a nonsimple domain — a relation nested inside another relation — needs a more complicated structure. Codd shows the nesting can always be removed:

There is, in fact, a very simple elimination procedure, which we shall call normalization.

Applied to his employee / job-history / children example, normalization replaces one nested relation with several flat relations, each carrying the primary key of its parent so that the pieces stay linked through shared domain values. This short section is the seed of the normal-form theory (1NF, 2NF, 3NF, and onward) that Codd and others would develop over the next several years.
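The elimination step can be sketched in Python. The nested records below are loosely modelled on the employee example, but the field names and values are invented for illustration, and only one level of nesting is handled. Each nested list becomes its own flat relation, with the parent's key copied into every row.

```python
# Hypothetical nested records; field names and values are assumptions.
employees = [
    {
        "man_no": 1,
        "name": "Ada",
        "jobhistory": [
            {"jobdate": 1965, "title": "clerk"},
            {"jobdate": 1968, "title": "analyst"},
        ],
        "children": [{"childname": "Bo", "birthyear": 1966}],
    },
    {
        "man_no": 2,
        "name": "Cy",
        "jobhistory": [{"jobdate": 1969, "title": "clerk"}],
        "children": [],
    },
]

def normalize(nested):
    """Flatten one level of nesting by copying the parent's key into each child row."""
    employee, jobhistory, children = set(), set(), set()
    for e in nested:
        employee.add((e["man_no"], e["name"]))
        for j in e["jobhistory"]:
            jobhistory.add((e["man_no"], j["jobdate"], j["title"]))
        for c in e["children"]:
            children.add((e["man_no"], c["childname"], c["birthyear"]))
    return employee, jobhistory, children

employee, jobhistory, children = normalize(employees)
print(sorted(jobhistory))
```

The shared man_no value is what lets the flat relations be recombined later.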

2. Operations & redundancy

2.1. Operations on Relations

Because relations are sets, Codd defines an algebra over them — permutation, projection, join, composition, restriction — whose results are themselves relations. Projection is defined exactly:

Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation …
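As a sketch, projection over a relation held as a Python set of tuples might look like this. The supply relation is hypothetical, and columns are chosen by 0-based position, which is an assumption of this example rather than the paper's notation.

```python
def project(relation, columns):
    """Keep the chosen column positions (0-based); the set removes duplicate rows."""
    return {tuple(row[i] for i in columns) for row in relation}

# Hypothetical relation: supplier, part, city.
supply = {
    ("s1", "nut", "Paris"),
    ("s1", "bolt", "Paris"),
    ("s2", "nut", "London"),
}

supplier_city = project(supply, [0, 2])
print(len(supply), len(supplier_city))
```

Two of the three rows agree on supplier and city, so striking out the part column leaves a duplicate that the set drops.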

The natural join R*S combines two relations on a shared domain; composition is then defined from it, as a join with the shared domain projected away. With these operations Codd attacks the rest of the paper's agenda, defining redundancy among the stored relations and the consistency conditions a system must maintain, entirely in terms of the relations themselves, never their storage.
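A minimal sketch of natural join and composition, restricted to binary relations held as Python sets of pairs. The two sample relations are hypothetical. Here composition is computed as the natural join with the shared middle domain projected away.

```python
def natural_join(r, s):
    """Join binary relations r(a, b) and s(b, c) on the shared middle domain."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def compose(r, s):
    """Composition: the natural join with the shared domain projected away."""
    return {(a, c) for (a, _, c) in natural_join(r, s)}

# Hypothetical relations: (supplier, part) and (part, project).
supplies = {("s1", "nut"), ("s2", "nut"), ("s2", "bolt")}
used_in = {("nut", "engine"), ("bolt", "engine"), ("bolt", "frame")}

joined = natural_join(supplies, used_in)
composed = compose(supplies, used_in)
print(len(joined), len(composed))
```

Both results are again sets of tuples, so they can be fed into further operations.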