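Databases 1970

A Relational Model of Data for Large Shared Data Banks

Edgar F. Codd (IBM)

Store data as relational tables, so that no program ever again has to care how the bytes are laid out.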

In depth · the introduction

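Before 1970, asking a computer for data meant knowing exactly where it lay. Codd's idea was radical yet simple: store everything as plain tables, then just describe what you want.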

The core idea

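A database is a place to keep large amounts of organized information: customers, orders, flights, accounts. In the 1960s, getting an answer out of one meant following a rigid trail of pointers that the designer had laid down in advance. Programs had to know the physical path to every piece of data, so any change to how the data was stored could break them.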

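Edgar Codd, a mathematician at IBM, proposed something far simpler. Store all data as plain tables, which he called relations, where each row is a record and each column is a kind of fact. Then let people ask for data by describing it ("all suppliers in Paris") rather than by telling the machine how to fetch it. The "how" is left to the computer. This separation of "what you want" from "where it lives" is called data independence, and it changed everything.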

How it came about

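Codd published the idea in a research journal in 1970 and at first met resistance, even inside IBM, because the company was then heavily invested in an older system. Many engineers simply did not believe that a database built on mathematical tables could be fast enough to be practical. The disagreement came to a head in 1974, in a famous public debate between Codd and Charles Bachman, the standard-bearer of the older "network" approach.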

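What settled it was not debate but working software. Two teams turned the theory into real systems: IBM's own System R project in San Jose, which produced the query language SQL, and the Ingres project at Berkeley. They showed that a relational database could be both elegant and fast. By the 1980s the relational model had won; in 1981 Codd received the Turing Award, the highest honor in computing.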

Why it matters

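Almost every database you have ever dealt with indirectly (your bank, your airline booking, an online shop, a hospital's medical records) is relational, a descendant of this paper. SQL, the language used to talk to them, became one of the most widely used languages in the world. By letting people ask for information by its meaning rather than its location, Codd turned data into something you can reason about, combine, and trust without being a storage expert.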

A picture to hold in mind

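Imagine an old library with no catalog: to find a book, you have to know exactly which shelf it sits on, and once the librarian rearranges the shelves, your directions are worthless. Codd's model is that catalog. You describe the book you want (author, subject) and the system finds it, wherever it has been shelved. Rearranging the shelves for efficiency no longer breaks anything, because you never depended on the location in the first place.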

Where it sits

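Codd took a tool from pure mathematics, the idea of a relation as a set of tuples, drawn from the same set theory that underpins much of logic, and aimed it at a mundane, practical problem: how to store a company's records. This collection holds the threads on both sides: Shannon and Turing laid the theories of information and computation on which it rests, and the descendants of Codd's tables now hold the data from which today's AI, from the Transformer onward, learns.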

The original document

Original source text

Abstract and introduction

E. F. Codd · Communications of the ACM 13, no. 6 (June 1970): 377–387 · IBM Research Laboratory, San Jose, California

Abstract

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).

The abstract argues that activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed, and even when some aspects of the external representation are changed — because such changes are inevitable as query, update, and report traffic and the stored data both grow.

A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced.

1.1. Introduction

Codd notes that existing formatted-data systems give users tree-structured files or slightly more general network models, and then names the target of the paper precisely:

… the problems treated here are those of data independence — the independence of application programs and terminal activities from growth in data types and changes in data representation …

1.2. Data Dependencies in Present Systems

Section 1.2 catalogues the ways present systems force a program to depend on storage decisions it should not have to know about. Codd separates three: ordering dependence (the program assumes records sit in a particular order), indexing dependence (the program must know which indices exist), and access path dependence (the program must navigate a fixed hierarchy or network of pointers to reach the data).

Each dependence means that a change made for performance — adding an index, re-ordering a file, regrouping records — can break working application logic. The relational model is offered as the cure: a level of description above all of these choices.
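The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the supplier rows and both function names are hypothetical and do not come from the paper. The first function bakes in an assumption about record order (ordering dependence); the second describes the rows it wants by content.

```python
# Hypothetical supplier records (id, name, city); illustrative data only.
suppliers = [
    ("S1", "Smith", "London"),
    ("S2", "Jones", "Paris"),
    ("S3", "Blake", "Paris"),
]

def london_supplier_by_position(records):
    """Ordering-dependent: assumes the London supplier is stored first."""
    return records[0]

def suppliers_in(records, city):
    """Content-based: selects rows by what they say, not where they sit."""
    return {row for row in records if row[2] == city}

# A storage change made for performance: the file is re-sorted by name.
reordered = sorted(suppliers, key=lambda row: row[1])

position_before = london_supplier_by_position(suppliers)
position_after = london_supplier_by_position(reordered)   # now a Paris supplier
content_before = suppliers_in(suppliers, "Paris")
content_after = suppliers_in(reordered, "Paris")           # unchanged
```

After the re-sort, the positional lookup silently returns a different supplier, while the content-based query returns the same set of rows.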

[ … ]

1.3. A Relational View of Data

The term relation is used here in its accepted mathematical sense.

Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on; equivalently, R is a subset of the Cartesian product S1 × S2 × … × Sn. Each Sj is called the jth domain of R.
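As a minimal sketch of this definition, the Python below models two small domains and checks that a candidate set of tuples is a relation on them. The domain contents and the helper names `is_relation_on` and `degree` are assumptions made for illustration; none of them appear in the paper.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
supplier_ids = {"S1", "S2"}
cities = {"London", "Paris"}

# The Cartesian product is every possible (supplier_id, city) pairing.
cartesian = set(product(supplier_ids, cities))

# A relation on these two domains is any set of 2-tuples drawn from the product.
R = {("S1", "London"), ("S2", "Paris")}

def is_relation_on(candidate, *domains):
    """True if every tuple has one element per domain, each taken from that domain."""
    return all(
        len(t) == len(domains) and all(v in d for v, d in zip(t, domains))
        for t in candidate
    )

def degree(*domains):
    """The degree of a relation is the number of domains it is defined on."""
    return len(domains)
```

Here `R <= cartesian` holds, which is the "subset of the Cartesian product" form of the definition, and `degree(supplier_ids, cities)` is 2, so `R` is a binary relation.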

R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.

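Codd lists five properties of an array that represents an n-ary relation: (1) each row represents an n-tuple; (2) the ordering of rows is immaterial; (3) all rows are distinct; (4) the ordering of columns is significant, since it corresponds to the ordering of the domains; (5) the significance of each column is partially conveyed by labeling it with the name of the corresponding domain. Properties (2) and (3) in particular separate a relation from an ordinary file, where record order and repeated records can carry meaning.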
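Properties (2) to (4) fall out naturally if a relation is modeled as a Python set of tuples. The sketch below uses hypothetical rows and is only one possible encoding, not something the paper prescribes.

```python
# Hypothetical rows (supplier_id, city), for illustration only.
rows_as_loaded = [("S1", "London"), ("S2", "Paris"), ("S1", "London")]
rows_reordered = [("S2", "Paris"), ("S1", "London")]

# A set ignores row order and keeps each row once.
relation_a = set(rows_as_loaded)
relation_b = set(rows_reordered)

same_relation = relation_a == relation_b   # order and repeats make no difference
row_count = len(relation_a)                # the repeated row is stored once

# Column order is significant: swapping the columns gives different tuples.
swapped = {(city, sid) for sid, city in relation_a}
columns_matter = swapped != relation_a
```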

The totality of data in a data bank may be viewed as a collection of time-varying relations.

1.4. Normal Form

A relation all of whose domains are simple (atomic) can be stored as a flat, two-dimensional, column-homogeneous array. A relation with a nonsimple domain — a relation nested inside another relation — needs a more complicated structure. Codd shows the nesting can always be removed:

There is, in fact, a very simple elimination procedure, which we shall call normalization.

Worked through his employee / job-history / children example, normalization replaces one nested relation by several flat relations linked through shared domain values. This single paragraph is the seed of the normal-form theory (1NF, 2NF, 3NF, and onward) that Codd and others would develop over the next several years.
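A simplified sketch of that elimination procedure follows, in Python. It assumes a cut-down version of the employee example (the salary-history level is omitted), and the names, dates, and the `normalize` function are hypothetical. The key step is that the parent's identifying value is copied into every row of each formerly nested relation.

```python
# Hypothetical nested employee records; jobhistory and children are nonsimple domains.
nested_employees = [
    {
        "man_no": 1,
        "name": "Ada",
        "jobhistory": [("1965-01", "Clerk"), ("1968-06", "Analyst")],
        "children": [("Kim", 1966)],
    },
    {
        "man_no": 2,
        "name": "Ben",
        "jobhistory": [("1967-03", "Engineer")],
        "children": [],
    },
]

def normalize(employees):
    """Replace one nested relation with three flat ones linked by man_no."""
    employee, jobhistory, children = set(), set(), set()
    for e in employees:
        employee.add((e["man_no"], e["name"]))
        for jobdate, title in e["jobhistory"]:
            # The parent's key is propagated into each row of the child relation.
            jobhistory.add((e["man_no"], jobdate, title))
        for childname, birthyear in e["children"]:
            children.add((e["man_no"], childname, birthyear))
    return employee, jobhistory, children

employee, jobhistory, children = normalize(nested_employees)
```

Every resulting relation has only simple (atomic) values, and the shared `man_no` values are what link the flat relations back together.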

2. Relational Operations and Redundancy

2.1. Operations on Relations

Because relations are sets, Codd defines an algebra over them — permutation, projection, join, composition, restriction — whose results are themselves relations. Projection is defined exactly:

Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation …
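Projection as described here can be sketched in Python, under the assumption that a relation is held as a set of tuples. The `project` helper, its 0-based column positions, and the sample data are illustrative choices, not the paper's notation.

```python
def project(relation, columns):
    """Keep only the listed column positions; the set drops duplicate rows."""
    return {tuple(row[i] for i in columns) for row in relation}

# Hypothetical supply relation: (supplier, part, project, quantity).
supply = {
    ("S1", "P1", "J1", 5),
    ("S1", "P1", "J2", 7),
    ("S2", "P3", "J1", 2),
}

# Strike out the project and quantity columns.
supplier_part = project(supply, [0, 1])
```

The two rows for supplier S1 and part P1 collapse into one, so the result has two rows and is itself a relation.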

The natural join R*S combines two relations on a shared domain; composition is derived from a join by projecting the shared domain away. With these operations Codd attacks the rest of the paper's agenda, defining redundancy among the stored relations and the consistency conditions a system must maintain, entirely in terms of the relations themselves, never their storage.
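For two binary relations, the natural join and the composition built from it can be sketched as below. This is a minimal Python illustration with hypothetical data; the function names `natural_join` and `compose` are assumptions, and only the binary case is covered.

```python
def natural_join(r, s):
    """Join binary relations r(a, b) and s(b, c) on their shared middle domain."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def compose(r, s):
    """Natural composition: the natural join with the shared column projected away."""
    return {(a, c) for (a, _, c) in natural_join(r, s)}

# Hypothetical data: which supplier supplies which part,
# and which part is used in which project.
supplies = {("S1", "P1"), ("S2", "P1"), ("S2", "P2")}
used_in = {("P1", "J1"), ("P2", "J1"), ("P2", "J2")}

joined = natural_join(supplies, used_in)
composed = compose(supplies, used_in)
```

Both results are again sets of tuples, so they can be fed straight into further operations. Note that composition loses information: `composed` says S2 is connected to J1 but no longer says through which part.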