Databases 1970

A Relational Model of Data for Large Shared Data Banks

Edgar F. Codd (IBM)

Store data as relational tables, so that no program ever again has to care how the bits are laid out.

In depth · the introduction

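Before 1970, asking a computer for data meant knowing exactly where that data lived. Codd's idea was radical yet simple: store everything as ordinary tables, then just describe what you want.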

The core idea

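A database is a place to keep vast amounts of organized information: customers, orders, flights, accounts. In the 1960s, getting an answer out of one meant following a rigid trail of pointers that the designer had laid down in advance. Programs had to know the physical path to every piece of data, so any change to how the data was stored could break them.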

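Edgar Codd, a mathematician at IBM, proposed something far simpler. Store all data as ordinary tables, which he called relations, where each row is one record and each column is one kind of fact. Then let people ask for data by describing it ("all suppliers in Paris") rather than by telling the machine how to fetch it. The "how" is left for the computer to work out. This separation of "what you want" from "where it lives" is called data independence, and it changed everything.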

How it came about

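Codd published the idea in a research journal in 1970 and at first met resistance, even inside IBM, because the company was then betting heavily on an older system. Many engineers simply did not believe that a database built on mathematical tables could be fast enough to be practical. The disagreement came to a head in 1974, in a famous public debate between Codd and Charles Bachman, the standard-bearer of the older "network" approach.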

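What settled it was not debate but working software. Two teams turned the theory into real systems: IBM's own System R project in San Jose, which produced the query language SQL, and the Ingres project at Berkeley. They showed that a relational database could be both elegant and fast. By the 1980s the relational model had won; in 1981 Codd received the Turing Award, the highest honor in computing.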

Why it matters

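Almost every database you have ever dealt with indirectly (your bank, your flight booking, an online shop, a hospital's medical records) is relational, a descendant of this paper. SQL, the language used to talk to them, became one of the most widely used languages in the world. By letting people ask for information by its meaning rather than its location, Codd turned data into something you can reason about, combine, and trust without being a storage expert.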

A picture to hold in mind

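Picture an old library with no catalog: to find a book you have to know exactly which shelf it sits on, and the moment a librarian rearranges the shelves, your directions are worthless. Codd's model is the catalog. You describe the book you want (author, subject) and the system finds it, wherever it has been put. Reshelving for efficiency no longer breaks anything, because you never depended on the book's position in the first place.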

Where it fits

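Codd took a tool from pure mathematics, the idea of a relation as a set of tuples, drawn from the same set theory that underpins much of logic, and aimed it at a mundane, practical problem: how to store a company's records. This museum holds the threads on both sides: Shannon and Turing laid down the theories of information and computation that the model rests on, and the descendants of Codd's tables now hold the data from which today's AI, from the Transformer onward, learns.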

The original document

Original source text

Abstract and introduction

E. F. Codd · Communications of the ACM 13, no. 6 (June 1970): 377–387 · IBM Research Laboratory, San Jose, California

Abstract

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).

The abstract argues that activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed, and even when some aspects of the external representation are changed — because such changes are inevitable as query, update, and report traffic and the stored data both grow.

A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced.

1.1. Introduction

Codd notes that existing formatted-data systems give users tree-structured files or slightly more general network models, and then names the target of the paper precisely:

… the problems treated here are those of data independence — the independence of application programs and terminal activities from growth in data types and changes in data representation …

1.2. Data Dependencies in Present Systems

Section 1.2 catalogues the ways present systems force a program to depend on storage decisions it should not have to know about. Codd separates three: ordering dependence (the program assumes records sit in a particular order), indexing dependence (the program must know which indices exist), and access path dependence (the program must navigate a fixed hierarchy or network of pointers to reach the data).

Each dependence means that a change made for performance — adding an index, re-ordering a file, regrouping records — can break working application logic. The relational model is offered as the cure: a level of description above all of these choices.
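As a rough illustration of ordering dependence, the sketch below contrasts a lookup that silently assumes a sort order with one that states only the condition it wants. The supplier records, field names, and function names are hypothetical and do not come from the paper; the sketch is only meant to show how a re-ordering done for performance can break the first style and leave the second untouched.

```python
# Hypothetical supplier records; names and values are illustrative only.
suppliers = [
    {"sno": 1, "name": "Jones", "city": "Paris"},
    {"sno": 2, "name": "Smith", "city": "Rome"},
    {"sno": 3, "name": "Blake", "city": "Paris"},
    {"sno": 4, "name": "Clark", "city": "London"},
]


def paris_ordering_dependent(records):
    """Assumes the file is sorted by city, so the scan can stop early."""
    found = []
    for rec in records:
        if rec["city"] > "Paris":
            break  # only correct if records really are sorted by city
        if rec["city"] == "Paris":
            found.append(rec["name"])
    return found


def paris_order_independent(records):
    """States only the condition; assumes nothing about record order."""
    return sorted(rec["name"] for rec in records if rec["city"] == "Paris")


# Two storage orders for the same data.
by_city = sorted(suppliers, key=lambda r: r["city"])
by_sno = sorted(suppliers, key=lambda r: r["sno"])

print(paris_ordering_dependent(by_city))  # finds both Paris suppliers
print(paris_ordering_dependent(by_sno))   # stops at Rome and misses Blake
print(paris_order_independent(by_city) == paris_order_independent(by_sno))
```

The second function gives the same answer under either storage order, which is the property the paper calls data independence.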

[ … ]

1.3. A Relational View of Data

The term relation is used here in its accepted mathematical sense.

Given sets S1, S2, …, Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on; equivalently, R is a subset of the Cartesian product S1 × S2 × … × Sn. Each Sj is called the jth domain of R.

R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.

Codd lists five properties of an array that represents an n-ary relation: (1) each row is an n-tuple; (2) the ordering of rows is immaterial; (3) all rows are distinct; (4) the ordering of columns is significant and matches the ordering of the domains; (5) the significance of each column is partially conveyed by labeling it with the name of its domain. These properties are what distinguish a relation from an ordinary file.
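The definition above maps fairly directly onto Python sets of tuples. The following is a minimal sketch under that assumption; the domains `S1`, `S2`, `S3`, the `supply` relation, and the helper names are invented for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
S1 = {1, 2, 3}          # supplier numbers
S2 = {"bolt", "nut"}    # part names
S3 = {5, 17}            # quantities

# A relation on S1, S2, S3: a set of 3-tuples.
supply = {(1, "bolt", 5), (2, "nut", 17), (3, "bolt", 17)}


def is_relation_on(r, *domains):
    """True if r is a subset of the Cartesian product of the domains."""
    return r <= set(product(*domains))


def degree(r):
    """Common length of the tuples in r (r is assumed non-empty)."""
    lengths = {len(t) for t in r}
    if len(lengths) != 1:
        raise ValueError("tuples of differing length do not form a relation")
    return lengths.pop()


print(is_relation_on(supply, S1, S2, S3))  # True
print(degree(supply))                      # 3

# Row order is immaterial and rows are distinct: both follow from set semantics.
print(set(list(supply)[::-1]) == supply)             # True
print(len({(1, "bolt", 5), (1, "bolt", 5)}))         # 1
```

Using a `set` gives properties (2) and (3) for free, while the position of each value inside a tuple carries the column ordering of property (4).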

The totality of data in a data bank may be viewed as a collection of time-varying relations.

1.4. Normal Form

A relation all of whose domains are simple (atomic) can be stored as a flat, two-dimensional, column-homogeneous array. A relation with a nonsimple domain — a relation nested inside another relation — needs a more complicated structure. Codd shows the nesting can always be removed:

There is, in fact, a very simple elimination procedure, which we shall call normalization.

Worked through his employee / job-history / children example, normalization replaces one nested relation by several flat relations linked through shared domain values. This single paragraph is the seed of the normal-form theory (1NF, 2NF, 3NF, and onward) that Codd and others would develop over the next several years.
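The elimination step can be sketched in a few lines. The version below is a simplified take on the employee example: it keeps only the job-history and children nesting, and every name and value in it (`man_no`, `Ada`, the dates) is hypothetical. It shows one plausible way to do the flattening, not the paper's own procedure verbatim: each nested relation becomes its own flat relation, with the parent's key copied into every tuple.

```python
# Unnormalized form: each employee carries two relation-valued components.
employee_nested = [
    {
        "man_no": 1,
        "name": "Ada",
        "jobhistory": [("1965-01", "clerk"), ("1968-06", "analyst")],
        "children": [("Ben", 1966)],
    },
    {
        "man_no": 2,
        "name": "Carl",
        "jobhistory": [("1967-03", "engineer")],
        "children": [],
    },
]


def normalize(nested):
    """Replace one nested relation by three flat ones sharing the man_no key."""
    employee, jobhistory, children = set(), set(), set()
    for emp in nested:
        key = emp["man_no"]
        employee.add((key, emp["name"]))
        # Propagate the parent's key into each tuple of the nested relations.
        jobhistory.update((key, date, title) for date, title in emp["jobhistory"])
        children.update((key, name, year) for name, year in emp["children"])
    return employee, jobhistory, children


employee, jobhistory, children = normalize(employee_nested)
print(sorted(employee))
print(sorted(jobhistory))
print(sorted(children))
```

Every value in the three resulting relations is atomic, and the shared `man_no` values are what link the flat relations back together.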

2. Relational Operations and Redundancy

2.1. Operations on Relations

Because relations are sets, Codd defines an algebra over them — permutation, projection, join, composition, restriction — whose results are themselves relations. Projection is defined exactly:

Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation …
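Projection as described in the quotation can be sketched over sets of tuples. The function name, the 0-based column positions, and the sample data are assumptions made for illustration.

```python
def project(r, columns):
    """Keep the listed column positions (0-based), in the order given.

    Building a set removes any duplicate rows left after columns are struck out.
    """
    return {tuple(row[i] for i in columns) for row in r}


# Hypothetical relation: (supplier, part, quantity).
supply = {(1, "bolt", 5), (2, "bolt", 5), (2, "nut", 7)}

# Striking out the supplier column leaves two identical rows, which collapse.
print(sorted(project(supply, [1, 2])))  # [('bolt', 5), ('nut', 7)]

# Columns may also be listed in a different order.
print(sorted(project(supply, [1, 0])))  # [('bolt', 1), ('bolt', 2), ('nut', 2)]
```

The result is again a set of tuples, so it can be fed to any other relational operation.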

The natural join R*S combines two relations on a shared domain; composition is then defined as a projection of a join. With these operations Codd attacks the rest of the paper's agenda, defining redundancy among the stored relations and the consistency conditions a system must maintain, entirely in terms of the relations themselves and never their storage.
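For binary relations, the natural join and the composition built from it can be sketched as follows. The function names and the small supplier/part/project data set are illustrative assumptions, and the sketch handles only the binary case, where the second column of R and the first column of S share a domain.

```python
def natural_join(r, s):
    """R*S for binary relations: triples (a, b, c) with (a, b) in R and (b, c) in S."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def compose(r, s):
    """Composition as a projection of the join onto its first and last columns."""
    return {(a, c) for (a, _b, c) in natural_join(r, s)}


# Hypothetical data: R pairs suppliers with parts, S pairs parts with projects.
R = {(1, 1), (2, 1), (2, 2)}
S = {(1, 1), (1, 2), (2, 1)}

print(sorted(natural_join(R, S)))
print(sorted(compose(R, S)))
```

Both results are themselves relations, which is what lets redundancy and consistency be stated without any reference to how the tuples are stored.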