A database is a collection of data with a formal structure. In order to exploit such a structure, some piece of specialized application software is needed, which is called a Database Management System (DBMS). In the following, two types of database will be discussed:
- standard database
- free field structure database.
We will first consider the standard case of a database and then turn to a complication. A standard database is a file that consists of tables with the following structure:
|column A||column B||column C|
|record 1||field A||field B||field C|
Here is an example of a database that consists of only one such table:
|last name||first name||phone number|
A table consists of records. A record may be conceived as an individual set of data which constitutes an entry of the database. In the standard case, all the records of a table have the same structure, which is determined by the columns. In the example, each record of the telephone directory consists of three fields containing the last name, first name and phone number, respectively. (The row headers ‘record 1, record 2, ...’ here serve expository purposes only and do not belong to the actual table.)
Naturally, a field may remain empty in a given record:
|last name||first name||phone number|
What is not possible, in this model, is for the records of one table to differ in field structure, e.g. in order to take care of Jane Doe's cell phone number:
|last name||first name||phone number|
Such situations are, however, frequent in real-life data, and they do present a problem for databases. There are two ways out:
- Maintain the fixed field structure. Then all the fields that may be needed for any one record have to be added to the table, as an additional column, and may then remain empty for most of the records.
- Renounce to the rigid field structure and work with a database with free field structure.
In our example, solution #1 requires the addition of another column ‘cell phone number’ to the table.
Databases with free field structure
Solution #2 is also a database that consists of records which consist of fields. However, the records differ in their field structure. This has two consequences:
- The records with their fields do not fit into a table.
- The category of each field has to be identified in each record.
The database with the phone directory might then have the following form:
|cell phone number||0172/6213487|
This database consists (again) of three records. This time, the text of the grey cells – the field names – must physically stand in the records, while in the tables with rigid field structure, the names are specified only once for each column. In this sense, a free-field-structure database is like a text file that contains structural information by some markup language.
Alternatively, the record for Jane Doe might have either of the following field structures:
|phone number||0242/235689; 0172/6213487|
In a database with free field structure, a given field may recur in a record as often as is needed. This solution is (not necessarily, but reasonably) accompanied by a stipulation that each instance of a certain field in a record contains only one item of the kind of data in question.
Alternatively, items of information of one category may be sequenced in one field, separated by a suitable punctuation mark like the semicolon or the ‘|’ sign. Choice between the two solutions essentially depends on the capacities of the DBMS: It must be able to sort the records by the content of the field in question. That will generally be easier for the first rather than the second solution. Even then the DBMS must be able to identify non-first occurrences of a field in a record.
Another aspect of field structure that is flexible in this kind of DBMS is the sequential order of the fields in a record. However, it is neither necessary nor advisable to make use of that freedom.
In linguistic work, the kind of data situation illustrated here for a telephone directory arises all the time. Examples from a lexical database are multiple senses of the lemma, sets of examples for the lemma etc.
The algorithmic handling of a database with rigid field structure differs enormously from the handling of a database with free field structure. In the former, the DBMS only needs to inspect the set of column headers of a table and then knows the field structure of each record. In the latter, the DBMS has to read the individual record, identify in it the field names and create an array of them. The former system is, in effect, much simpler and more efficient. On the other hand, databases with free field structure have the advantage of giving the user full freedom in introducing a new field in a certain record that needs it or deleting a field from a record for which it is irrelevant without having to modify the field structure of the entire table. However, science strives for principled, orderly solutions to methodological problems. Consequently, a database with a free field structure may be seen as a pre-terminal step in the process of developing an adequate formal representation for a set of data. However, in empirical linguistic work, the pre-terminal phase sometimes lasts a couple of decades.
There are, to summarize, two kinds of databases:
|record||table-row displayed separately||basic unit|
|table||basic unit||listing of a set of fields shared by records|
So far, we have seen fixed-field-structure databases that consist of only one table. However, this kind of database reveals its full capacity when information is distributed over several tables that are interrelated. Consider a database containing information on whom I have presented which book. Then there is a set P of persons that received book presents from me, and a set B of books copies of which I presented to various subsets of P. This means that there is a many-to-many relation between the elements of P and the elements of B: Each person may have gotten one or more books, and each book may have gone to one or more persons.
In such a situation, the information cannot be stored in one table. Consider the table of persons introduced before. We might add a column for ‘books presented’ and fill into the field the titles of all the books the person received, separated by some punctuation mark as above. That would be possible, but unwise. Although in this way, my database can answer the question ‘which books did I present to this person?’ by showing the content of the additional field, it cannot, or only by clumsy workarounds, answer the question ‘which persons did I give this book to?’, because this book is not represented individually in the database. Again, I have to input each book title into all the records of the persons who received the book. This is not only expendious. If I am inconsistent, inputting abbreviations or errors, then the DBMS will be unable to identify it as the same book for different recipients.
This is the typical data situation for a relational database. It contains a set of tables each of which is devoted to one category of data. The relations between data of one table with data of another table are stored in a third table.
|Table P||Cross-table||Table B|
|001||Smith||John||001||002||001||Shakespeare||As you like it|
The example set of data says that John Smith received the books by Shaw and Chaucer, Jane Doe received the books by Shakespeare and Shaw, and Samantha got no books. And vice versa, the Shakespeare only went to Jane Doe, the Chaucer went only to John Smith, while the Shaw went both to John and to Jane. This relational information is mediated by the record IDs. Thus, the data of one record are never literally repeated in another record. Instead, the other record only contains the ID of the former record, which “points” to it. It is, thus, via the cross-table that the two data tables are related to each other. Technically, the relation between the tables is set up by telling the DBMS that the column ‘person_ID’ of Table P is to be projected onto the column ‘person_ID’ of the cross-table (their bearing the same name is immaterial); and analogously for the two columns named ‘book_ID’.
IDs of records are omnipresent in a relational database. They do not need to be numbers; but numbers are easiest to handle automatically. However, the user of the database is not shown the record IDs. They are only employed by the DBMS to associate the content of some field of a particular record; and only that is what the system displays (see the word-class example below). To achieve this may require some programming on the part of the administrator. But modern DBMSs know that that is the function of IDs and support him in that task. For instance, he may devise a screen form or a report that shows, for every person, which books he received. It more or less looks as follows:
|first name||last name||author||book|
|Jane||Doe||Shakespeare||As you like it|
The important thing to note for the novice in relational databases is that, although precisely this display may be the purpose of the whole enterprise, this is not the structure with which the data are stored in the system; they are stored in the three tables described above. The present table is only one of the ways of visualizing the data.
This model is directly applicable to a large set of data situations in linguistics. For instance, a lexical database contains a table whose records are the lexical entries. There is another table that contains example sentences. Each lexical entry is linked to one or more example sentences, and conversely each sentence may serve as an example in one or more lexical entries. It is highly advisable to store the examples in a separate table, because there they may be maintained and modified in a consistent fashion. If they were literally contained in the lexicon table, then one would be liable to change a certain corpus example in one lexical entry while forgetting that the same example also figures in another entry.
It is also possible to relate the records of one table to each other. Consider again the table whose records are lexical entries. Here is a section of it:
There are hyponymy relations among them, such that entry #31 is a hyponym to entry #428, which latter in turn is a hyponym to entry #2345. This would be represented by a cross-table of the following form:
Here, the column ‘hyp_ID’ of the hyponymy table is mapped onto the column ‘lemma_ID’ of the lexical-entry table, and the column ‘hyper_ID’ of the hyponymy table is mapped onto the same column of the other table. It depends on the capacities of the user interface whether a simple list of pairs ‘hyponym – hyperonym’ is output from this set of tables or whether a neat tree representing the taxonomy ‘rose – flower – table’ may be displayed.
A relational database in the sense defined here is of necessity a fixed-field database. A database with free field structure can partly imitate this functionality by hyperlinks. A hyperlink is a distinguished relation of textual identity of a string contained in one file with a string contained in another file. In establishing the link, one has to make sure that the target string occurs only once in the target file, or occurs in a formal context, e.g. a certain field of a record, so that the software can identify the target there. Thus, the record of a lexical entry may contain a field ‘hyperonym’, which may contain a word which is the lemma of another record. The content of the hyperonym field then serves as a hyperlink to the other record.
It is quite common in a database that a certain field can only contain one of a fixed set of values. Consider again the ‘lexical-entry table’ of a lexical database. One of the fields is ‘word class’. Each lexical entry belongs to one word class, and there is only a small set of word classes. In such a situation, the user wants a double functionality:
- He requires support in filling in the field: He wants to be reminded of the possible values, and he does not want to type the word class in every time, but wants to select the relevant word class from a combo list.
- He wants to insure consistency in the database: The conjunction should always be called ‘conjunction’ and not ‘connective’ or ‘conj.’ every once in a while, because otherwise a search that displays the set of all conjunctions, or a sorting of the database by word class, will fail.
Now the lexical-entry table could simply contain a column ‘word class’ which could be filled in freely just like, e.g., the lemma field. Then, however, the above two requirement could not be met. The solution, instead, is as follows: Set up a separate list containing only the word classes. That will be a small list with a dozen or so entries. Then make sure that whenever the field ‘word class’ of the ‘lexical-entry table’ is filled in, the permissible values are drawn from the word-class list.
The technical solution of this idea is different for the two database systems:
- In the relational database, there is a separate table whose records have two fields: a word-class ID and the name of the word class. The lexical-entry table has a column ‘word-class ID’ whose values point to the other table. The user does not see the IDs, but only the word-class names. That is programmed by associating the two columns with each other and defining a screen form which shows the content of the field ‘word-class name’ when, in fact, the neighboring field ‘word-class ID’ is being pointed to.
- In the free-field database, one defines a range set for a field that must contain only certain values. In the example, this is the list of word classes. The range set is stored in a text file distinct from the lexical-entry file and invoked whenever the field ‘word class’ of the lexical-entry file is filled in. The content of this field are literally the names of the word classes. The DBMS then checks whether the content of the field is one of the elements of the range set.
Whatever the technical solution, it is reasonable to make use of range sets whenever a database field contains elements of some theoretical vocabulary rather than free data. That applies not only to word classes, but also to morphological categories, to linguistic varieties such as sociolects and registers, word-formation patterns, semantic categories and many more concepts. One can either take such a category with its set of values from some linguistic theory and define the range set when setting up the database. Or one can induce the set from the data, adding and systematizing gradually the terms to be used in a certain field.
Normalization of a database
As long as one inserts free text in a field, one is liable to produce variation. Merida will sometimes be spelt Mérida, sometimes not; Bloomington will sometimes be Bloomington, IN, and so forth. Since such inconsistency hinders optimal exploitation of the database, the following procedure is a common occurrence in the development of databases:
- Separate out a certain category of data which appear literally together with other kinds of data either in one table or in a free-field-structure database T1
- transfer these data into a table T2 of their own while conserving their assignment to records of T1 in the form of pointers to IDs of those records of T1 that the data come from
- reduce multiple occurrences of data in T2 to one, levelling out (manually) variation among the tokens of a type, assigning each type its ID and seeing to it that the relation of each occurrence to an ID of T1 conserved in step 2 does not get lost
- set up a cross-table which maps the IDs of T1 on the IDs of T2 (now finally dispensing with the pointers of steps 2 and 3).
That procedure is called normalization of a database. Once you have normalized the field and created T2 (or the corresponding range set), the system will make you select the value to be filled in from T2 or the range set, and such variation as the above will no longer be possible.
Normalization is supported by modern DBMSs, but remains expendious. It is therefore preferable to invest more energy in initial design decisions, separating data into distinct tables of a relational database or setting up appropriate range sets.
The choice between the two kinds of database
Before one starts filling in the data in a database, one should spend some effort on designing its structure. Designing a relational database is much more complex and abstract that designing a free-field-structure database. Also, precisely because of the rigidity of the former and the flexibility of the latter, inadequate design decisions are more easily remedied in the latter than in the former. If therefore one knows that one's data are most appropriately stored in a database (if they are, for instance, lexical data, then there can be no doubt about it), but one has no experience with relational databases, then it is easier to begin with a free-field-structure database. After some time, this will have passed into a steady state. It is then not too difficult to remodel its structure in such a way that it can be imported into a relational database.
If, however, one knows from start that one will need the functionality of a relational database, or one has large quantities of data, but processing speed is an issue, then one should opt for a relational database. One should, however, be aware that by 2011, there is no relational DBMS designed for linguistic work on the market. There are a couple of things a linguist needs, such as
- defining an alphabet with digraphs and even trigraphs,
- defining a sort order for that alphabet,
- defining loose equivalence between characters (e.g. e ~ é ~ è) in order to succeed in searches for expressions whose orthography one does not quite remember,1
- differentiating all of this according to different languages,
- producing word lists and concordances.
All of this functionality is included in good free-field-structure software like Shoebox/Toolbox, but would have to be programmed for a relational database.
Filling in data
The most important maxim in storing data in a database is:
Never store the same data in more than one place!
For instance, there is a text corpus from which examples in the linguistic description are drawn; and there is a grammar and a lexicon that contain such examples. Never copy the examples literally into the grammar or the lexicon! Holding the same data in more than one place inevitably leads to inconsistency, for instance when one adjusts the orthography or the interlinear gloss.
The correct solution is: hold the data in one place – in the case at hand, store strings of text exclusively in the corpus – and insert links to pieces of those data in all the other places where they are referenced. The literal copy is only needed for one purpose: if the text containing the links is to be displayed or printed. Then it is a trivial problem to replace the links by a copy of the material they refer to.
The second important maxim is:
Normalize what can be normalized!
Suppose there is a field in a database whose content is text, but the set of different strings that can appear in the field is finite, and many records have identical content in that field. For instance, in an address database, the names of towns will recur. That is a situation for a range set, taking the form of a separate table in a relational database. The procedure that reduces the set of all the contents of the field in all the records of the database to such a range set is the normalization described above. Normalization is a step that costs work, but it always pays off in the long run.
1 what they call “phonetic search”, which precisely it is not