Databases tailored for linguistic use allow the user to define the sort order of the records. This order may follow the operator's practical requirements or simply be the sort order usual for dictionaries.
In the macrostructure of most semasiological dictionaries, the entries are ordered according to orthographic criteria. This holds true whether the writing system is alphabetic or uses some other sign inventory. In an alphabetic writing system, the sort order of the entries in the macrostructure is usually alphabetical, whether forward or retrograde. Alphabetic order is, of course, specific to the language using the alphabet. It is not even necessarily the order of the letters in the alphabet. In the German writing system, for example, <ä ö ü ß> are appended at the end of the alphabet; but in the sort order of dictionaries, they are inserted after the letters forming their base, thus <a ä ... o ö ... s ß ... u ü>.
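A language-specific sort order of this kind can be sketched as a lookup table that ranks each letter. The following Python fragment is only an illustration: the names `GERMAN_DICT_ORDER` and `first_level_key` are made up for this example, the alphabet string encodes the dictionary order described above, and every letter is treated as a first-level unit. Characters outside the listed alphabet are simply ranked last, which is an assumption of the sketch.

```python
# Hypothetical dictionary order: <ä ö ü ß> follow their base letters.
GERMAN_DICT_ORDER = "aäbcdefghijklmnoöpqrsßtuüvwxyz"

# Rank of each letter = its position in the order string.
RANK = {ch: i for i, ch in enumerate(GERMAN_DICT_ORDER)}


def first_level_key(word):
    """Map a word to a list of letter ranks (case is ignored here)."""
    # Assumption: unknown characters sort after all listed letters.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


words = ["Zug", "Öl", "Oper", "Ofen"]

# Plain code-point sorting puts <Ö> after <Z>.
print(sorted(words))                       # ['Ofen', 'Oper', 'Zug', 'Öl']

# The custom key inserts <ö> right after <o>.
print(sorted(words, key=first_level_key))  # ['Ofen', 'Oper', 'Öl', 'Zug']
```

The key function compares words letter by letter through their ranks, so the order can be changed by editing the alphabet string alone.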
Diacritics and two-level sort order
After this first criterion, a hierarchical order obtains regarding graphic variants of base letters, like <A> or <ã> or <á> as variants of <a>.
- Such a variant may constitute an independent unit of the alphabet at the same level as its base model. It then has an order position at the first level.
- Otherwise, it may be categorized as a graphic variant of its base model. It then has an order position at the second level.
This alternative may be decided on phonological grounds. For instance:
- If the variants represent different phonemes, they are assigned to the first level. Examples include vowel letters with dieresis (the infamous “umlauts”) in German or the Spanish <ñ>, all of which have a first-level position in the sort order, following their respective base letters.
- If the variants represent the same phoneme, they are assigned to the second level. This concerns majuscules and minuscules in all writing systems using them. It also concerns vowels with the acute accent in Spanish and with the grave accent in Italian.
However, this phonological principle is not always observed. For instance, both in French and in Portuguese, <ç> is a second-level variant of <c> although they represent different phonemes; and the same holds for vowels provided with the tilde in Portuguese.
The two-level sort order is implemented according to the following rule: Given two words W1 and W2 which are homographous except that at position Li, W1 has letter X while W2 has letter Y, then
- if Y is a first-level letter just as X, then W2 follows all words which share with W1 the initial substring from L1 through Li.
- if Y is a second-level variant of X, then W2 follows W1 immediately.
To illustrate:
| If <é> is a first-level letter, the sort order is: | Beko Belo Béko |
| If <é> is a graphic variant of <e>, the sort order is: | Beko Béko Belo |
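The two settings illustrated above can be reproduced with a two-part sort key: a primary key built from base letters and a secondary key that only breaks ties between graphic variants. This is a minimal sketch under stated assumptions, not a reference implementation. The function `make_key` and its parameters `first_level` and `variants` are hypothetical names, and ranking minuscules before majuscules at the second level is a choice made for this example only.

```python
import string


def make_key(first_level, variants):
    """Build a two-level sort key.

    first_level: string listing the first-level letters in order.
    variants:    dict mapping a second-level variant to its base letter.
    """
    rank = {ch: i for i, ch in enumerate(first_level)}

    def key(word):
        primary, secondary = [], []
        for ch in word:
            low = ch.lower()
            base = variants.get(low, low)
            # First level: rank of the base letter.
            primary.append(rank[base])
            # Second level: (is a diacritic variant, is a majuscule).
            secondary.append((low != base, ch != low))
        # Tuples compare primary first; secondary only breaks ties.
        return (primary, secondary)

    return key


words = ["Béko", "Belo", "Beko"]

# <é> as a first-level letter placed after <e>.
key_a = make_key("abcdeéfghijklmnopqrstuvwxyz", {})
print(sorted(words, key=key_a))  # ['Beko', 'Belo', 'Béko']

# <é> as a second-level graphic variant of <e>.
key_b = make_key(string.ascii_lowercase, {"é": "e"})
print(sorted(words, key=key_b))  # ['Beko', 'Béko', 'Belo']
```

Because the secondary key is consulted only when the primary keys are equal, a word with a second-level variant lands immediately after its unmarked counterpart, as the rule requires.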
Digraphs
A digraph like <sh> or <ch> may be treated in the sort order in either of two ways:
- It may be treated as a single letter. It is then an individual member of the alphabet and is assigned its own position – usually following its first letter – both in the alphabet and in the sort order.
- It may be treated as a sequence of its two component letters, its phonological specificity being ignored. Then the sort order applies individually to the first and to the second component.
For the Spanish digraphs <ch> and <ll>, the sort order followed the first principle (digraph as a single letter) for two centuries. The sort order was accordingly carro - cuna - chacal. Since 1994, the sort order has followed the second principle (digraph as a letter sequence), so it is carro - chacal - cuna (as it has always been in German).
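Both treatments of digraphs can be sketched by first splitting a word into sort units and then ranking those units. In the fragment below, `tokenize`, `make_digraph_key`, `OLD_SPANISH` and `NEW_SPANISH` are hypothetical names introduced for illustration. The two alphabets are simplified (accented vowels are left out), so the sketch assumes that the words contain only the listed letters.

```python
def tokenize(word, digraphs):
    """Split a word into sort units, matching digraphs greedily."""
    units, i = [], 0
    w = word.lower()
    while i < len(w):
        if w[i:i + 2] in digraphs:
            units.append(w[i:i + 2])
            i += 2
        else:
            units.append(w[i])
            i += 1
    return units


def make_digraph_key(alphabet):
    """Build a sort key from a list of units (letters or digraphs)."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    digraphs = {unit for unit in alphabet if len(unit) == 2}
    return lambda word: [rank[u] for u in tokenize(word, digraphs)]


# Simplified alphabets (assumption: no accented vowels).
# Digraphs as single letters following their first component:
OLD_SPANISH = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mn") + ["ñ"] + list("opqrstuvwxyz"))
# Digraphs as plain letter sequences:
NEW_SPANISH = list("abcdefghijklmn") + ["ñ"] + list("opqrstuvwxyz")

words = ["chacal", "cuna", "carro"]

print(sorted(words, key=make_digraph_key(OLD_SPANISH)))
# ['carro', 'cuna', 'chacal']
print(sorted(words, key=make_digraph_key(NEW_SPANISH)))
# ['carro', 'chacal', 'cuna']
```

Switching between the two treatments only requires a different alphabet list; the tokenizer and the key function stay the same.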