A lemma (from the Greek noun lẽmma ‘topic, headword’) is a dictionary entry. Both of these terms are ambiguous in the same way, as they mean
We will stipulate here that lemma has meaning #1, while entry has meaning #2. Thus, a dictionary entry consists of two parts:
The Latin equivalent to Greek lemma is vox ‘expression, word’. It survives in the expression sub voce (‘under the headword’, abbr. s.v.), which is used (instead of page numbers) in bibliographical references to lexicon articles (e.g. “see Encyclopaedia Britannica s.v. lexicography”).
Lemmatization is the process which creates the set of lemmas of a lexical database. It is conceived as starting from text-words found in a corpus and leading to lemmas heading dictionary entries. Lemmatization proper is the first step of a process comprising two steps:
In many specialized dictionaries, e.g. in a dictionary of French cuisine, the lemmas are stipulated in a discretionary way. Here we will concentrate on the general dictionary and the linguistic principles of lemmatization.
The lowest level of any variation is the type-token relation. The linguistic units occurring in texts are tokens of types. Tokens that are “identical” by criteria that an automaton can apply are subsumed under one type. By these criteria, this web page contains a number of tokens of the type ‘encyclopedia’. Moreover, this page contains a token of the type ‘Encyclopaedia’ [excluding this attempt to represent the type], which differs from the former type in two letters. This kind of variation is frequent in texts. In the present case, one wants to assign both word forms to one lemma. This shows that while the type-token relation is basic to corpus analysis, the relation of a text-word to a dictionary lemma is not a type-token relation; it is more abstract.
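The difference between the type-token relation and the text-word-to-lemma relation can be illustrated with a minimal Python sketch. It assumes that "identical by criteria that an automaton can apply" means exact string identity; the sample text, the `VARIANT_TO_LEMMA` table and the function names are hypothetical illustrations, not part of any existing tool.

```python
import re
from collections import Counter

def type_counts(text: str) -> Counter:
    """Count tokens per type; two tokens belong to the same type
    only if their character strings are exactly identical."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters
    return Counter(tokens)

# Hypothetical sample text
sample = (
    "An encyclopedia is a reference work; see Encyclopaedia Britannica. "
    "Another encyclopedia may differ."
)
counts = type_counts(sample)
print(counts["encyclopedia"])   # number of tokens of the type 'encyclopedia'
print(counts["Encyclopaedia"])  # a separate type by string identity

# Assigning both types to one lemma requires knowledge beyond string
# identity, here a hand-written (hypothetical) variant table.
VARIANT_TO_LEMMA = {"Encyclopaedia": "encyclopedia"}

def lemma_of(word_type: str) -> str:
    """Return the lemma for a type; unlisted types are their own lemma."""
    return VARIANT_TO_LEMMA.get(word_type, word_type)
```

The counter keeps the two spellings apart, whereas `lemma_of` maps both to the same lemma; this second mapping is what goes beyond the type-token relation.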
word form | lemma |
---|---|
a | a |
all | all |
are | be |
as | as |
C | C |
concordance | concordance |
corpus | corpus |
each | each |
forms | form |
in | in |
information | information |
is | be |
list | list |
occurring | occur |
of | of |
on | on |
saving | save |
taken | take |
text | text |
the | the |
there | there |
tokens | token |
type | type |
types | type |
where | where |
word | word |
Lemmatization comprises the following routine steps:
A word list of text corpus C is a list of all the word forms (taken as types) occurring in C (saving the information of where in C there are tokens of each type). It is the lexicographer's task to lemmatize that word list exhaustively: each word form must be assigned to a lexeme (and, thus, to a lemma). For the first sentence of the current paragraph, the result would be as shown in the table above.
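As a rough illustration, the word list and its lemmatization could be sketched in Python as follows. The lemma assignments are copied from the table; the tokenization rule, the decision to fold case except for the corpus name C, and all function names are assumptions made for this example.

```python
import re
from collections import defaultdict

SENTENCE = (
    "A word list of text corpus C is a list of all the word forms "
    "(taken as types) occurring in C (saving the information of where "
    "in C there are tokens of each type)."
)

# Lemma assignments taken from the table; forms not listed here
# are their own lemma.
FORM_TO_LEMMA = {
    "are": "be", "is": "be", "forms": "form", "occurring": "occur",
    "saving": "save", "taken": "take", "tokens": "token", "types": "type",
}

KEEP_CASE = {"C"}  # assumption: the corpus name C is not case-folded

def word_list(text):
    """Map each word form (type) to the positions of its tokens."""
    positions = defaultdict(list)
    for index, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        form = token if token in KEEP_CASE else token.lower()
        positions[form].append(index)
    return dict(positions)

def lemmatize(forms):
    """Assign every word form to a lemma."""
    return {form: FORM_TO_LEMMA.get(form, form) for form in forms}

wl = word_list(SENTENCE)
for form, lemma in sorted(lemmatize(wl).items(), key=lambda kv: kv[0].lower()):
    print(f"{form:12} {lemma:8} tokens at {wl[form]}")
```

The positions stored with each type are what later allows a concordance to be produced from the same word list.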
The analysis of morphological variation, i.e. the assignment of word forms to lexemes, may be aided by a program for automatic lemmatization. The analysis of syntactic and semantic variation is aided by the production of concordances. Word lists and concordances are generated automatically by Shoebox/Toolbox.
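What a concordance provides can be shown with a generic keyword-in-context sketch in Python. This is not how Shoebox/Toolbox works internally; the function name, the context width and the sample text are assumptions chosen for illustration.

```python
def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: every occurrence of `keyword`
    with up to `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} [{token}] {right}")
    return lines

# Hypothetical sample text, already split into tokens by whitespace
text = ("Thus take on , take up , take down are all listed "
        "under the lemma take")
for line in concordance(text.split(), "take"):
    print(line)
```

Aligning all occurrences of a word form in one column lets the lexicographer compare its syntactic and semantic contexts at a glance.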
Much of the traditional routine of lemmatization depends on the contingent fact that most of the inflectional variation found in most European languages takes place in desinences. It should be borne in mind that there are languages, e.g. the Celtic ones, where much inflectional variation takes place at the left word boundary. Here, automatic lemmatization reaches a higher level of complexity.
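To illustrate why variation at the left word boundary complicates matters, here is a minimal Python sketch using Welsh initial consonant mutation as an example (the choice of Welsh, the partial mutation table and the three-word toy lexicon are assumptions of this illustration). A procedure that only strips endings would never relate gath, nghath and chath to the lemma cath ‘cat’; the sketch instead generates unmutated candidates and checks them against the lexicon.

```python
# Partial reverse mapping for Welsh initial mutations (c, p, t only):
# (mutated onset, radical onset)
DEMUTATIONS = [
    ("ngh", "c"), ("mh", "p"), ("nh", "t"),   # nasal mutation
    ("ch", "c"), ("ph", "p"), ("th", "t"),    # aspirate mutation
    ("g", "c"), ("b", "p"), ("d", "t"),       # soft mutation
]

# Toy lexicon: 'cat', 'head', 'father'
LEXICON = {"cath", "pen", "tad"}

def lemma_candidates(form):
    """Return the form itself plus every form obtained by undoing
    one possible initial mutation."""
    candidates = [form]
    for mutated, radical in DEMUTATIONS:
        if form.startswith(mutated):
            candidates.append(radical + form[len(mutated):])
    return candidates

def lemmatize(form):
    """Return the first candidate found in the lexicon, else None."""
    for candidate in lemma_candidates(form):
        if candidate in LEXICON:
            return candidate
    return None

for form in ["cath", "gath", "nghath", "chath", "mhen", "dad"]:
    print(form, "->", lemmatize(form))
```

A realistic system would need the full mutation table and a way to handle ambiguous candidates, since an unmutated word may coincidentally begin with a mutated-looking onset.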
Looking back at the notion of ‘lexicon’, we see that the lexicon comprises units of all the complexity levels, from the morpheme up to the phrase (phraseologism). All of the following units are stored in the English inventory:
Most dictionaries opt for concentrating on level 3, i.e. what Jane Doe understands by the notion ‘word’. Dictionaries devoted to the linguistic system (rather than encyclopedic dictionaries) will include items of level 2. Depending on their scope, dictionaries will include items of level 4 to some extent. And some of them, like moot point, will hardly serve as a lemma. Finally, units of levels 5 and 6 will never have the status of lemma. Those of level 5 may be included under some relevant word lemma. For instance, beat a dead horse may be lemmatized under horse. Proverbs like the one concerning the piper and the tune will simply be absent from a general dictionary. They may be listed in specialized dictionaries.
Dependent morphemes are neglected or treated unsystematically in general dictionaries. Some like pseudo- and tele- may crop up as lemmas, but others like cran- and -less are not normally granted that status. This appears to be a tradition that is not founded on a sound reason (cf. Antal 1963). Particularly in the case of little-described languages, there will be many users who are not knowledgeable either in the structure of word forms of the language or in the European customs of lexicography. A dictionary is complete and useful only if it comprises a morphemicon (inventory of morphemes). Dependent morphemes are then lemmatized with a hyphen at the bound side.
Complex expressions present problems both for lemmatization and for selection. The latter problem will be treated in the section on selection. The lemmatization problem concerns the appropriate form of the lemma. Consider an example: walking distance itself will be a lemma in a comprehensive dictionary, but what about the phrase within walking distance?
The former solution is abhorred by most dictionaries. However, if one of the words contained in the phrase is chosen as lemma, then the problem of the proper choice of the relevant component arises: will beat a dead horse be lemmatized under beat or under dead or under horse?
Another kind of complex expression that is usually listed under a simple expression is the phrasal verb. Thus, take on, take up, take down are all listed under the lemma take. This kind of microstructure is called nesting. Many of these lexicographic devices are due to the alphabetical arrangement of a general print dictionary and to hypotheses on where the user may try to find an expression. These problems essentially disappear in an electronic dictionary, since sort order does not play a role there. And the fact that take up etc. are related to take in various ways will be brought out by the content of appropriate fields of the respective entries, in this case, trivially, by the shape of the lemmas themselves.
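The contrast between nesting in a print dictionary and flat entries in an electronic dictionary can be sketched in Python as follows. The data structures, the field name `components` and the rule "nest under the first word if it is itself a lemma" are simplifying assumptions of this example.

```python
EXPRESSIONS = ["take", "take on", "take up", "take down", "walking distance"]

def nest(expressions):
    """Print-style microstructure: a multi-word expression is nested
    under its first word if that word is itself a simple lemma."""
    simple = {e for e in expressions if " " not in e}
    entries = {}
    for expr in expressions:
        head = expr.split()[0]
        if " " in expr and head in simple:
            entries.setdefault(head, []).append(expr)  # sub-lemma
        else:
            entries.setdefault(expr, [])               # main lemma
    return entries

def flat_entries(expressions):
    """Electronic-style structure: every expression is its own entry;
    relatedness is recoverable from a field of the entry."""
    return {e: {"lemma": e, "components": e.split()} for e in expressions}

def related_to(word, entries):
    """Find all other entries whose lemma contains `word`."""
    return sorted(
        lemma for lemma, entry in entries.items()
        if word in entry["components"] and lemma != word
    )

print(nest(EXPRESSIONS))
print(related_to("take", flat_entries(EXPRESSIONS)))
```

In the flat structure no decision about the head word is needed: the relation between take and take up falls out of the shape of the lemmas.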
Lemmatization involves reduction of variation of various kinds:
Variation that is rule-governed need not be taken up in the entry list. That is to say, in a dictionary of American English, neither do we need a reference entry colour that refers us to color, nor do we need to list, under the lemma color, the form colour as a variant. Instead, the appropriate main section of the dictionary, in this case the orthography section, will formulate the rule and provide a couple of examples. The same goes for regularly inflected forms and for metaphors that follow productive metaphor schemata.
On the other hand, irregular variants are noted once in every lexicon and twice in a print dictionary:
Lemmatization is a decision in favor of one form of an expression which is considered its (proper) citation form, and against all the other forms which are not. This decision is principled in many cases; in other cases it is more or less discretionary. However, even in the principled cases the dictionary user cannot be expected to know the lemmatization rules by heart, and much less can he be expected to guess the motives behind discretionary decisions. Moreover, many dictionaries, especially bilingual ones, are intended for users with imperfect knowledge of the language in question, who are not expected to know that worse is a form of bad. Therefore, in order not to frustrate the user who searches for a certain expression in the dictionary, reference entries are set up. Such an entry reduces to a (pseudo-)lemma and a reference to the appropriate lemma. For instance:
worse: see bad.
encyclopaedia: see encyclopedia.
No substantive information on the item bad/worse is provided under the reference entry worse; it is all reserved for the entry bad (where, of course, the irregular comparative is mentioned).
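How reference entries could be derived mechanically from the main entries is shown in the following Python sketch. The field names `irregular_forms` and `variants`, the placeholder text for the full entry and the function names are hypothetical.

```python
# Main entries with the forms that a user might look up instead
MAIN_ENTRIES = {
    "bad": {"irregular_forms": ["worse"], "variants": []},
    "encyclopedia": {"irregular_forms": [], "variants": ["encyclopaedia"]},
}

def reference_entries(main_entries):
    """Create one reference entry per irregular form or variant;
    it contains nothing but a pointer to the main lemma."""
    refs = {}
    for lemma, entry in main_entries.items():
        for form in entry["irregular_forms"] + entry["variants"]:
            refs[form] = f"{form}: see {lemma}."
    return refs

def print_dictionary(main_entries):
    """Merge main and reference entries into one alphabetical list."""
    lines = {lemma: f"{lemma}: (full entry)" for lemma in main_entries}
    lines.update(reference_entries(main_entries))
    return [lines[key] for key in sorted(lines)]

for line in print_dictionary(MAIN_ENTRIES):
    print(line)
```

All substantive information stays in the main entry; the reference entries are generated from it and therefore cannot drift out of step with it.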
Reference entries are a concession to the user which counteracts the principles of lemmatization. They are necessary in print dictionaries. In electronic dictionaries, such variants are integrated in the entry of the standard lemma, as follows:
If the user executes a lemma search in an electronic dictionary, the algorithm will search not only the lemma field, but also the fields ‘citation form’, ‘orthographic variants’ and ‘inflection paradigm’. If the user executes a semantic search, the algorithm will search not only the primary sense, but all of the senses of an entry. In that way, the two-step procedure necessary in a print dictionary is accomplished in one step:
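A minimal Python sketch of such a one-step lookup is given below. The field names follow the ones mentioned in the text; the entry contents, the sense glosses and the function names are invented for illustration and do not describe any particular dictionary software.

```python
# Hypothetical entries of an electronic dictionary
ENTRIES = [
    {
        "lemma": "bad",
        "citation_form": "bad",
        "orthographic_variants": [],
        "inflection_paradigm": ["bad", "worse", "worst"],
        "senses": ["of poor quality", "morally wrong"],
    },
    {
        "lemma": "encyclopedia",
        "citation_form": "encyclopedia",
        "orthographic_variants": ["encyclopaedia"],
        "inflection_paradigm": ["encyclopedia", "encyclopedias"],
        "senses": ["comprehensive reference work"],
    },
]

FORM_FIELDS = ("lemma", "citation_form",
               "orthographic_variants", "inflection_paradigm")

def lemma_search(query, entries):
    """Return the lemmas of all entries in which `query` occurs in
    any of the form-related fields."""
    hits = []
    for entry in entries:
        for field in FORM_FIELDS:
            value = entry[field]
            forms = [value] if isinstance(value, str) else value
            if query in forms:
                hits.append(entry["lemma"])
                break  # one hit per entry is enough
    return hits

def semantic_search(query, entries):
    """Return the lemmas of all entries with `query` in any sense,
    not only in the primary one."""
    return [e["lemma"] for e in entries
            if any(query in sense for sense in e["senses"])]

print(lemma_search("worse", ENTRIES))
print(semantic_search("morally", ENTRIES))
```

A search for worse thus leads directly to the entry bad, without the detour through a reference entry.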
Hausmann 1977, ch. 2