Lemma

A lemma (from the Greek noun lẽmma ‘topic, headword’) is a dictionary entry. Both of these terms are ambiguous in the same way, as they mean

1. the headword of a dictionary entry,
2. the dictionary article, i.e. the entire text starting with one headword and ending before the next one.

We will stipulate here that lemma has meaning #1, while entry has meaning #2. Thus, a dictionary entry consists of two parts:

the lemma,
the information on the lemma.

The Latin equivalent to Greek lemma is vox ‘expression, word’. It survives in the expression sub voce (‘under the headword’, abbr. s.v.), which is used (instead of page numbers) in bibliographical references to lexicon articles (e.g. “see Encyclopaedia Britannica s.v. lexicography”).

Lemmatization

Lemmatization is the process which creates the set of lemmas of a lexical database. It is conceived as starting from text-words found in a corpus and leading to lemmas heading dictionary entries. Lemmatization proper is the first step of a process comprising two steps:

1. Word forms are paradigmatically and syntagmatically related to basic forms (representing lexemes), which serve as lemmas. This process is called lemmatization.
2. A subset of the basic forms arrived at in the first step is selected for inclusion in the dictionary. This process is called lemma selection.

In many specialized dictionaries, e.g. in a dictionary of French cuisine, the lemmas are stipulated in a discretionary way. We will here concentrate on the general dictionary and the linguistic principles of lemmatization.

The lowest level of any variation is the type-token relation. The linguistic units occurring in texts are tokens of types. Tokens that are “identical” by criteria that an automaton can apply are subsumed under one type. By these criteria, this web page contains a number of tokens of the type ‘encyclopedia’. Moreover, this page contains a token of the type ‘Encyclopaedia’ [excluding this attempt to represent the type], which differs from the former type in two letters. This kind of variation is frequent in texts. In the present case, one wants to assign both word forms to one lemma. This shows that while the type-token relation is basic to corpus analysis, the relation of a text-word to a dictionary lemma is not a type-token relation; it is more abstract.
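The difference between the type-token relation and the relation of a text-word to its lemma can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text is invented, the tokenizer is deliberately crude, and the table `VARIANTS` is a hypothetical stand-in for the lexicographer's decision to file one type under another.

```python
import re
from collections import Counter

# Hypothetical variant table: maps a type to the lemma it is filed under.
VARIANTS = {"Encyclopaedia": "encyclopedia"}

def tokenize(text):
    """Split a text into word tokens (runs of ASCII letters); a crude rule."""
    return re.findall(r"[A-Za-z]+", text)

def count_types(text):
    """Subsume tokens under types by exact string identity and count them."""
    return Counter(tokenize(text))

def count_lemmas(text):
    """Count tokens per lemma, using the variant table where it applies."""
    return Counter(VARIANTS.get(token, token) for token in tokenize(text))

# Invented sample text, not taken from any corpus.
sample = ("See Encyclopaedia Britannica. An encyclopedia is not a dictionary; "
          "an encyclopedia is broader.")

types = count_types(sample)
lemmas = count_lemmas(sample)
print(types["encyclopedia"], types["Encyclopaedia"], lemmas["encyclopedia"])  # 2 1 3
```

The automaton's criterion (string identity) yields two separate types, while the lemma count merges them; the merging step is exactly what no purely mechanical identity criterion provides.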

Practical procedures

Morphological lemmatization
word form	lemma
a	a
all	all
are	be
as	as
C	C
concordance	concordance
corpus	corpus
each	each
forms	form
in	in
information	information
is	be
list	list
occurring	occur
of	of
on	on
saving	save
taken	take
text	text
the	the
there	there
tokens	token
type	type
types	type
where	where
word	word

Lemmatization comprises the following routine steps:

1. Transform the text corpus into a word list.
2. Create a concordance of the corpus, i.e. of all the items of the word list as they occur in the corpus.
3. On the basis of the concordance, assign the word forms to their lemmas.

A word list of text corpus C is a list of all the word forms (taken as types) occurring in C (saving the information of where in C there are tokens of each type). It is the lexicographer's task to exhaustively lemmatize that word list: each word form must be assigned to a lexeme (and, thus, to a lemma). For the first sentence of the current paragraph, the result would be as shown in the table above.
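A minimal sketch of such a word list, applied to the first sentence of the paragraph above, might look as follows. The function names are hypothetical, the tokenizer is crude, and the only normalization assumed is that the sentence-initial capital is lowered; `LEMMA_TABLE` reproduces just those rows of the word form/lemma table in which the lemma differs from the word form.

```python
import re
from collections import defaultdict

SENTENCE = ("A word list of text corpus C is a list of all the word forms "
            "(taken as types) occurring in C (saving the information of where "
            "in C there are tokens of each type).")

# Rows of the word form/lemma table where form and lemma differ;
# every other word form is assumed to be its own lemma.
LEMMA_TABLE = {"are": "be", "is": "be", "forms": "form", "occurring": "occur",
               "saving": "save", "taken": "take", "tokens": "token",
               "types": "type"}

def word_list(text):
    """Map each word form (type) to the positions of its tokens in the text."""
    positions = defaultdict(list)
    for index, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        # Assumption: only the sentence-initial token is lowercased.
        form = token.lower() if index == 0 else token
        positions[form].append(index)
    return dict(positions)

def lemmatize(forms):
    """Assign every word form of the word list to a lemma by table lookup."""
    return {form: LEMMA_TABLE.get(form, form) for form in forms}

wl = word_list(SENTENCE)
lemmas = lemmatize(wl)
print(wl["C"])                                # [6, 20, 27]
print(lemmas["is"], lemmas["are"])            # be be
print(len(wl), len(set(lemmas.values())))     # 24 22
```

The table lookup only records the result of lemmatization; deciding which lemma a form belongs to remains the lexicographer's task.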

The analysis of morphological variation, i.e. the assignment of word forms to lexemes, may be aided by a program for automatic lemmatization. The analysis of syntactic and semantic variation is aided by the production of concordances. Word lists and concordances can be generated automatically by Shoebox/Toolbox.
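What a concordance provides can be illustrated with a small keyword-in-context sketch. It does not reproduce the behaviour of Shoebox/Toolbox; the function name, the `width` parameter and the output layout are assumptions made for this example, and the two-sentence corpus is adapted from the discussion of types and tokens above.

```python
import re

def concordance(text, word_form, width=2):
    """List every token of word_form with `width` tokens of context per side."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token == word_form:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

corpus = ("The linguistic units occurring in texts are tokens of types. "
          "Tokens that are identical by criteria that an automaton can apply "
          "are subsumed under one type.")

for line in concordance(corpus, "are"):
    print(line)
# in texts [are] tokens of
# Tokens that [are] identical by
# can apply [are] subsumed under
```

Seeing each token in its context is what allows the lexicographer to judge syntactic and semantic variation, which the bare word list cannot show.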

Much of the traditional routine of lemmatization depends on the contingent fact that most of the inflectional variation found in most European languages takes place in desinences. It should be borne in mind that there are languages, e.g. the Celtic ones, where much inflectional variation takes place at the left word boundary. Here, automatic lemmatization reaches a higher level of complexity.

Syntagmatic unity of the lemma

Looking back at the notion of ‘lexicon’, we see that the lexicon comprises units of all the complexity levels, from the morpheme up to the phrase (phraseologism). All of the following units are stored in the English inventory:

1. -s 3rd Ps Sg, -s Nominal Plural, -ness Property Abstractor, pre- ‘before’, cran- ‘vitis-idaea’, ...
2. be, will, his, ...
3. study, lexicography, just, for, fun, ...
4. functional linguistics, type-token relation, moot point, ...
5. beat a dead horse, throw out the baby with the bathwater, just for fun, ...
6. Good morning, He who pays the piper calls the tune, ...

Most dictionaries concentrate on level 3, i.e. what Jane Doe understands by the notion ‘word’. Dictionaries devoted to the linguistic system (rather than encyclopedic dictionaries) will include items of level 2. Depending on their scope, dictionaries will include items of level 4 to some extent, although some, like moot point, will hardly serve as a lemma. Finally, units of levels 5 and 6 will never have the status of lemma. Those of level 5 may be included under some relevant word lemma. For instance, beat a dead horse may be lemmatized under horse. Proverbs like the one concerning the piper and the tune will simply be absent from a general dictionary. They may be listed in specialized dictionaries.

Dependent morphemes are neglected or treated unsystematically in general dictionaries. Some like pseudo- and tele- may crop up as lemmas, but others like cran- and -less are not normally granted that status. This appears to be a tradition that is not founded on a sound reason (cf. Antal 1963). Particularly in the case of little-described languages, there will be many users who are not knowledgeable either in the structure of word forms of the language or in the European customs of lexicography. A dictionary is complete and useful only if it comprises a morphemicon (inventory of morphemes). Dependent morphemes are then lemmatized with a hyphen at the bound side.

Complex expressions present problems both for lemmatization and for selection. The latter problem will be treated in the section on selection. The lemmatization problem is the appropriate form of the lemma. Consider an example: Walking distance itself will be a lemma in a comprehensive dictionary, but what about the phrase within walking distance?

Is it a lemma by itself, registered after the lemma within in the alphabetical order?
Or is it a subentry under the lemma walking distance (or under the lemma within)?

The former solution is abhorred by most dictionaries. However, if one of the words contained in the phrase is chosen as lemma, then the problem of the proper choice of the relevant component arises: will beat a dead horse be lemmatized under beat or under dead or under horse?

Another kind of complex expression that is usually listed under a simple expression is phrasal verbs. Thus, take on, take up, take down are all listed under the lemma take. This kind of microstructure is called nesting. Many of these lexicographic devices are due to the alphabetical arrangement of a general print dictionary and to hypotheses on where the user may try to find an expression. These problems essentially disappear in an electronic dictionary, since sort order does not play a role there. And the fact that take up etc. are related to take in various ways will be brought out by the content of appropriate fields of the respective entries, in this case, trivially, by the shape of the lemmas themselves.
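The contrast between nesting and a flat electronic entry list can be sketched as a data structure. The class `Entry` and the function `related_by_shape` are hypothetical names introduced for illustration; a real lexical database would carry many more fields.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    lemma: str
    # Nested lemmas, as in the microstructure of a print dictionary.
    subentries: list = field(default_factory=list)

# Print-style nesting: the phrasal verbs live inside the entry for take.
print_entry = Entry("take", subentries=[Entry("take on"), Entry("take up"),
                                        Entry("take down")])

# Electronic style: every expression is an entry of its own; order is irrelevant.
flat_entries = [Entry("take up"), Entry("take"), Entry("take down"),
                Entry("take on")]
index = {entry.lemma: entry for entry in flat_entries}

def related_by_shape(entries, base):
    """Recover the relation to `base` from the shape of the lemmas alone."""
    return sorted(e.lemma for e in entries
                  if e.lemma != base and e.lemma.split()[0] == base)

print("take up" in index)                      # True
print(related_by_shape(flat_entries, "take"))  # ['take down', 'take on', 'take up']
```

In the flat list the user reaches take up directly, and its relation to take is still recoverable, here trivially from the first word of the lemma.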

Paradigmatic unity of the lemma

Lemmatization involves reduction of variation of various kinds:

Orthographic variation, e.g. encyclopedia ~ encyclopaedia under encyclopedia. Here the lemma commonly represents a standard, while the variants are considered deviations. Some of the orthographic variation is rule-governed, as the variation between color ~ colour, neighbor ~ neighbour etc., some is idiosyncratic, as the variation between heron and hern.
There are two kinds of morphological variation: Allomorphy leads to such variant forms as keep ~ kep (in kept) as allomorphs of the morpheme keep, or -ed ~ -t (in beeped and kept, resp.) as allomorphs of the morpheme Past. Inflection leads to such variant forms as make ~ makes ~ made ~ making all under make, goose ~ geese under goose. The lexeme of an inflection paradigm is represented in some citation form, which is stipulated in the lexicographic tradition of the language in question.
In morphological variation, inflection must be distinguished from word-formation. Lemmatization must recognize inflected forms and reduce them to the respective lexeme and citation form. Derived forms, however, must not be reduced to their base. They may constitute lemmas of their own, or they may be nested under their base-lemma.
Semantic variation, e.g. wing ‘body part of insect or bird’ ~ wing ‘lateral part of a building’ all under wing ‘lateral extremity of something’. Here, the senses of a polysemous item are considered variants of the general meaning of the lemma, in contradistinction to homonymous items, which are granted their own lemma.

That variation which is rule-governed need not be taken up in the entry list. That is to say, in a dictionary of American English, neither do we need a reference entry colour that refers us to color, nor do we need to list, under the lemma color, the form colour as a variant. Instead, the appropriate main section of the dictionary, in this case the orthography section, will formulate the rule and provide a couple of examples. The same goes for regularly inflected forms and for such metaphors that follow productive metaphor schemata.

On the other hand, irregular variants are noted once in every lexicon and twice in a print dictionary:

They are, at any rate, noted in the lexical field ‘variants’ or in the field containing irregular inflection forms or in the field defining related senses.
In addition, irregular orthographic and morphological variants receive the status of reference lemmas in a print dictionary.

Reference entries

Lemmatization is a decision in favor of one form of an expression which is considered its (proper) citation form, and against all the other forms which are not. This decision is principled in many cases; in other cases it is more or less discretionary. However, even in the principled cases the dictionary user cannot be expected to know the lemmatization rules by heart, and much less can he be expected to guess the motives behind discretionary decisions. Moreover, many dictionaries, esp. bilingual ones, are destined for users with imperfect knowledge of the language in question, who are not expected to know that worse is a form of bad. Therefore, in order not to frustrate the user who searches for a certain expression in the dictionary, reference entries are set up. Such an entry reduces to a (pseudo-)lemma and a reference to the appropriate lemma. For instance:

worse: see bad.
encyclopaedia: see encyclopedia.

No substantive information on the item bad/worse is provided under the reference entry worse; it is all reserved for the entry bad (where, of course, the irregular comparative is mentioned).
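The two-step lookup that a reference entry imposes on the user of a print dictionary can be sketched as follows. The dictionary layout with the keys `see` and `info` is an assumption made for this example, and the `info` strings are placeholders rather than real entry text.

```python
# Hypothetical miniature print dictionary: full entries carry "info",
# reference entries carry only a "see" pointer to the appropriate lemma.
PRINT_DICTIONARY = {
    "bad": {"info": "(placeholder entry text; mentions the comparative worse)"},
    "encyclopedia": {"info": "(placeholder entry text)"},
    "worse": {"see": "bad"},
    "encyclopaedia": {"see": "encyclopedia"},
}

def look_up(headword):
    """Follow reference entries until a full entry is reached.

    Returns the lemma of the full entry and the number of lookups needed.
    """
    entry = PRINT_DICTIONARY[headword]
    steps = 1
    while "see" in entry:
        headword = entry["see"]
        entry = PRINT_DICTIONARY[headword]
        steps += 1
    return headword, steps

print(look_up("worse"))          # ('bad', 2)
print(look_up("encyclopaedia"))  # ('encyclopedia', 2)
print(look_up("bad"))            # ('bad', 1)
```

The reference entry itself holds no substantive information; it costs the user one extra lookup.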

Reference entries are a concession to the user which counteracts the principles of lemmatization. They are necessary in print dictionaries. In electronic dictionaries, such variants are integrated in the entry of the standard lemma, as follows:

orthographic variants are enumerated in the field ‘orthographic variants’;
irregular inflected forms are enumerated in the field ‘inflection paradigm’;
semantically irregular senses are defined in the field ‘meaning’.

If the user executes a lemma search in an electronic dictionary, the algorithm will look up not only the lemma field, but also the fields ‘citation form’, ‘orthographic variants’ and ‘inflection paradigm’. If the user executes a semantic search, the algorithm will look up not only the primary sense, but all of the senses of an entry. In that way, the two-step procedure necessary in a print dictionary is reduced to a single step:

the user finds the information that what he had in mind is a variant of some standard form;
at the same time, he is presented with the standard lemma and all the information associated with it and with the variant that he had in mind.
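A lemma search of this kind might be sketched as below. The field names follow the text; the entry layout, the function `lemma_search` and the sample data are assumptions made for illustration, not the design of any particular dictionary software.

```python
# Hypothetical entries of an electronic dictionary; variants are stored
# inside the entry of the standard lemma instead of as reference entries.
ENTRIES = [
    {"lemma": "bad", "citation form": "bad",
     "orthographic variants": [], "inflection paradigm": ["worse"]},
    {"lemma": "encyclopedia", "citation form": "encyclopedia",
     "orthographic variants": ["encyclopaedia"], "inflection paradigm": []},
]

LIST_FIELDS = ("orthographic variants", "inflection paradigm")

def lemma_search(query):
    """Return (standard lemma, matching field) for every entry that matches."""
    hits = []
    for entry in ENTRIES:
        if query in (entry["lemma"], entry["citation form"]):
            hits.append((entry["lemma"], "lemma"))
            continue
        for name in LIST_FIELDS:
            if query in entry[name]:
                hits.append((entry["lemma"], name))
                break
    return hits

print(lemma_search("worse"))          # [('bad', 'inflection paradigm')]
print(lemma_search("encyclopaedia"))  # [('encyclopedia', 'orthographic variants')]
print(lemma_search("bad"))            # [('bad', 'lemma')]
```

A single call tells the user both that the query is a variant (and of which kind) and which standard lemma carries the full information.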

Literature

Hausmann 1977, ch. 2