A corpus (Lehmann 2007[D]) is a collection of texts of a certain category that is accessible as a self-contained whole. The following properties are relevant to the notion and may define different kinds of corpora:


For the researcher, the corpus is a means toward some end. The approaches of the philologist and of the linguist to a corpus typically differ in this respect:

At any rate, a corpus is a set of processed primary data. Producing such a set of data is a scientific task in itself. Other scientists may then rely on and use such data.


Corpora may be set up for many different purposes. If the purpose is to document a language, one will try to produce what may be called a general-purpose corpus. This is a corpus that aims to represent the bandwidth on all the dimensions of variation in a language. This may be achieved by taking a theory of the speech situation as the point of departure, systematically varying all the parameters involved, and crystallizing text genres as specific combinations of values on these parameters. This is attempted in the table ‘Parameters of speech situation and text genres’ (Lehmann 2001, §5.2).
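The idea of varying speech-situation parameters systematically can be sketched as a simple enumeration. The following is only an illustrative sketch: the parameter names and values (`medium`, `interaction`, `planning`) and the helper functions are hypothetical and are not taken from the table in Lehmann 2001. Each combination of values stands for a slot that a text genre might fill, and the corpus can be checked for slots that are still empty.

```python
from itertools import product

# Hypothetical parameters and values, chosen only for illustration;
# they are NOT taken from the table in Lehmann 2001.
PARAMETERS = {
    "medium": ["spoken", "written"],
    "interaction": ["monologue", "dialogue"],
    "planning": ["spontaneous", "planned"],
}


def enumerate_situations(parameters):
    """Return every combination of parameter values as a dict."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]


def coverage_gaps(situations, corpus_texts):
    """List the combinations for which the corpus contains no text yet."""
    covered = [text["situation"] for text in corpus_texts]
    return [s for s in situations if s not in covered]


situations = enumerate_situations(PARAMETERS)

# A toy corpus with a single (hypothetical) text.
corpus_texts = [
    {
        "title": "folk tale told to the fieldworker",
        "situation": {
            "medium": "spoken",
            "interaction": "monologue",
            "planning": "spontaneous",
        },
    },
]

gaps = coverage_gaps(situations, corpus_texts)
print(f"{len(situations)} combinations, {len(gaps)} not yet represented")
```

With three binary parameters there are eight combinations. A real design would use more parameters, and not every combination need correspond to a genre that actually exists in the speech community.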

In the collection of data for the purpose of describing a language, a text corpus enjoys a privileged role. Natural texts in a corpus present linguistic data in their context. A corpus is, thus, closer to representing the ultimate substrate of linguistics than other kinds of linguistic data.

Given that there are two principal ways of obtaining linguistic data, viz. from a corpus of texts or by elicitation from native speakers, descriptive linguists have at times disputed which of the two should take priority. Since the 1980s, there has been a notion that only natural texts are a reliable data basis, while elicitation is unreliable. The debate may be concluded by stating that neither of the two methods of data collection is sufficient alone; they must be combined.

Comparative corpora

For certain purposes of general-comparative linguistics, including typology, it is useful to have translation equivalents of a given text in different languages. The Bible is an often-used example; another is the set of translations of Le petit prince by Saint-Exupéry. Besides their undeniable advantages, such sets of texts have the disadvantage of being translations of an original. A translation is less spontaneous than its original, and its style may not be representative of natural texts of the language.

A method of overcoming this disadvantage is to have speakers of different languages tell a narrative about the same subject matter, e.g. a silent movie that they watched. The pear stories (Chafe [ed.] 1980) are a case in point. The degree of comparability of the texts so produced depends on many factors; but in general the method has proved fruitful for the comparison of strategies put to work in the solution of specific tasks of cognition and communication.