Variation is a basic fact about language and has to be accounted for by every linguistic theory. One of the very first issues in designing a research project is how one is going to deal with the kinds of variation that appear in the data. We will first discuss variation in the data and then theoretical approaches to variation.

Variation in the data

The general object of any linguistic description is a certain language.¹ Naturally, no single scientific investigation can hope to grasp a language in its entirety. Even if the topic is restricted to a very specific problem of the language system, e.g. the structure of the relative clause, the object language still has to be delimited by reference to the dimensions of variation of language use (diachronic, diatopic, diastratic, diaphasic).

Here the task of the linguist is Janus-headed:

Delimit the variety to be described explicitly against everything else. For instance, if all your data comes from one dialect of the language, then that dialect, and not the entire language, is the object of your study.
See to it that the remaining variation allowed for by your definition and actually occurring in the variety you are describing is adequately represented in your data.

On the one hand, this twofold principle demands that researchers diversify their database so that it exhibits all the variation possible within the scope of their concern. On the other hand, if one has not been able to represent the factual variation adequately in one's data, or if variants have crept into the data that belong to varieties that had been excluded in the delimitation of the object, then data representing varieties that one cannot responsibly account for may be eliminated. For instance, it is often practically impossible to obtain data from all the dialects of the object language. Moreover, as a consequence of mobility, the speech community providing the informants and the data may be heterogeneous in terms of provenance and, thus, of dialects spoken. Consequently, in analysing the data, it may be necessary to discard data stemming from other dialects.

If one wants to hedge one's empirical generalizations statistically, one must obtain a sample of data that is statistically representative. For a sample to be representative means that every element in the population has the same chance of being included in the sample. The safe way of achieving this is to take a random sample. This is often not possible for practical reasons. In that case, one may structure the sample along a parameter that is of practical importance and that may prove influential, and then take a sample of equal size for each of the values of that parameter. For instance, if I study Shakespeare's language, I might take a random sample from all of his works. The more practical, and probably more interesting, alternative is to take 100 running lines from each of five selected pieces and code all of the data as to which of the pieces they stem from. One may then observe the variation in each of the five samples, calculate the statistical parameters for them, and, by comparing these (technically, by calculating the standard error), estimate how likely it is that the five selected pieces are representative of the population, i.e. Shakespeare's work (technically, that any other five Shakespearian pieces would have yielded the same statistical mean).
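The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the function names (`running_sample`, `standard_error`, `summarize`) are hypothetical, the measured variable is assumed to be a single number per line, and the "pieces" are invented toy data rather than figures from any real corpus.

```python
import math
import random
import statistics


def running_sample(lines, n, rng):
    """Take n consecutive ("running") items starting at a random offset."""
    if len(lines) < n:
        raise ValueError("piece is shorter than the requested sample size")
    start = rng.randrange(len(lines) - n + 1)
    return lines[start:start + n]


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def summarize(pieces, n, seed=0):
    """Sample n running items per piece; return per-piece means,
    the mean of those means, and its standard error."""
    rng = random.Random(seed)
    means = {name: statistics.mean(running_sample(lines, n, rng))
             for name, lines in pieces.items()}
    values = list(means.values())
    return means, statistics.mean(values), standard_error(values)


# Toy data only: each "piece" is a list of invented per-line measurements
# (say, words per line). No real corpus figures are used here.
toy_rng = random.Random(42)
toy_pieces = {
    f"piece_{i}": [toy_rng.randint(5, 11) for _ in range(500)]
    for i in range(1, 6)
}

means, grand_mean, se = summarize(toy_pieces, n=100)
for name, m in means.items():
    print(f"{name}: mean = {m:.2f}")
print(f"grand mean = {grand_mean:.2f}, standard error = {se:.2f}")
```

A small standard error relative to the grand mean suggests that the five samples behave alike with respect to the measured variable; a large one suggests that the choice of pieces matters and that the parameter "piece" is indeed influential.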

Similarly, if one strives for representativeness of one's data of a language, one may try to achieve it for those parameters that one knows, i.e. that one knows the possible values of. For instance, if I want my generalizations to be valid both for the spoken and for the written variety of a language, I take two samples, one of each variety.

In linguistics, the proportion that each of the possible parameter values occupies in the population is practically never known. For instance, it may seem relevant to represent the spoken and the written variety of a language proportionately according to their quantitative share in everyday communication. However, this notion is impossible to operationalize. Consequently, representativeness of linguistic data practically never means statistical representativeness. In practice, one must be content to have the existing varieties represented in the corpus at all. A proposal for a general-purpose corpus is found on another page.

Variation in linguistic theory

Any empirical generalization is a statement about a principle obtaining in the variation. Since linguistic activity is goal-directed, it obeys a teleonomic hierarchy. Consequently, any generalization holds at a certain level of abstractness. It ascertains something which, at that level, may appear as an invariant, but which at a higher level of the hierarchy may just be a variant means to a higher goal.

Consider color terminology as an example. Starting with Berlin & Kay 1969, there was much comparative linguistic research on basic color terminology. This research found that the basic color terms of a language form a system structured by a few implicational generalizations, which permitted setting up a hierarchy of basic color terms. This, in turn, was related to the physiology of color perception.
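What an implicational generalization amounts to can be made concrete with a small Python sketch. It assumes a simplified version of the ordering commonly cited from Berlin & Kay 1969 (white and black, then red, then green and yellow, then blue, then brown, then the remaining terms); the tier list and the function name `missing_terms` are illustrative assumptions, and the check deliberately ignores the ordering of terms within a tier.

```python
# Simplified tiers, assumed for illustration: having any term from a later
# tier is taken to imply having all terms from every earlier tier.
TIERS = [
    {"white", "black"},
    {"red"},
    {"green", "yellow"},
    {"blue"},
    {"brown"},
    {"purple", "pink", "orange", "grey"},
]


def missing_terms(terms):
    """Return the terms that the hierarchy predicts but the inventory lacks."""
    terms = set(terms)
    # Index of the latest tier from which the inventory has at least one term.
    highest = max((i for i, tier in enumerate(TIERS) if terms & tier),
                  default=0)
    # Everything in the tiers before that one is implied.
    required = set().union(*TIERS[:highest])
    return sorted(required - terms)


# Hypothetical inventories, not data from any actual language.
print(missing_terms({"white", "black", "red"}))
print(missing_terms({"white", "black", "blue"}))
```

An inventory for which the function returns an empty list conforms to the assumed hierarchy; a non-empty list names the terms whose absence would count as a counterexample to the implication.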

Now the existence of basic color terms in languages is an empirical fact which is not deducible from any theory. There might be a language that lacks basic color terms. According to Levinson 2000, Yélî Dnye is such a language. The issue is what to make of the methodological situation created by that finding. There was a set of generalizations on basic color terminology that were thought to represent a universal of human language. These generalizations have now been falsified in a certain respect. However, just as it was wrong to mistake the generalizations for universals, so it would be wrong to discard them now as worthless because they were falsified. Instead, what we have is a set of languages (apparently the vast majority) that follows one principle, providing basic terms for a set of colors, and another set of languages (apparently a small minority) that designates color perceptions by reusing words from other semantic areas. The two alternative principles are evidently variants subordinate to an even more abstract principle having to do with color perception and its linguistic representation. This principle remains to be found. A scientist who finds that no generalization is possible at a certain level does not throw in the towel or conclude that this is an important argument for linguistic relativism or particularism; rather, the lesson is that the principle was not abstract enough and that a higher principle must be sought.

More discussion of theoretical aspects of linguistic variation is to be found on the website ‘Sprachtheorie’, section 11.

¹ Otherwise the work is not descriptive, but comparative, so the variation is interlingual.