Kinds of data

The phenomena that provide the heuristic point of departure for a discipline are only its ultimate substrate, which is not itself processed by the scientist. The data of a discipline are always representations of such phenomena. That is, they are essentially second-order entities by the ontology of naive realism. Thus, the ultimate substrate of linguistics is the totality of actual linguistic activity occurring in the world. Linguistic data are recordings and representations of a set of such phenomena. Linguistic data may be classified by several parameters:

By the criterion of whether the data have a historical (spatio-temporal) identity or are stripped of it by abstraction, they are primary vs. secondary data.
Primary data are subdivided into processed data and raw data by the criterion of whether they are symbolically represented or not.
Cross-classifying with the former distinction, primary data are also subdivided into original recordings and derived representations by the criterion of whether the representation directly reflects the ultimate substrate or is derived from another representation.

This categorization of linguistic data types is visualized in the following schema:

To give some examples:

A video-recording of some communicative event is a piece of raw primary data in its original recording.
A transcription of an audio-recording is a piece of processed primary data in a derived representation.
The same goes for a copy of a written text in an electronic corpus.
Example sentences in a grammar or dictionary – assuming they are data at all – are secondary data.
Annotations of a text in a corpus are likewise secondary data.
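
The classification can also be written down as a small data structure. The following is only an illustrative sketch, assuming that the three criteria can be modeled as one boolean and two enumerations. The names LinguisticData, Form and Derivation are hypothetical and do not stem from Lehmann 2004.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Form(Enum):
    """Whether primary data are symbolically represented (hypothetical name)."""
    RAW = "raw"              # not symbolically represented, e.g. audio or video
    PROCESSED = "processed"  # symbolically represented, e.g. a transcription


class Derivation(Enum):
    """Whether a representation reflects the substrate directly (hypothetical name)."""
    ORIGINAL = "original recording"
    DERIVED = "derived representation"


@dataclass(frozen=True)
class LinguisticData:
    description: str
    # True: the data keep their spatio-temporal identity (primary data).
    # False: that identity has been abstracted away (secondary data).
    has_historical_identity: bool
    # The two subdivisions below apply to primary data only.
    form: Optional[Form] = None
    derivation: Optional[Derivation] = None

    def __post_init__(self) -> None:
        subdivided = self.form is not None and self.derivation is not None
        unspecified = self.form is None and self.derivation is None
        if self.has_historical_identity and not subdivided:
            raise ValueError("primary data need both a form and a derivation")
        if not self.has_historical_identity and not unspecified:
            raise ValueError("secondary data are not subdivided in this scheme")

    def label(self) -> str:
        if not self.has_historical_identity:
            return "secondary data"
        return f"{self.form.value} primary data, {self.derivation.value}"


video = LinguisticData(
    "video-recording of a communicative event",
    True, Form.RAW, Derivation.ORIGINAL,
)
transcript = LinguisticData(
    "transcription of an audio-recording",
    True, Form.PROCESSED, Derivation.DERIVED,
)
grammar_example = LinguisticData("example sentence in a grammar", False)

for item in (video, transcript, grammar_example):
    print(f"{item.description}: {item.label()}")
```

The check in __post_init__ encodes the point that the two subdivisions cross-classify primary data only; whether secondary data could be subdivided further is left open here.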

For more discussion, see Lehmann 2004.

Sources of data

Data have the methodological function of being taken for granted in a piece of research. In order to fulfill this function, they must originate outside the researcher; and although they may owe their sheer existence to the researcher (because he somehow elicited them), their particular properties must be independent of the researcher. Examples formed by the researcher are not data because they fail this condition.

Generation of linguistic data

1. Linguistic data may pre-exist the research. In this case, the researcher just records, copies or notes them down.
2. They may be generated as part of the research.
   a. A speaker may be provided with a stimulus provoking him to generate utterances of some desired kind.
      - The stimulus may be a linguistic stimulus, e.g. a sentence of the lingua franca to be translated or a questionnaire to be filled in. This is the most common form of elicitation of linguistic data.
      - It may be a non-linguistic stimulus, e.g. a task to be performed or a request to produce a text on some topic. This is a common form of staged communication.
   b. A speaker may be confronted with linguistic data produced by the researcher and may react to them by an informant judgement.

Relatively uncontrolled generation of data, as in #1 and #2a above, and relatively controlled generation of data, as in #2b above, complement each other in linguistic research. The former guarantees the availability of spontaneous, natural data that illustrate diverse linguistic varieties. The latter is needed to complete a systematic representation of the linguistic system.

Data vs. examples

In linguistics, an example is an expression in some language that has the function of illustrating a statement or concept. The statement or concept may be relatively abstract; the example then renders it more concrete and shows how it is meant to be operationalized, i.e. to be matched with linguistic phenomena. To this end, the example will prominently display exactly those features which are at issue in the statement or essential for the concept, contextualized to the extent necessary. Contrariwise, it will be stripped of those features which are immaterial to the argument and would only distract or confuse the reader, who needs the example in order to understand.

Data and examples have different functions in linguistic discourse:

Data have a function in scientific methodology: they are used as evidence to back a scientific statement.
Examples have a function in didactic communication; they are used to illustrate a general statement or concept by a concrete case falling under it.

In principle, an example can be a piece of linguistic data. It will then usually be a piece of secondary data, since the spatio-temporal coordinates of a particular speech event are seldom relevant in an example. Secondary data are still data, i.e. objects which do not owe their properties to the person who uses them to back a claim. The author is not free to adapt a piece of data so that it better fulfills its function as an example and yet to present it as a piece of data. For this reason, many linguistic examples are made up by the author. This is typically the case in textbooks, but is also common in scientific treatises. Such examples are not linguistic data. This does not mean that they are inadmissible. A constructed example may fulfill its demonstrative function much better than any available piece of data. Moreover, if an author uses a linguistic expression as an example rather than for scientific proof, it makes no sense to require a source for the example.

Since the last third of the twentieth century, it has become more and more customary in linguistics to mark the difference between data and examples in some formal way. In particular, since the functions of scientific proof and of didactic illustration are often combined in a scientific treatise, an example may be marked for its status as a piece of data. To this end, it is provided with a reference to its source. The source itself is listed in detail, and the reader may check whether it represents actual speech events. Contrariwise, if an example is presented without such a reference, the reader assumes that it was made up by the author.

References

Lehmann, Christian 2004, "Data in linguistics." The Linguistic Review 21(3/4):275-310.