Provenience of linguistic data

Data have the methodological function of being taken for granted in a piece of research. In order to fulfill this function, they must originate outside the researcher; and although they may owe their very existence to the researcher (because he somehow elicited them), their particular properties must be independent of the researcher. Examples constructed by the researcher are not data because they fail this condition.

Generation of linguistic data

  1. Linguistic data may pre-exist the research. In this case, the researcher just records, copies or notes them down.
  2. They may be generated as part of the research.
    1. A speaker may be provided with a stimulus provoking him to generate utterances of some desired kind.
      • The stimulus may be a linguistic stimulus, e.g. a sentence of the lingua franca to be translated or a questionnaire to be filled in. This is the most common form of elicitation of linguistic data.
      • It may be a non-linguistic stimulus, e.g. a task to be performed or a request to produce a text on some topic. This is a common form of staged communication.
    2. A speaker may be confronted with linguistic data produced by somebody else, including the researcher, and may react to them by an informant judgement.

Relatively uncontrolled generation of data, as in #1 and #2.1 above, and relatively controlled generation of data, as in #2.2 above, complement each other in linguistic research. The former guarantees the availability of spontaneous, natural data that illustrate diverse linguistic varieties. The latter is needed to complete a systematic representation of the linguistic system.

Identification of the source of data

In the past, i.e. roughly up to the middle of the 20th century, standards for the identification of the source of linguistic data were rather lax. The philologies were an exception: there, standards used to be (and still are) relatively strict. A sample drawn from the literature has usually been provided with an indication of the author, work and section quoted. Only in publications intended for the inner circles could this be omitted, because sapienti sat.

In descriptive linguistics, the source of the data used to be left in the dark. Some grammars, including all of the colonial grammars (e.g. San Buenaventura 1684), make no mention of their sources at all. More modern grammars, e.g. Dixon 1972, at least list the informants in the preface, without, however, identifying the sources of the examples or even of the texts. Only since about the last quarter of the 20th century have standards for identifying the sources of linguistic data become stricter (Berez-Kroeker et al. 2018).

There are many good reasons why the source of every text and every example sentence in a linguistic description should be identified. This goes without saying for data quoted from published sources. It is also true for unpublished data:

All of this amounts to the following requirements:

For data used in a linguistic work, the following should be done in the published text: