Linguistic data

Linguistic data have a set of properties that linguists are interested in and that they find more or less natural to attribute to the data. Among these are:

a. Sentence_i is grammatical/ungrammatical.
b. Sentence_i is meaningful/senseless.
c. Sentence_i is appropriate/inappropriate as an utterance in a certain situation.
d. Sentence_i means the same as / something different from sentence_j.
e. The meaning of sentence_i is X.

All of the properties #a – #e are basic to linguistic analysis, and the necessity of deciding the alternatives will here be taken for granted. Given a corpus of infinite size and variety, all of them would be decidable by appropriate methods of corpus analysis. Given the limitations of actual corpora, asking informants for their judgement is often the only feasible way of actually deciding them. The linguist must, however, be aware that most of these properties of linguistic units are of no interest to the layman, that the latter has no clear concept of them (they are tricky enough for the linguist)¹ or may entertain concepts that differ from those of the linguist. The attitude of the layman to linguistic material differs from the professional linguist's attitude in at least two respects:

1. Every piece of linguistic behavior is pragmatically embedded in a linguistic and extralinguistic context. It is not an isolatable object in itself, but is exclusively understood as a means to solve a certain cognitive and communicative problem in a given speech situation or text.
2. Moreover, the lower the grammatical or semiotic level of the linguistic unit in question, the less accessible it is to consciousness and reflection; the lowest-level items (allophones and allomorphs) are normally processed subconsciously and automatically. It takes phonetic and linguistic training to isolate these units for reflection. See also ‘Bewußtsein von Sprachwandel’ (‘Awareness of language change’).

From #1 it follows that the layman has no concept of linguistic structure, no matter whether it is phonological, grammatical or semantic structure. For him, the objects of reflection are not “system sentences”, but utterances in their pragmatic context. Consequently, some of the items #a – #e are more easily accessible to the layman, while others presuppose in him just that linguistic awareness and methodological skill the linguist himself needs in resolving such issues.

Alternative #c concerns the appropriateness or acceptability of a linguistic unit. According to #1, it is the easiest of all. It suffices to reframe the issue in such terms as “Imagine a certain situation; then would you say so-and-so?”, and any competent speaker may be expected to come up with a relatively reliable decision.

As for property #a (grammaticality), it is a corollary of #2 that the formation of a linguistic unit out of lowest-level units is exempt from conscious motivation; instead, the unit simply belongs (or does not belong) to the language system. Consequently, the layman can tell with relative certainty whether such a unit does or does not correspond to his linguistic usage. This contrasts with his approach to higher-level units such as sentences. These are by definition not part of the language system; instead, they raise for the layman the question of motivation, which, in the linguist's view, leads him astray into issue #c.

This means that the layman may with some certainty pass judgement on the grammaticality of low-level linguistic units, e.g. word-forms. However, given #1, a valid judgement cannot be obtained from him by confronting him with some such unit U in isolation and asking for its grammaticality. Instead, U must be embedded in a suitable context. This again complicates linguistic methodology, because if the utterance containing U is not accepted, this may be due to a problem of U or a problem of the framing context or of the combination of both.

Some brands of structural linguistics rely on the availability of secure judgements of type #a on sentences. All relevant empirical investigations have led to the result that above a minimum degree of grammatical complexity, there is enormous variation in informants' (including, importantly, linguists') grammaticality judgements (Schütze 1996). It is now recognized that getting grammaticality judgements from informants on linguistic units that are not part of the language system is just not a reliable scientific method. Moreover, it may be doubted whether the grammaticality or otherwise of a sentence such as “The picture of himself did not appeal to John” is of any scientific interest at all.

#b (meaningfulness) may be approached via #c by using sentence_i as an utterance in appropriate contexts. If #c is positive, then #b is positive. If no context for sentence_i may be found that makes #c positive, then #b is negative.
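The reduction of #b to #c can be sketched as a small decision procedure. The following Python fragment is only an illustration under stated assumptions: the function names, the list of contexts and the toy judgements are all hypothetical and merely stand in for an informant's answers to questions of type #c.

```python
from typing import Callable, Iterable

def is_meaningful(sentence: str,
                  contexts: Iterable[str],
                  is_appropriate: Callable[[str, str], bool]) -> bool:
    """Return True as soon as one context makes judgement #c positive."""
    return any(is_appropriate(sentence, context) for context in contexts)

# Invented toy judgements; they stand in for an informant's answers.
TOY_JUDGEMENTS = {
    ("It is raining.", "looking out of the window"): True,
    ("It is raining.", "ordering food"): False,
}

def toy_informant(sentence: str, context: str) -> bool:
    # Pairs that were never recorded count as "not appropriate" here.
    return TOY_JUDGEMENTS.get((sentence, context), False)

CONTEXTS = ["looking out of the window", "ordering food"]
```

A positive result is conclusive once a single suitable context has been found. A negative result holds only relative to the contexts that were actually tried.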

#d (synonymy) is much more difficult. For two linguistic units to be synonymous implies for them to be in free variation (see explanation of synonymy). While free variation may be ascertained rather easily for lower-level units because possible contexts may be readily classified, it is impossible to verify for higher-level units such as sentences because their sense depends on properties of the individual speech situation, and these are next to unclassifiable. The typical working situation here is that two sentences seem prima facie synonymous to the informant (or even the analyst) until he has found a discriminating context. The latter, however, may never happen. Thus, the hypothesis of the full synonymy of two linguistic units is falsifiable, but not verifiable. Therefore the solidity of a positive answer to alternative #d depends on an unfailing linguistic intuition.
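The asymmetry between falsifying and verifying synonymy can be made concrete as a search for a discriminating context. The sketch below is hypothetical throughout: the sentence labels, the context labels and the judgement table are invented, and `is_appropriate` again stands in for an informant's answer to a question of type #c.

```python
from typing import Callable, Iterable, Optional

def find_discriminating_context(s_i: str, s_j: str,
                                contexts: Iterable[str],
                                is_appropriate: Callable[[str, str], bool]
                                ) -> Optional[str]:
    """Return the first context in which only one of the two sentences
    is appropriate, or None if none of the tested contexts separates them."""
    for context in contexts:
        if is_appropriate(s_i, context) != is_appropriate(s_j, context):
            return context
    return None

# Invented judgements: S1 fits all three contexts, S2 only the first two.
TOY_JUDGEMENTS = {
    ("S1", "c1"): True, ("S1", "c2"): True, ("S1", "c3"): True,
    ("S2", "c1"): True, ("S2", "c2"): True, ("S2", "c3"): False,
}

def toy_informant(sentence: str, context: str) -> bool:
    return TOY_JUDGEMENTS.get((sentence, context), False)
```

With only `c1` and `c2` tested, the function returns `None`, which means "not falsified so far" and not "synonymous". Adding `c3` to the tested contexts falsifies the hypothesis.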

#e (meaning) is the most difficult of all. For the assertion to be sensible, X must be a semantic representation of sentence_i. Some linguists may be able to produce one; certainly no layman ever is. If one wants at all to involve an informant in what is properly the task of the linguist, there may be workable ways of lowering one's requirements. Among these are:

Ask for paraphrases of sentence_i. The meaning of sentence_i may then be in the intersection of the meanings of the paraphrases. This is methodologically useful only if the paraphrases are in some sense simpler than sentence_i.
Single out some items from sentence_i, especially a certain word, and ask for its meaning. Again, the informant's answer is likely to be a set of near-synonyms; and it remains for the linguist to figure out what they have in common.
Draft an analysis in terms of semantic features / component predicates and ask the informant whether these are comprised by the meaning of sentence_i. The problems in resolving this issue are as for #d.
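The idea that the meaning sought lies in the intersection of the meanings of the paraphrases (or of the near-synonyms offered) can be illustrated with plain set intersection. This is a deliberately crude sketch: it assumes that each paraphrase has already been analyzed into a set of semantic features, which is exactly the part of the work that remains with the linguist, and the feature labels are invented.

```python
from typing import Iterable, Set

def common_core(feature_sets: Iterable[Set[str]]) -> Set[str]:
    """Return the features shared by all paraphrases (empty if none are given)."""
    sets = [set(features) for features in feature_sets]
    if not sets:
        return set()
    return set.intersection(*sets)

# Invented feature analyses of three paraphrases of one sentence.
PARAPHRASES = [
    {"motion", "agentive", "fast"},
    {"motion", "agentive", "on_foot"},
    {"motion", "agentive"},
]
```

Here `common_core(PARAPHRASES)` yields the two features shared by all three analyses. Whether these are in fact comprised by the meaning of sentence_i still has to be checked as described for #d.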

A frequently occurring case is the following: The language has a morphological paradigm composed of values which are in opposition. For instance, the language has a simple past like Spanish trabajé ‘I worked’ and a compound past like he trabajado ‘I have worked’. On first encounter, the researcher has no idea what the difference between them is. It is then completely useless to ask an informant about it – how could he know? In this particular case, it would not even help to embed the verb form in a complete sentence. Some informants are very imaginative in inventing possible semantic differences between two such forms. The only sound way is to gather occurrences of the forms in question in texts and to analyze their distribution.
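A first step of such a distributional analysis can be sketched as counting which words occur near each form. The fragment below is a minimal illustration, not a full method: the function name and the window size are assumptions, tokenization is by whitespace only, and the three-sentence corpus is invented for the purpose and proves nothing about Spanish.

```python
from collections import Counter
from typing import Iterable

def context_profile(corpus: Iterable[str], form: str, window: int = 2) -> Counter:
    """Count the tokens found within `window` tokens of each occurrence of
    `form`, which may consist of several words (e.g. 'he trabajado')."""
    target = form.lower().split()
    n = len(target)
    counts: Counter = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts

# Invented mini-corpus, for illustration only.
CORPUS = [
    "ayer trabajé mucho",
    "hoy he trabajado mucho",
    "ayer trabajé poco",
]
```

Comparing `context_profile(CORPUS, "trabajé")` with `context_profile(CORPUS, "he trabajado")` shows which neighbouring words are shared and which are specific to one form. On a real text collection, such differences are the raw material for a hypothesis about the opposition.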

Needless to say, the same goes a fortiori for judgement #f:

f. The structure of sentence_i is Y.

As remarked before, to do such things with an informant can only be an interim replacement for a semasiological analysis to be done by the linguist. In the best of cases, the investigator is a better linguist than his informant, and then it would be naive for him to expect the informant to do the linguist's work. Informants' analyses may be of heuristic value for the linguist, but they can never replace the linguist's analyses. And this is not a matter of the use of linguistic terminology. The level of reflection on linguistic data and the use of scientific methods are entirely the scientist's responsibility, which he cannot delegate.

¹ It was once thought that knowing the answer to such questions is an essential component of a native speaker's competence in his language. This is just not so; see Lehmann 2007.