To the extent that structure-driven collection of linguistic data is at all possible, it uses linguistic stimuli.

Exploiting a text corpus

The most important method of this type is the exploitation of a corpus of texts produced spontaneously or at least not targeted towards the solution of a particular linguistic (i.e. cognitive-communicative) problem. This method therefore presupposes the recording of a corpus of utterances and texts representing the various speech situations and genres that are traditional in the society.

Example sentences recorded for the purpose of illustrating the use of a lexical item, and thus typically displayed in the lexical entry of the item in question, are a rich source of grammatical data. For the purpose of extracting the grammatical constructions and formatives they contain, they have almost the status of spontaneous data, as they were not elicited with these grammatical features in mind, but show them by coincidence. If the description of the language comprises both a grammar and a dictionary, these will then share a set of examples, which may be an advantage in itself.

The texts are segmented, and the units are identified and classified by applying methods of substitution and permutation. Major class words are inflected through their various morphological categories in order to establish morphological paradigms, and transformations are applied to sentences in order to establish syntactic paradigms. In this way, the levels, categories, syntagmatic relations and paradigms of structural grammar are established, much as the structuralist schools of the first half of the 20th century, from L. Bloomfield to Ch. Hockett, used to teach.

Collecting roots

If the purpose is to establish the complete inventory of the roots of a language, the following method can achieve it under certain conditions. The main condition is that the phonotactics for roots be relatively simple and regular. Assume the formula for the composition of verb roots is C1VC2. With some simplification, this is actually the case in Colonial Yucatec Maya, where the method was applied (Swadesh et al. 1970). Then one generates the full set of possible verb roots by inserting each consonant of the system in the positions of C1 and C2 and each vowel in the position of V. If the language has 20 consonants and 5 vowels, this produces 20 x 20 x 5 = 2000 possible roots. (If phonotactic constraints are taken into account, there will actually be fewer.) One then reads out each element of the set to a native speaker and asks what it means. Of course, only a subset of the possible roots will be actual roots. However, under the conditions stated, the method guarantees a complete inventory of verb roots.
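The generation step can be sketched in Python as follows. This is only a minimal illustration, not the procedure actually used by Swadesh et al.: the phoneme lists, the function name candidate_roots and the sample constraint no_identical_consonants are all hypothetical, chosen to keep the example small.

```python
from itertools import product

# Hypothetical toy inventory; a real study would use the language's own phonemes.
CONSONANTS = ["p", "t", "k", "m", "n"]
VOWELS = ["a", "e", "i", "o", "u"]


def candidate_roots(consonants, vowels, is_allowed=lambda c1, v, c2: True):
    """Return every C1VC2 string that passes the phonotactic filter."""
    return [
        c1 + v + c2
        for c1, v, c2 in product(consonants, vowels, consonants)
        if is_allowed(c1, v, c2)
    ]


def no_identical_consonants(c1, v, c2):
    # Illustrative constraint only, not a claim about any real language.
    return c1 != c2


all_roots = candidate_roots(CONSONANTS, VOWELS)
filtered_roots = candidate_roots(CONSONANTS, VOWELS, no_identical_consonants)

print(len(all_roots))       # 5 * 5 * 5 = 125
print(len(filtered_roots))  # 5 * 5 * 4 = 100
```

With 20 consonants and 5 vowels, the unfiltered call would return the 2000 candidates mentioned above. Each phonotactic constraint passed as is_allowed can only shorten the list that has to be read out to the speaker.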

While the method is – always under the conditions stated – linguistically sound, it has some obvious drawbacks. Apart from the tediousness of the check, the main problem is a morphological one. The method produces bare roots. If the language is isolating, a native speaker may recognize a root in isolation. Otherwise, the morphology of the language must be very simple and regular, so that the researcher can generate a standard inflected form on the basis of the invented root and check that form instead. With more complicated morphology, the approach becomes infeasible.