Epistemological and practical perspective

The role of data in an empirical discipline may be viewed from two perspectives:

In an epistemological perspective, data are subservient to the epistemic interest pursued in the particular research, which is at the top of a teleonomic hierarchy. One first chooses an empirical problem to be solved. Second, one selects a theory as an appropriate framework for the solution sought. Only then does one decide which kind of data a solution can be based on. Finally, one selects a method to be applied to the data in order to find the solution.
In a practical perspective, data may preexist the research and demand scientific treatment. The purpose that such data may serve if properly processed then constitutes the epistemic interest. The data are then accepted just as they are. Finally, as in the epistemological perspective, a method is sought which leads from the data to the epistemic interest.

The two perspectives are characteristic of distinct research settings and, to a large extent, of different empirical disciplines.

The epistemological perspective is typically taken in the natural sciences. Data of the desired kind may be sought in nature or may be generated by the researcher, for instance by experiments.
The practical perspective is typically taken in the humanities (to the extent they are pursued as empirical disciplines). Relevant data may be products of human beings which are not replicable and are – maybe just for this reason – of scientific interest.

Linguistic research may take either perspective; correspondingly, the process of selection of data takes a different form:

If the object area is a living language and the epistemic interest is of a general nature, i.e. it is not bound up with the uniqueness of a set of data, then the researcher is free to determine the kind of data needed. He may select them from among the different kinds available. He can opt for written or oral data. He may also generate data, either natural data in the speech community or experimental data with speakers who serve as subjects.
The data may be preexistent. Sometimes they are unique, as when they are the extant corpus of an extinct language or the literary work produced by a deceased person. However, it also happens that the researcher does not have access to (members of) the speech community and must limit himself to available data.¹ In either case he is confronted with a closed set of data, so he cannot expand the database. He respects the uniqueness of the available data and tries to make the best of them. The possibility of selection depends on the size of the corpus. If it is very large, as it is in the case of languages like Latin, Ancient Greek and Sanskrit, the researcher will select a segment, e.g. the prose variety of the classical period. On the other hand, the corpus may consist of a manageable set of inscriptions or even only a grammar or dictionary produced at a time when the language was still spoken. Then the question of selection does not arise.

Since epistemology is not concerned with practical circumstances, the practical perspective is not systematically taken on these pages. It goes without saying that whenever these pages suggest best choices, the researcher may not be able to take them because of practical limitations.

Semantics-driven vs. structure-driven data collection

The significative part of the language system may be described in an onomasiological or semasiological perspective. This distinction may be applied to data collection only mutatis mutandis. One may distinguish between semantics-driven and structure-driven data collection. However, as long as no data are available, they may be produced in a semantics-driven approach, i.e. with the research question of how certain cognitive and communicative functions are fulfilled in the language. They cannot, however, actually be produced in a structure-driven approach, i.e. with the research question of what certain given forms and structures mean in the language, since such forms and structures are ex hypothesi not given.

The idea behind these two approaches to data collection is explained in the respective sections. Here it is to be noted that they are neatly complementary. This means not only that they lead to results that complement each other. More importantly, it means that they must be combined, because each of them pursued in isolation is incomplete and biased.

¹ Use of The Language Archive is a case in point.