In the first of two blogs, Jeff Aronson considers the etymology of the word “data” and grammatical aspects of its usages, with the intention of discussing who owns data and collections of data.
I was recently verbally accosted (the word is not too strong) by a professor of computing science who demanded to know whether the word “data” is singular or plural. When I suggested the latter he asserted otherwise so aggressively that contradiction seemed unwise.
Here I examine the question using etymology, grammar, and usage as my tools. I shall then examine the more important question of who owns data.
When investigating the origins of words one is often led back to what one might call the mother tongues, a range of proto-languages, such as Proto-IndoEuropean and Proto-Semitic. These languages are themselves not known from records but have been deduced from the patterns of languages that are known, tracing them back to the hypothesized originals. English words can most often be traced back to Proto-IndoEuropean roots.
To understand how this happens, consider the IndoEuropean root DHE, the so-called e-grade form of the root, meaning to set or put down, to make or shape. Because vowels change readily when words develop, the e-grade form can become an o-grade form, DHO, or a zero-grade form, DHƏ, in which the final vowel is replaced by a neutral vowel sound called a schwa, after the Hebrew vowel of that name. The schwa, represented by an inverted e (ə or Ə), typically occurs in weakly stressed syllables, like the final a in “data” (/ˈdɑːtə/). These various forms can also take prefixes and suffixes and may be doubled (technically known as reduplication), giving rise to a myriad of words from a single root. DHE, for example, gives deed and misdeed, DHO gives do, doing, and done, DHEM gives deem and theme, and DHOM gives doom and words ending -dom, like kingdom and leechdom; there are many more.
Now take the IndoEuropean o-grade root DO and its zero-grade form DƏ, which mean to give. In Sanskrit this gave rise to dadāmi and in Greek δίδωμι, both meaning I give. Note the reduplication of the root in both cases. This typically happens in verbs when the action is repeated; giving is supposed to be habitual. The Latin verb to give is dare (two syllables, with both vowels pronounced), which also reduplicates in the perfect tense as dedi, mimicking the repetition of a past action.
The past participle of the Latin verb dare is datum, meaning “given”, which then becomes a noun of neuter gender, meaning something that is given or is due to be given—a present, a debit, or a debt. The plural of “datum” is “data”. And that should be the end of it.
However, one should not be seduced by the etymology of a word (the etymological fallacy)—what it once meant does not necessarily tell you what it means now. English is not Latin and words mature with time. Consider, as an example, “agenda”. The Latin verb agere means to do, and its gerund, agendum, means something that needs to be or must be done. So agenda, the plural form, means things that need to be done. When the word was first used in English it implied a list of things to be done, a list of agenda, but with time the list of plural things just became a list called the agenda. Nowadays it can also mean a plan of some kind (as in a hidden agenda). Similarly, “stamina”, now a singular noun, arose from the plural of “stamen”, the thread of life spun by the Fates; the longer the thread the more stamina you had.
These analogies give insights into the problem, which depends on the difference between count nouns and non-count nouns:
- “agenda” is now a singular count noun (plural “agendas”);
- “stamina” is now a singular non-count noun (no plural).
So, is “data” a plural count noun (singular “datum”) or a singular non-count noun?
The grammatical problem in considering whether “data” is singular or plural arises from the fact that the singular form, datum, is generally used only in a technical sense to mean a baseline, benchmark, or reference point (as in datum level, datum line, datum mark, datum point). Although it can be used to mean a single piece of information, such usage is rare. On the other hand, “data” is used to mean either a whole lot of pieces of information (technically the plural of a count noun—one datum, many data) or a collection of such pieces (technically a non-count or mass noun—much data).
This is similar to the use of collective nouns, such as “board”, “cabinet”, or “government”, which are singular when they refer to a group but plural when they refer to the individual members of the group. Thus, when the Queen refers to “My Government” she uses the singular. Here is an example from her speech to Parliament in March 2018: “My Government is committed to peace in Northern Ireland …”. However, the plural would be appropriate in a sentence such as “The Government are at loggerheads over the question of Brexit.” If we regard “data” as a word of this type, we should use the plural when we have in mind some or all of the individual pieces of data (e.g. “some/all of the data suggest …”) and the singular when referring to the agglomeration (“en masse, the data suggests …”). Even in the latter case, however, the plural would be just as appropriate.
You can test whether you want to use the plural or singular by qualifying “data” with words such as “all” (“all the data are”), “many” (“there are many data”), and “much” (“much data supports”). Doing that will help you to decide whether you are thinking of the individual pieces of information or the whole collection or a discrete part of it.
My interlocutor from computing science, mentioned above, argued that “data” is obviously singular in compound nouns such as “database” and “databank”. His argument is flawed. There is a technical term for nouns that are formed by joining two nouns together; it is tatpurusha. The word is Sanskrit and literally means “his servant”, referring to the fact that the meaning of one part is subservient to the meaning of the other. A boathouse is a building in which boats are kept and a houseboat is a boat that functions as a dwelling; the subservience varies with the order of the words.
In tatpurusha the first element can be singular or plural, and the whole can refer to one or more objects. A boathouse can contain one boat or more than one; a clotheshorse is a frame on which clothes are hung. Some words, such as trousers, are singular in meaning but plural in form, and the quasi-singular form is often used in tatpurusha, as in trouser-press and trouser suit. When the first element has the same form in both singular and plural you can’t tell from the form alone which it is, but it generally refers to a plurality of objects. For example, you don’t expect to see just one sheep in a sheep-pen or one deer in a deer park. So you can’t tell whether the occurrence of “data” in “database” or “databank” is singular or plural; each is a collection of individual pieces of data, usually lots of them.
In my next blog, I shall consider how the word “data” is used in bioscience publications, whether as a plural or singular noun, and consider who has a claim to ownership of data and collections of data.
Jeffrey Aronson is Associate Editor BMJ EBM, consultant physician and clinical pharmacologist, and Fellow of CEBM
Conflict of Interest: none declared
Read more in the Word about evidence series