Talking data
Over the last decade or so, it has become commonplace to talk about data sets in relation to humanistic research. Whereas data sets once seemed to be intrinsically linked to and part of the natural sciences, humanists of various plumage now regularly create their own. Increasingly, humanities data sets contain not only quantitative data, such as the information amassed by economic historians, but qualitative data as well. The very process of capturing qualitative data enables scholars to study particular historical phenomena from a quantitative perspective too. Yet, I would argue, the availability of quantitative data should not replace our more ‘traditional’ methodologies, which are so finely attuned to studying and understanding qualitative phenomena.
For even when qualitative information becomes quantifiable, we should not discard the richness and depth of qualitative sources; it would make little sense to conflate the fluctuations in interest rates in the early modern period with the number of times wills of early modern Catholics did or did not invoke Christ, Mary, and the saints. Even though both phenomena are quantifiable and can be expressed in numbers, these numbers reveal entirely different dimensions of the historical past. Nor would it make sense to present the number of times poets like Joost van den Vondel or Shakespeare used the word ‘mother’ in one of their plays as a fact as such – for what does it tell us? The use and meaning of the word ‘mother’ can only be grasped when taking into account various factors, including the syntax of the sentence, the meaning of the words which surround it, and stylistic conventions. Having said that, quantitative approaches to primary historical sources of whatever kind can be very useful; ultimately, such approaches enable us to discern patterns that are not immediately obvious. Such patterns are often difficult to detect in traditional, analogue research environments. Yet the detection of such patterns should facilitate a movement ad fontes, offering a new perspective through which we can view and study our cherished primary sources. Indeed, we should strive to marry qualitative and quantitative analyses, opening up new dimensions and raising new research questions that are difficult to conceive of and pursue outside digital environments.
How does our own data set relate to all of the above? Twice during the first phase of AOR (2014–6) we numerically broke down our data set, both times in relation to the creation of internal reports for our funder, the Mellon Foundation. Both documents can be found here and here. One of the things we immediately realised when analysing the AOR data set is its small size: currently, even now that work on the AOR2 corpus has already commenced, all the XML transcriptions amount to less than 40 megabytes. This is certainly not a size which enables one to boldly walk into a conference room and shout “my data set is bigger than yours”. (As a side note, humanists should refrain from doing this anyway, since the data sets of virtually everyone working outside the humanities are larger than ours.) Although size matters to a certain extent (in our case: the larger the data set, the more transcriptions it includes, and the more users will be able to find and discover), what really matters is the actual data of which a data set consists and the way in which it is captured and structured, for this determines the ways in which we can interrogate our data and what we can get out of it.
The structure of our data reflects the various types of reader interventions we encountered in the books in our corpus, such as marginal annotations, symbols, marks, drawings, tables, and graphs. This division, which closely mirrors the actual annotation practices of the readers on which we focus, makes it possible for our users to search within and across reader interventions. The AOR search widget, the child of the heroic efforts of our programmers Mark Patton and John Abrahams, contains an advanced search functionality which makes it possible to create complex, query-based searches, a powerful way of interrogating the AOR data. The juxtaposition of AOR data (the transcriptions as well as the search results) to the digital surrogates of the annotated books facilitates an easy and intuitive movement ad fontes. In such a way, we can reap the fruits of working with humanities data while having the primary sources (or their digital surrogates) ready at hand.
Another way in which we can approach and, in a way, dissect our data set is through statistical analysis. Our thinking about the application of such an analysis started with rather mundane questions such as: “if Harvey mentions Caesar in a marginal note, which other words frequently appear next to it?” In order to streamline such an analysis, we decided to formulate a number of concept groups: thematic groups of words which relate to the same, often rather broad theme, such as war, king/kingship, mind, soul, body, action, et cetera (Harvey’s own system of astrological symbols, which denote more abstract concepts, was actually really helpful in designing these concept groups). The concept groups consist of words which together appear with a certain frequency (in order to yield statistically significant results) and include (equivalent) words of the two languages which dominate Harvey’s marginal notes, Latin and English. In generating these groups, we could make use of the lists of words and the frequency with which they appear, which are part of our recurrent data releases.
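To make the Caesar question a little more concrete, here is a minimal sketch in Python of how one might count the words that share a marginal note with a given word, and how a note could be assigned to concept groups. It is not the AOR tooling: the sample notes, the group definitions, and the function names are all invented for illustration, and the real transcriptions live in XML rather than in a plain list of strings.

```python
import re
from collections import Counter

# Hypothetical marginal notes, invented for illustration only;
# these are NOT actual AOR transcriptions.
notes = [
    "Caesar in bello semper celer",
    "the king must act as Caesar did in war",
    "animus et corpus",
    "Caesar and the mind of a king",
]

# Hypothetical concept groups mixing Latin and English words.
concept_groups = {
    "war": {"war", "bello"},
    "king": {"king", "rex"},
    "mind": {"mind", "animus"},
    "body": {"body", "corpus"},
}


def tokenize(note):
    """Lowercase a note and split it into alphabetic words."""
    return re.findall(r"[a-z]+", note.lower())


def cooccurring_words(notes, target):
    """Count, per word, the number of notes it shares with `target`."""
    counts = Counter()
    for note in notes:
        words = set(tokenize(note))
        if target in words:
            counts.update(words - {target})
    return counts


def groups_in_note(note, groups):
    """Return the names of all concept groups that have a word in the note."""
    words = set(tokenize(note))
    return {name for name, members in groups.items() if words & members}


caesar_counts = cooccurring_words(notes, "caesar")
print(caesar_counts.most_common(3))
print(groups_in_note(notes[2], concept_groups))
```

Counting each word once per note, rather than once per occurrence, is a design choice of this sketch: it answers "in how many notes do the two words meet?", which is the unit the concept groups work with as well.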
Although the formulation of concept groups was within our power, we quickly realised that none of us had mastered the specialist knowledge and skills needed to subject our data to rigorous statistical analysis. Hence we decided to employ some professional statisticians. Once the data was made ready for analysis, an interesting process in itself which I will address in a separate blog post, the statisticians started their work in earnest, mainly aiming to see whether there are any statistically significant correlations between concept groups. In other words, do we see words which are, for example, part of the concept group ‘mind’ often appearing with words which belong to the concept group ‘body’? This makes it possible to discern links between certain topics Harvey addressed throughout his marginal notes. Although at this point in time the results of the statistical analysis are tentative, partly because our data set is slightly lopsided due to books which focus on war and strategy (Livy, Frontinus, and Machiavelli), the insights one can gain from a statistical approach are already evident. Throughout AOR2 we will continue applying statistical analysis to our data, making use of the inclusion of the second reader, John Dee. As we envisaged, results from the data analysis, even when only partial, immediately force one to go back to the primary sources: even if there is a correlation between two or more concept groups, the individual instances of this correlation need to be tracked down and studied within the larger context of a marginal note or a set of marginal notes on a page in a particular book.
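For readers wondering what such a test might look like: the paragraph above does not say which method our statisticians used, so the following is only one common possibility, a chi-square test of independence on a 2×2 table of notes (contains group A or not, contains group B or not). The counts are made up for the sake of the example, and the function name is my own.

```python
import math


def chi_square_2x2(both, only_a, only_b, neither):
    """Chi-square test of independence for two concept groups.

    Arguments are counts of notes containing both groups, only group A,
    only group B, or neither. Returns (statistic, p_value) for one degree
    of freedom. Assumes no row or column total is zero.
    """
    n = both + only_a + only_b + neither
    with_a, without_a = both + only_a, only_b + neither
    with_b, without_b = both + only_b, only_a + neither

    observed = [both, only_a, only_b, neither]
    # Counts we would expect if the two groups were unrelated.
    expected = [
        with_a * with_b / n,
        with_a * without_b / n,
        without_a * with_b / n,
        without_a * without_b / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for 'mind' and 'body' across 100 notes (not AOR data).
chi2, p_value = chi_square_2x2(both=30, only_a=20, only_b=20, neither=30)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value only says that the two groups appear together more (or less) often than chance would suggest. It says nothing about why, which is exactly the point at which one has to return to the notes themselves.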
Lastly, interesting things can be done when relating or connecting our data set to other existing data sets that float around on the web. So far many data sets exist on their own, rather isolated from their peers. Over the last couple of years, the concept of Linked Open Data has rapidly become popular, as a motley crew of people comprising web architects, data curators, scholars, and scientists feel the need to link their data to that of others. This is not always a straightforward process, and in the future a separate blog post will be devoted to it. Regardless of the challenges, the possibilities of linked data have captivated the minds of the AOR team. Already during AOR1 we realised that readers were often moving outward, for instance by referring to other books, some of which are not in our digitized corpus or are simply no longer extant. Moreover, some of our books are classical, canonical texts, such as Livy’s History of Rome, of which digital editions, including translations, exist (Perseus), and it would be neat to see whether we can directly link to these editions. The concept of Linked Open Data also influences our thinking about the way we capture our data: how should we transcribe the astronomical data in the Dee annotations in such a manner that it facilitates an easy exchange with already existing astronomical data sets? These are just some examples of where establishing links with other data sets and digital corpora might be rewarding. Creating such links, in particular to other primary sources such as early modern annotated books, will therefore be one of the main activities of AOR2. For this moves us closer toward representing and recreating the intellectual cosmos and larger information culture of which early modern readers and their books were a fundamental part.