The challenge of … page numbers

Posted on November 13, 2018 By Jaap Geraerts

We have arrived in the last months of AOR, which is scheduled to come to an end in late January. This means that we are busy with getting the next version of our digital research environment ready. Part of the remaining work is very exciting and comprises the creation of a set of contextual documents on John Dee, his books that are in the AOR corpus, and his library. To that end, a humanities meeting was recently held at Princeton’s Institute of Advanced Studies (see Neil’s blog for more information). Necessary but unfortunately somewhat less exciting is the work that takes place in the trenches: checking transcriptions, hunting down bugs, and testing the viewer. This blog post addresses a problem which has haunted us over the course of AOR and which we had to face head on last month.

The problem is a rather mundane one: displaying the correct page numbers of a particular book in the AOR viewer. Getting this right proved to be labour intensive, while the problem itself shows the tension between the ways in which a (digitized) book as an object is viewed from distinct humanistic and technical perspectives. The problem emerged due to a combination of the particular technical infrastructure inherited from earlier projects conducted at Johns Hopkins University (JHU) and our transcription policy. Based on earlier work on the Roman de la Rose and the Christine de Pisan projects, the digital images that are ingested into the JHU archive are labelled according to the sequence of manuscript, starting with 1r, followed by 1v, and so forth. This folio numbering system both reflects the objects on which these projects focused, medieval manuscripts, as well as the conceptualization of objects within IIIF viewers as a sequence of images. However, anyone vaguely acquainted with early modern imprints will now that the page number system of these objects in general is much more complex because of the combination of page numbers and signatures, with some books having separate sequences of page numbers and/or signatures for individual sections.

As a result of the internal system of attributing page (i.e. folio) numbers, a mismatch between the page numbers visible on the digital images and the page numbers displayed in the bottom of the viewer emerged, as shown on the image below.

One way to overcome this hurdle was to include information about the page number and signature in the XML transcriptions generated for the project and to display this information in the viewer. We managed to get this working in the new version of the viewer – which currently is under construction and not yet publicly available – but quickly realised that this problem continued to persist in a number of cases. This was caused by our decision not to create XML files of all digital images, but only of those that contain one or more reader interventions, that is, any visible interaction of a reader with that page. Since at least around half of the pages in the books which are included in the AOR corpora are not annotated, XML files for these pages do not exist. As a result, for the pages which do not have an XML file associated with them, the viewer returns to its internal numbering system, creating an extremely awkward combination of page numbers and/or signatures and folio numbers (see below).

Several possible solutions existed, including the creation of a XML transcription for every digital image in the AOR corpora. This would mean having to create several thousand XML transcription which, although the transcriptions themselves would be small, would take in inordinate amount of time. The other, more feasible option was to create spreadsheets for every book in the AOR corpora which comprise information about the file name of the digital image and the information about page numbers and/or signatures (in the absence of the former) contained in the XML transcriptions. Luckily the junior programmer working on AOR, John Abrahams, was able to generate these spreadsheets, which meant that I ‘only’ had to enter manually the information regarding the page numbers of the digital images which do not have a transcription associated to them.

Nevertheless, I still had to go through 35 spreadsheets, check the information provided in them, and add/amend data where necessary. This process was further complicated by the mismatch between the file names of the images to which we refer in our transcriptions and the file names that are used internally in the JHU archive. These internal file names, created during the process of ingesting the digital images into the JHU archives, were provided in the spreadsheets. Based on the data in the transcriptions, I had to match the two sets of file names and then add the correct information to the spreadsheets. While being motivated by the thought that finishing this work would constitute an important step towards beatification, and spurred on by listening to football shows and a healthy dose of death metal, I did manage to populate all the spreadsheets. Although the work itself was everything but interesting, the final result is all the more pleasing. The information provided in the new version of the AOR viewer now matches the page numbers and/or signatures of the early modern imprints. Apart from avoiding confusion, this makes navigating these annotated books and doing research in the AOR digital environment much easier. Moreover, during the process of going through the spreadsheets, I encountered a couple of bugs (e.g. transcriptions being associated with the wrong images), constituting an additional fruit of this work. Apart from being a heroic story of sacrifice, perseverance, and the divine combination of football and metal, this blog shows the extent to which the spade work, done by both by computer engineers and humanists, forms the rock upon which this digital resource is built.

Reply Cancel reply

You must be logged in to post a comment.

The Archaeology of Reading

The challenge of … page numbers

Reply Cancel reply