Data Releases


A data release contains all archived data from the Archaeology of Reading project, with the exception of images. It also includes statistics derived from the transcriptions and from the history of work on them.

The data is arranged in a file-system hierarchy distributed as a BagIt package. Releases are numbered and named like where X is the number of the release.
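Because a release is a BagIt package, its payload files can be checked against the manifest shipped with the bag. The following is a minimal sketch, not the project's own tooling. It assumes the bag carries a manifest named manifest-sha256.txt; the digest algorithm a given release actually uses is not stated here, so it is a parameter. The function names parse_manifest and verify_bag are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse BagIt manifest text into a {relative_path: checksum} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each manifest line holds a checksum, whitespace, then the file path.
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum.lower()
    return entries


def verify_bag(bag_dir, algorithm="sha256"):
    """Return the manifest paths that are missing or fail their checksum.

    Assumption: the manifest is called manifest-<algorithm>.txt and sits
    in the top-level directory of the bag.
    """
    bag_dir = Path(bag_dir)
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    expected = parse_manifest(manifest.read_text(encoding="utf-8"))
    mismatches = []
    for rel_path, checksum in expected.items():
        file_path = bag_dir / rel_path
        if not file_path.is_file():
            mismatches.append(rel_path)
            continue
        digest = hashlib.new(algorithm, file_path.read_bytes()).hexdigest()
        if digest != checksum:
            mismatches.append(rel_path)
    return mismatches
```

An empty list from verify_bag means every file listed in the manifest is present and unchanged.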

Data format

The top-level directory contains data concerning the collection as a whole. A README file gives a short explanation of the release and contains a changelog. Spreadsheets are included in UTF-8 encoded CSV format. The XML schemas and DTDs that were used when creating the transcriptions are also included.

| └── (schemas/DTDs)
| ├──PrincetonRB16th12/
| | ├── (XML transcriptions) EX: PrincetonRB16th12.aor.002r.xml
| | ├── PrincetonRB16th12.description.xml
| | └── PrincetonRB16th12.images.csv
| ├──PrincetonRB16th11/
| | └── (book contents)
| ├──UclCastiglione1541/
| | └── (book contents)
| ├──Newberry27495/
| | └── (book contents)
| ├──FolgersHa2/
| | └── (book contents)
| ├──PrincetonK6233/
| | └── (book contents)
| ├──HoughtonSTC11402/
| | └── (book contents)
| ├──PrincetonPA6452/
| | └── (book contents)
| ├──PrincetonU101/
| | └── (book contents)
| ├──PrincetonPA8550/
| | └── (book contents)
| ├──PrincetonDL45/
| | └── (book contents)
| └──PrincetonPE1137/
| └── (book contents)
| ├──books.csv
| └──commits.csv



This spreadsheet, which is included only in the September 2018 and January 2019 data releases, contains all the data in the marginal notes. Column K gives the book ID. The corresponding book titles can be found in corpus.csv. Please note that corpus.csv is not part of the September 2018 data release; the most up-to-date version of the corpus spreadsheet is part of the final data release and can also be accessed via the downloads page (under ‘Authority files’). Column I gives internal image IDs. The files that map the internal image IDs to the corresponding pages in the AOR viewer can be accessed via the downloads page (under ‘Authority files’). Please also note that parts of the image IDs are visible in the URLs of the images in the AOR viewer.


The books spreadsheet lists books that are mentioned in annotations. The first column is used as the standard title of the book. The other columns contain title variants and bibliographic information.
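Since the first column holds the standard title and other columns hold variants, the spreadsheet can serve as a lookup from any variant to the standard form. The sketch below illustrates this under several assumptions: the file has no header row, and the caller knows which columns hold title variants (the variant_columns parameter is hypothetical; by default every other column is treated as a variant, which would also pick up bibliographic cells). The sample rows in the usage example are invented placeholders, not project data.

```python
import csv
import io


def build_title_lookup(csv_text, variant_columns=None):
    """Map each title variant (case-insensitively) to its standard title.

    variant_columns is a hypothetical parameter: the zero-based indexes of
    the columns that hold title variants. If None, every column after the
    first is treated as a variant column.
    """
    lookup = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not row[0].strip():
            continue
        standard = row[0].strip()
        columns = variant_columns if variant_columns is not None else range(1, len(row))
        names = [standard] + [row[i] for i in columns if i < len(row)]
        for name in names:
            name = name.strip()
            if name:
                # Keep the first mapping if a variant appears in several rows.
                lookup.setdefault(name.casefold(), standard)
    return lookup


# Invented placeholder rows, for illustration only.
sample = "Standard Title A,Variant A1,Variant A2\nStandard Title B,Variant B1,\n"
titles = build_title_lookup(sample)
```

The same approach would apply to any spreadsheet whose first column is a standard name.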


The corpus spreadsheet contains metadata about the books that are part of the project. In particular, the identifier column gives the identifier of the book in the archive.


The locations spreadsheet lists locations that are mentioned in annotations. The first column is used as the standard location name.


The people spreadsheet lists people who are mentioned in annotations. The first column is used as the standard person name.


The books subdirectory contains a directory for each book in the collection. Each book directory contains metadata about the book, transcriptions, and the list of images in the book. Each file is prefixed by the book identifier, which is also the name of the book directory. For example, the directory PrincetonK6233 refers to Princeton’s Paratitla and contains files like PrincetonK6233.description.xml.


The book description file contains metadata about the book in a custom XML format. A schema is in progress but not yet available.


The images spreadsheet contains an ordered list of page images. The first column is the image identifier, the second is the image width, and the third is the image height. The page images are in reading order.
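The three-column layout described above can be read with a few lines of Python. This is a sketch under assumptions: whether the file has a header row is not stated, so rows whose width or height is not an integer are skipped; the PageImage type and the sample image identifiers are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class PageImage(NamedTuple):
    image_id: str
    width: int
    height: int


def read_images(csv_text):
    """Return the page images in file order, which is the reading order."""
    pages = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue
        try:
            width, height = int(row[1]), int(row[2])
        except ValueError:
            # Skip a possible header row or any line without numeric sizes.
            continue
        pages.append(PageImage(row[0].strip(), width, height))
    return pages


# Hypothetical identifiers and dimensions, for illustration only.
sample = "BookID.001r.tif,3000,4000\nBookID.001v.tif,3010,4005\n"
pages = read_images(sample)
```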


The IMAGE portion of the name is the image identifier from the images spreadsheet without the initial prefix of the book identifier and file extension. These files contain detailed information about annotations on the corresponding page in a custom XML format. See the Transcribers Manual [PDF] for more information.
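The rule for the IMAGE portion can be written as a small function. This sketch assumes the transcription files follow the pattern BOOK_ID.aor.IMAGE.xml, inferred from the example PrincetonRB16th12.aor.002r.xml in the directory listing, and that an image identifier looks like BOOK_ID.IMAGE.EXT; the .tif extension in the usage example is an assumption, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def transcription_name(book_id, image_id):
    """Derive a transcription file name from an image identifier.

    Assumed pattern: <book_id>.aor.<IMAGE>.xml, where IMAGE is the image
    identifier without the book-identifier prefix and the file extension.
    """
    prefix = book_id + "."
    if not image_id.startswith(prefix):
        raise ValueError(f"{image_id!r} does not start with {prefix!r}")
    remainder = image_id[len(prefix):]
    # stem drops only the final extension, e.g. "002r.tif" becomes "002r".
    image_part = PurePosixPath(remainder).stem
    return f"{book_id}.aor.{image_part}.xml"


name = transcription_name("PrincetonRB16th12", "PrincetonRB16th12.002r.tif")
```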


The stats sub-directory contains statistics derived from our transcriptions as well as the history of our work on the transcriptions. It contains two sub-directories, each holding a different kind of data.

The stats/latest/ sub-directory contains data on the current state of the transcription data. The files in this directory are described below.

book_totals.csv collects data from other spreadsheets for convenience. This spreadsheet includes total counts for each annotation type and word counts for each book.

BOOK_ID.csv: There is one of these spreadsheets for each book in the corpus. It contains data similar to book_totals.csv, but broken down per page and restricted to a single book.

vocab_ANNOTATION-TYPE_LANGUAGE-CODE.csv: There is one such file for each combination of annotation type and language. Examples: vocab_marginalia_EN.csv, vocab_underline_LA.csv. These spreadsheets track the number of times words appear in the transcribed data.

Annotation types are: marginalia, mark, symbol, underline, drawing, errata, numeral

Language codes follow the two-letter ISO 639-1 standard. Common language codes in this project are: EN, EL, ES, FR, IT, LA.
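The vocab file naming scheme can be parsed mechanically. The sketch below splits a file name into annotation type and language code, accepting only the annotation types listed above and upper-case two-letter language codes as in the examples. The function and constant names are hypothetical.

```python
import re

# Annotation types as listed in this section.
ANNOTATION_TYPES = (
    "marginalia", "mark", "symbol", "underline", "drawing", "errata", "numeral",
)

VOCAB_NAME = re.compile(r"^vocab_(?P<annotation>[a-z]+)_(?P<language>[A-Z]{2})\.csv$")


def parse_vocab_name(file_name):
    """Return (annotation_type, language_code), or None if the name does not fit."""
    match = VOCAB_NAME.match(file_name)
    if match is None or match["annotation"] not in ANNOTATION_TYPES:
        return None
    return match["annotation"], match["language"]
```

For example, parse_vocab_name("vocab_marginalia_EN.csv") yields the pair ("marginalia", "EN"), while a name such as book_totals.csv yields None.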

The stats/history/ sub-directory contains data describing the history of the transcription data, that is, the state of the data repository at different points in time. The files in this directory are described below.

commits.csv: This spreadsheet gives details about each data commit to the repository. For each commit it records the author, the number of files touched, the time stamp, and the parent commit. The parent commit is useful for reconstructing the branching nature of the data repository.

books.csv: This spreadsheet contains annotation and word counts for each book at each commit. The commit IDs used in this spreadsheet can be linked to the corresponding commit in commits.csv.
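The parent commit recorded in commits.csv makes it possible to walk back through the history from any commit. The following sketch shows one way to do so. The column names "commit" and "parent" are assumptions, since the actual headers are not given here, and the sketch assumes at most one parent ID per row, with an empty cell for a commit that has no parent. The sample rows are invented.

```python
import csv
import io


def load_parents(csv_text, id_column="commit", parent_column="parent"):
    """Build a {commit_id: parent_id or None} map from commits.csv text.

    The column names are hypothetical defaults; adjust them to the headers
    found in the actual file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column]: (row[parent_column] or None) for row in reader}


def ancestry(parents, commit_id):
    """Follow parent links from commit_id back to the earliest known commit."""
    chain = []
    seen = set()
    current = commit_id
    # Stop at a root commit, an unknown ID, or a cycle in malformed data.
    while current is not None and current in parents and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parents[current]
    return chain


# Invented commit IDs, for illustration only.
sample = "commit,parent\nc3,c2\nc2,c1\nc1,\n"
parents = load_parents(sample)
history = ancestry(parents, "c3")
```

The resulting list of commit IDs could then be used to select the matching rows in books.csv and trace how the counts for a book changed over time.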

Data Release 4.0 (September 2018)

The structure and file names of this data release differ from those of the other data releases. This release was initially created for internal purposes, but because its data was used for the statistical analysis, we decided to make it available online. Apart from the different structure of the ZIP file, the major difference is in the file names, which are based on the internal book IDs. These IDs can be found in the corpus spreadsheet, which is part of the final data release and can also be accessed via the downloads page (under ‘Authority files’).