Data Releases


A data release contains all archived data from the Archaeology of Reading project with the exception of images. In addition, it includes statistics derived from the transcriptions and from the history of work on them.

The data is arranged in a file system hierarchy distributed as a BagIt package. Releases are numbered and named like where X is the number of the release.
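Because a release is a BagIt package, its payload can be checked against the checksums recorded in the bag. The following is a minimal sketch using only the Python standard library, not a tool shipped with the release. It assumes the bag has been unpacked to a directory, that the bag carries one or more manifest-<algorithm>.txt files as the BagIt convention describes, and that the algorithm named there is one Python's hashlib supports. The function name verify_bag is made up for this example.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return the payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    problems = []
    # BagIt payload manifests are named manifest-<algorithm>.txt.
    for manifest in sorted(bag.glob("manifest-*.txt")):
        algorithm = manifest.stem.split("-", 1)[1]  # e.g. "md5" or "sha256"
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Each line is "<checksum> <path relative to the bag root>".
            expected, rel_path = line.split(maxsplit=1)
            target = bag / rel_path
            if not target.is_file():
                problems.append(rel_path)
                continue
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                problems.append(rel_path)
    return problems
```

An empty list means every file listed in the manifests was found and matched its recorded checksum.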

Data format

The top-level directory contains data concerning the collection as a whole. A README file gives a short explanation of the release and contains a changelog. Spreadsheets are included in UTF-8 encoded CSV format. The XML schemas and DTDs that were used when creating the transcriptions are also included.

| └── (schemas/DTDs)
| ├──PrincetonRB16th12/
| | ├── (XML transcriptions) EX: PrincetonRB16th12.aor.002r.xml
| | ├── PrincetonRB16th12.description.xml
| | └── PrincetonRB16th12.images.csv
| ├──PrincetonRB16th11/
| | └── (book contents)
| ├──UclCastiglione1541/
| | └── (book contents)
| ├──Newberry27495/
| | └── (book contents)
| ├──FolgersHa2/
| | └── (book contents)
| ├──PrincetonK6233/
| | └── (book contents)
| ├──HoughtonSTC11402/
| | └── (book contents)
| ├──PrincetonPA6452/
| | └── (book contents)
| ├──PrincetonU101/
| | └── (book contents)
| ├──PrincetonPA8550/
| | └── (book contents)
| ├──PrincetonDL45/
| | └── (book contents)
| └──PrincetonPE1137/
| └── (book contents)
| ├──books.csv
| └──commits.csv



The books spreadsheet lists books that are mentioned in annotations. The first column is used as the standard title of the book. The other columns contain title variants and bibliographic information.


The corpus spreadsheet contains metadata about the books which are part of the project. In particular the identifier column gives the identifier of the book in the archive.


The locations spreadsheet lists locations that are mentioned in annotations. The first column is used as the standard location name.


The people spreadsheet lists people that are mentioned in annotations. The first column is used as the standard person name.


The books subdirectory contains a directory for each book in the collection. Each book directory contains metadata about the book, transcriptions, and the list of images in the book. Each file name is prefixed by the book identifier, which is also the name of the book directory. For example, the directory PrincetonK6233 refers to Princeton’s Paratitla and contains files like PrincetonK6233.description.xml.


The book description file contains metadata about the book in a custom XML format. A schema is in progress, but not yet available.


The images spreadsheet contains an ordered list of page images. The first column is the image identifier, the second is the image width, and the third is the image height. The page images are in reading order.
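The three-column layout described above can be read with a few lines of Python. This is a minimal sketch: the names PageImage and read_images are invented for the example, and because the source does not say whether the file has a header row, rows whose width or height is not a whole number are simply skipped.

```python
import csv
from typing import NamedTuple


class PageImage(NamedTuple):
    identifier: str
    width: int
    height: int


def read_images(csv_path):
    """Read an images spreadsheet into a list of PageImage, in reading order."""
    pages = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and any header row (assumption: a header,
            # if present, has non-numeric width and height cells).
            if len(row) < 3:
                continue
            width, height = row[1].strip(), row[2].strip()
            if not (width.isdigit() and height.isdigit()):
                continue
            pages.append(PageImage(row[0].strip(), int(width), int(height)))
    return pages
```

Since the spreadsheet is already in reading order, the position of an entry in the returned list gives the position of the page in the book.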


The IMAGE portion of the name is the image identifier from the images spreadsheet without the initial prefix of the book identifier and file extension. These files contain detailed information about annotations on the corresponding page in a custom XML format. See the Transcribers Manual [PDF] for more information.
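The rule above for deriving IMAGE can be sketched as a small function. Two things here are assumptions rather than documented facts: the BOOK_ID.aor.IMAGE.xml naming pattern is inferred from the example PrincetonRB16th12.aor.002r.xml in the directory listing, and the image identifier used in the usage line (with a .tif extension) is hypothetical. The function name transcription_name is likewise invented.

```python
def transcription_name(book_id, image_id):
    """Build a transcription file name from a book and image identifier."""
    prefix = book_id + "."
    name = image_id
    # Remove the leading book identifier, if present.
    if name.startswith(prefix):
        name = name[len(prefix):]
    # Remove the file extension (the text after the last dot), if present.
    if "." in name:
        name = name.rsplit(".", 1)[0]
    # Assumed pattern, inferred from PrincetonRB16th12.aor.002r.xml.
    return f"{book_id}.aor.{name}.xml"


# Hypothetical image identifier, for illustration only.
print(transcription_name("PrincetonRB16th12", "PrincetonRB16th12.002r.tif"))
# prints PrincetonRB16th12.aor.002r.xml
```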


The stats sub-directory contains statistics derived from our transcriptions as well as the history of our work on the transcriptions. It contains two subdirectories, latest and history.

The stats/latest/ sub-directory contains data regarding the current state of the transcription data. The files in this folder are described below.

book_totals.csv – Collects data from other spreadsheets for convenience. This spreadsheet includes total counts for each annotation type and word counts for each book.

BOOK_ID.csv – There is one of these spreadsheets for each book in the corpus. It contains data similar to book_totals.csv, but accumulates the counts per page and is restricted to a single book.

vocab_ANNOTATION-TYPE_LANGUAGE-CODE.csv – There is one such file for each combination of annotation type and language, for example vocab_marginalia_EN.csv or vocab_underline_LA.csv. These spreadsheets track the number of times each word appears in the transcribed data.

Annotation types: marginalia, mark, symbol, underline, drawing, errata, numeral

Language codes follow the two-letter ISO 639-1 standard. Common language codes in this project are: EN, EL, ES, FR, IT, LA.
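The vocab file naming scheme can be taken apart mechanically, for example to group the files by annotation type or language. The sketch below is illustrative only; the function name parse_vocab_name is invented, and the pattern assumes the annotation types are written in lower case exactly as listed above.

```python
import re

# Annotation types as listed in this document.
ANNOTATION_TYPES = {
    "marginalia", "mark", "symbol", "underline",
    "drawing", "errata", "numeral",
}

# vocab_ANNOTATION-TYPE_LANGUAGE-CODE.csv, with a two-letter language code.
VOCAB_NAME = re.compile(r"^vocab_(?P<type>[a-z]+)_(?P<lang>[A-Za-z]{2})\.csv$")


def parse_vocab_name(filename):
    """Return (annotation_type, language_code), or None if not a vocab file."""
    match = VOCAB_NAME.match(filename)
    if match is None or match.group("type") not in ANNOTATION_TYPES:
        return None
    return match.group("type"), match.group("lang").upper()


print(parse_vocab_name("vocab_marginalia_EN.csv"))  # prints ('marginalia', 'EN')
```

Files that do not follow the scheme, such as book_totals.csv, yield None.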

The stats/history/ sub-directory contains data describing the history of the transcription data, that is, the state of the data repository at different points in time. The data files in this directory are described below.

commits.csv – a spreadsheet giving details about a particular data commit to the repository. The author of the commit is described along with the number of files touched, the time stamp, and the parent commit. The parent commit is useful for reconstructing the branching nature of the data repository.

books.csv – this spreadsheet contains annotation and word counts for each book during each commit. The commit IDs used in this spreadsheet can be linked to the specific commit in commits.csv.
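As an illustration of how the parent commit can be used, the sketch below loads commits.csv and follows parent links back from a given commit. The column names commit_id and parent are hypothetical, since the actual headers are not given here, so they are passed as parameters. The sketch also assumes the file has a header row and records a single parent per commit.

```python
import csv


def load_commits(path, id_column="commit_id"):
    """Index the rows of commits.csv by commit ID (column name is assumed)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[id_column]: row for row in csv.DictReader(handle)}


def ancestry(commits, commit_id, parent_column="parent"):
    """Follow parent links from commit_id back to the earliest known commit."""
    chain = []
    seen = set()
    # Stop at a commit whose parent is not recorded, or on a repeated ID.
    while commit_id in commits and commit_id not in seen:
        seen.add(commit_id)
        chain.append(commit_id)
        commit_id = commits[commit_id][parent_column]
    return chain
```

The same dictionary can be used to look up the commit details for any row of the history books.csv by its commit ID.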


Data Release 1, November 24 2015: