The DLL Catalog and Technological Innovation

Machine-Assisted Searching and Cataloging

The process of searching for Latin texts and their associated information and processing them for the catalog is labor-intensive and time-consuming. That is why one of the research endeavors of the DLL project is the development of tools and methods for leveraging the power of machines and the best practices of Linked Open Data.

For example, creating an authority record manually takes a significant amount of time, since the information must be gathered from several different internet resources. We have developed a number of methods and techniques for automating the process, with a view to balancing the speed and capacity of computers against the need to maintain control over the quality of the information. The first step in building an author authority record is now to search for the author’s record in the Virtual International Authority File (https://viaf.org/). The VIAF ID and permalink are added to a list of items to be processed by scripts that scrape data from the various fields on a VIAF page. Depending on the author, that process can return a significant amount of information, including birth and death dates and geographical information embedded in the authorized name form (e.g., “Paulinus, of Nola, Saint, approximately 353-431”). Further processing can extract that information and insert it into the appropriate fields. In this way, we can process dozens of records in a fraction of the time it would take to research and develop each record manually. Even so, humans with expertise in the subject still need to review the work and correct any inaccuracies before the records can be published and made available in the catalog.
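As an illustration of the kind of further processing described above, the following is a minimal sketch in Python, not the DLL's actual script. It assumes that the authorized name form is a comma-separated string ending in an optional "approximately" and a birth-death range, and that a place name is introduced by "of". The names `parse_authorized_name` and `ParsedName` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern: a trailing ", [approximately ]BIRTH-DEATH" at the end of the heading.
DATES = re.compile(
    r",\s*(?P<circa>approximately\s+)?(?P<birth>\d{1,4})?\s*-\s*(?P<death>\d{1,4})?\s*$"
)


@dataclass
class ParsedName:
    name: str
    place: Optional[str]
    birth: Optional[int]
    death: Optional[int]
    approximate: bool


def parse_authorized_name(heading: str) -> ParsedName:
    """Split an authorized name form into name, place, and dates."""
    birth = death = None
    approximate = False
    match = DATES.search(heading)
    if match:
        approximate = match.group("circa") is not None
        if match.group("birth"):
            birth = int(match.group("birth"))
        if match.group("death"):
            death = int(match.group("death"))
        # Drop the date portion before splitting the rest on commas.
        heading = heading[: match.start()]
    parts = [p.strip() for p in heading.split(",") if p.strip()]
    name = parts[0] if parts else ""
    # Treat the first "of ..." element after the name as geographical information.
    place = next((p[3:] for p in parts[1:] if p.startswith("of ")), None)
    return ParsedName(name, place, birth, death, approximate)


record = parse_authorized_name("Paulinus, of Nola, Saint, approximately 353-431")
print(record)
```

A sketch like this only handles the simplest cases. Headings with BCE dates, "active" ranges, or other qualifiers would fall outside the assumed pattern, which is one reason the expert review described above remains necessary.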

We are also experimenting with deep learning techniques to speed up the reconciliation of individual records with their author and work authority records in the catalog.

The goal of this project is to speed up the process of adding information to the catalog so that it will be of ever more use. As with every other project of the DLL, the code is available for reuse under an open license.

Open Source

The DLL Catalog operates entirely on open source technology. The content management system that runs the site at https://catalog.digitallatin.org is Drupal 10. Apache Solr drives the search feature. The tools we have developed for processing and preparing data for the catalog are available in the DLL's GitHub repository.

Linked Open Data

All of the data for the DLL Catalog are available for download in JSON-LD and CSV formats, along with the scripts for serializing them.

All of the individual author authorities, work authorities, item records, and web page records are available in a dynamically generated JSON-LD format. Simply append "?format=json-ld" to the end of an item's URL (e.g., https://catalog.digitallatin.org/dll-author/A4830?format=json-ld).
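For readers who want to retrieve records programmatically, here is a minimal sketch in Python using only the standard library. The function names `jsonld_url` and `fetch_jsonld` are hypothetical, and the sketch assumes that the server returns a UTF-8 JSON document for the "?format=json-ld" URL; the structure of the returned data is not assumed.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def jsonld_url(item_url: str) -> str:
    """Return the item's URL with the format=json-ld query parameter set."""
    parts = urlsplit(item_url)
    # Keep any other query parameters, but replace an existing "format" value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "format"]
    query.append(("format", "json-ld"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_jsonld(item_url: str, timeout: float = 10.0):
    """Download and parse the JSON-LD form of a record (requires network access)."""
    with urlopen(jsonld_url(item_url), timeout=timeout) as response:
        # Assumption: the response body is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


print(jsonld_url("https://catalog.digitallatin.org/dll-author/A4830"))
```

Calling `fetch_jsonld("https://catalog.digitallatin.org/dll-author/A4830")` would then return the parsed record, provided the assumption about the response holds.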
