Location
Cook Library 206Z & Room A
Presentation Type
Short Concurrent Session
Start Date
28-4-2023 3:00 PM
Description
There are many points of deposit for Columbia University's digital repository, Academic Commons. Content may be added by Columbia affiliates through a self-deposit form, by library staff through the cataloging backend (Hyacinth), and via SWORD deposit from entities such as library-hosted OJS, journal publishers, and others. As one might expect, after fifteen years of additions through these various channels, duplication happens! When faced with a corpus of nearly 40,000 records that must be reviewed, with duplicates remediated in three separate systems, how does one even start? This presentation will detail our approach to defining and scoping this problem, as well as the project workflows and technical solutions we used to remediate approximately 300 duplicate item records and 600 associated asset records. The presentation will include a brief walkthrough of the custom Python program we developed to query our data and export meaningful subsets for remediation. (Technologies: Fedora, Solr, Rails, Python, DataCite)
Duplicates in the Repository: Remediation and Reconciliation in Three Systems, Including DataCite