Location

Cook Library 206Z & Room A

Presentation Type

Short Concurrent Session

Start Date

28-4-2023 3:00 PM

Description

There are many points of deposit for Columbia University's digital repository, Academic Commons. Content may be added by Columbia affiliates through a self-deposit form, by library staff through the cataloging backend (Hyacinth), and via SWORD deposit from entities such as library-hosted OJS, journal publishers, and others. As one might expect, after fifteen years of additions through these various channels, duplication happens! When faced with a corpus of nearly 40,000 records that must be reviewed, with duplicates remediated in three separate systems, how does one even start? This presentation will detail our approach to defining and scoping this problem, as well as the project workflows and technical solutions we utilized to remediate approximately 300 duplicate item records and 600 associated asset records. The presentation will include a brief walkthrough of the custom Python program developed and used to query our data and export meaningful subsets for remediation. (Technologies: Fedora, Solr, Rails, Python, DataCite)

Duplicates in the Repository: Remediation and Reconciliation in Three Systems, Including DataCite
