{"componentChunkName":"component---src-templates-story-js","path":"/stories/2026-05-10-using-data-science-to-identify-future-collections-for-big-ten-open-books/","result":{"data":{"markdownRemark":{"html":"<p>The data refinement strategy executed by <a href=\"https://www.linkedin.com/in/yuruc\">biomedical engineering PhD candidate Yuru Chen</a> used data science and machine learning to transform messy bibliographic records into a strategic roadmap for future collections. It consisted of five main steps.</p>\n<ul>\n<li><strong>Multidimensional Deduplication:</strong> Before new subjects could be identified, the data required rigorous cleaning. The workflow analyzed 144 different types of identifiers, including title, author, ISBN, OCLC, and LCCN, using fuzzy matching logic to identify unique works. By applying thresholds for title similarity (>0.95) and author similarity (>0.86), the project condensed the massive dataset into a clean list of approximately 36,000 unique records.</li>\n<li><strong>Filling the Metadata Gaps with LLMs:</strong> Approximately 6.3% of the records (over 2,300 titles) were missing both LC and Dewey call numbers. The project integrated a Large Language Model to analyze these titles and descriptions, assigning them to appropriate standardized subject classifications and call number categories. This ensured that recent or under-cataloged works were not excluded from discovery.</li>\n<li><strong>The AI \"Scoring Engine\":</strong> A custom rule-based scoring engine was developed to classify titles into potential collections automatically. This tool combined three dimensions of analysis: a normalized keyword scan of titles and descriptions against subject-specific keyword lists, prefix matching against Library of Congress (LC) classifications, and prefix matching against Dewey Decimal classifications. It also incorporated embedding-based similarity scores.
Titles received cumulative points for every match across these dimensions, allowing the project to rank their relevance to specific themes.</li>\n<li><strong>Semantic Discovery via Embedding Systems:</strong> To move beyond simple keyword matching, the project implemented an embedding system. This AI technique maps the semantic meaning of words, allowing the system to recognize that a book about \"carbon footprint\", \"renewable energy\", or \"sustainability\" belongs in the \"Environment\" collection even when those exact theme keywords aren't in the metadata.</li>\n<li><strong>Author Identification:</strong> To further enrich these collections with scholarly context, a specialized software tool was developed to harvest missing author metadata and institutional affiliations from multiple sources, including library authority records, ORCID, and institutional repositories. This tool identifies author names, resolves institutional affiliations across multiple variants and historical changes, and creates linked authority records (ISNI/VIAF) where available. This ensures that the scholars behind these works are properly recognized.</li>\n</ul>\n<p>This automated workflow fundamentally changed the scale of BTOB’s collection building. The discovery process shifted from identifying individual books to mapping 74 new thematic areas, each containing nearly 500 candidate titles that will be whittled down to 100 for each final collection. The top 14 of these will now be presented to the participating Big Ten university presses for their final title selection.</p>\n<p>The data-driven analysis also refined and expanded the \"candidate list\" for the three upcoming collections.
While each final collection will feature only 100 curated titles, we now have more to choose from:</p>\n<ul>\n<li>Health Disparities and Disability Culture (491 titles)</li>\n<li>African-, Asian-, and Hispanic-American Experiences (499 titles)</li>\n<li>Human Environmental Impact (495 titles)</li>\n</ul>\n<p>Additionally, the project mapped titles in a press-by-theme matrix and identified which themes have the highest scholarly concentration among the Big Ten presses. \"Labor and Work,\" \"Popular Music,\" and \"Theater and Film\" rose to the top.</p>\n<p>By integrating AI and data science into the collection development workflow, Big Ten Open Books is augmenting the expertise embedded in Big Ten Academic Alliance libraries and presses so that it can act at scale and overcome the limitations of siloed data. Not only does this work identify exciting new opportunities for BTOB collections, but it may also provide a helpful model for organizations looking to revitalize scholarly backlists and make high-quality research more accessible.</p>","fields":{"storyImage":"/assets/image_432a2bad.png"},"frontmatter":{"title":"Using Data Science to Identify Future Collections for Big Ten Open Books","summary":"Thanks to University of Michigan Rackham Graduate School doctoral intern Yuru Chen, we have applied a data science and machine learning workflow to sort over 500,000 book records exported from Big Ten libraries to identify in-demand subject areas for future Big Ten Open Books collections. These topic areas are timely and have a \"critical mass\" of at least 500 candidate titles each. Contributing Big Ten presses will identify 100 titles in each subject collection that can find a new audience through open access.","date":"May 10th, 2026"}}},"pageContext":{"id":"1cd0c351-3837-507e-a6f5-27d88e9c7972"}},"staticQueryHashes":["1184796180","63159454"]}