Investigating Properties of Mined Programs from Science Publications
Location
CELA & Mary Church Terrell Library, First Floor
Document Type
Poster - Open Access
Start Date
4-25-2025 12:00 PM
End Date
4-25-2025 2:00 PM
Research Program
Partially funded by NSF Grant SES-2326175.
Abstract
In many scientific fields, writing software is an integral part of the research process. Software developed for scientific research has unique properties that differentiate it from code written by professional software engineers. Existing studies focus on how scientists write code, but to our knowledge, there are no analyses of the code itself. We introduce a dataset of public, permissively licensed software repositories sourced from papers published by the predominantly biology-focused, open-access publisher PLOS. We employ software mining techniques to extract all possible links to software from each paper, and we are currently curating the links that point to GitHub repositories. We present preliminary results on the attributes of these repositories, including licensing and choice of programming language, by paper and by research area. By sharing this dataset, we hope to facilitate further research on how scientists develop, use, and share software.
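The link-extraction step described in the abstract could look roughly like the sketch below. It is illustrative only and is not the authors' pipeline: the regular expression, the function name `extract_github_repos`, and the normalization rules (lowercasing, stripping a trailing period or `.git` suffix) are all assumptions made for this example.

```python
import re

# Hypothetical pattern (an assumption, not the poster's method): matches
# github.com/<owner>/<repo>, with or without a URL scheme or "www." prefix.
GITHUB_REPO = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_github_repos(text: str) -> list[str]:
    """Return unique, lowercased 'owner/repo' slugs in order of first appearance."""
    seen: dict[str, None] = {}
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")      # drop a sentence-ending period
        if repo.endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        if not repo:
            continue
        seen.setdefault(f"{owner}/{repo}".lower(), None)
    return list(seen)


sample = (
    "Code is available at https://github.com/Example-Lab/cell_sim. "
    "See also github.com/example-lab/cell_sim.git and "
    "http://www.github.com/other/tool-2."
)
repos = extract_github_repos(sample)
print(repos)
```

Deduplicating on a normalized slug means a repository that is cited several times in one paper, in different URL forms, is counted once. A real pipeline would also need to handle links to other hosts and links that no longer resolve, which this sketch does not attempt.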
Keywords:
Mining software repositories, Open source software, Scientific programming
Recommended Citation
Khatri Nelson, Ajai; Nguyen, Trung; Kim, Huyen; and Feldman, Molly Q., "Investigating Properties of Mined Programs from Science Publications" (2025). Research Symposium. 10.
https://digitalcommons.oberlin.edu/researchsymp/2025/posters/10
Major
Computer Science
Mathematics
Project Mentor(s)
Molly Q Feldman, Computer Science
Notes
Presenters: Ajai Khatri Nelson and Trung Nguyen