Investigating Properties of Mined Programs from Science Publications

Location

CELA & Mary Church Terrell Library, First Floor

Document Type

Poster - Open Access

Start Date

4-25-2025 12:00 PM

End Date

4-25-2025 2:00 PM

Research Program

Partially funded by NSF Grant SES-2326175.

Abstract

In many scientific fields, writing software is an integral part of the research process. Software developed for scientific research has unique properties that differentiate it from code written by professional software engineers. Existing studies focus on how scientists write code, but to our knowledge, there are no analyses of the code itself. We introduce a dataset of public, permissively licensed software repositories sourced from papers published by PLOS, an open-access publisher focused predominantly on biology. We employ software mining techniques to extract all possible links to software from each paper; at present, we curate only the links that point to GitHub repositories. We present preliminary results on the attributes of the repositories, including licensing and choice of programming language, by paper and by research area. By sharing this dataset, we hope to facilitate further research on how scientists develop, use, and share software.
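
As a rough illustration of the link-extraction step described in the abstract, the sketch below pulls GitHub repository references out of the plain text of a paper. It is not the authors' pipeline: the regular expression, the function name `extract_github_repos`, and the normalization rules (lowercasing, dropping a trailing `.git` or period) are all assumptions made for this example.

```python
import re

# Assumed pattern: github.com/<owner>/<repo>, with or without a URL scheme.
GITHUB_REPO = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_github_repos(text):
    """Return the sorted, de-duplicated 'owner/repo' slugs found in text."""
    repos = set()
    for owner, repo in GITHUB_REPO.findall(text):
        # Drop sentence punctuation that the pattern may have swallowed.
        repo = repo.rstrip(".")
        # Treat clone URLs and browser URLs as the same repository.
        if repo.endswith(".git"):
            repo = repo[:-4]
        if repo:
            # GitHub owner and repository names are case-insensitive.
            repos.add(f"{owner.lower()}/{repo.lower()}")
    return sorted(repos)


if __name__ == "__main__":
    sample = (
        "Code is available at https://github.com/Example-Lab/cell-sim. "
        "See also github.com/example-lab/cell-sim.git and "
        "http://www.github.com/other/tool_v2)."
    )
    print(extract_github_repos(sample))
```

A real mining pass would likely also need to handle links to other hosts and links split across line breaks in the extracted text, which this sketch ignores.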

Keywords

Mining software repositories, Open source software, Scientific programming

Notes

Presenters: Ajai Khatri Nelson and Trung Nguyen

Major

Computer Science
Mathematics

Project Mentor(s)

Molly Q Feldman, Computer Science

2025

This document is currently not available here.
