SONAD: An Approach for Automated Software Name Disambiguation in Scientific Publications

Accurately identifying software mentioned in academic papers and linking it to the corresponding software repository is critical for reproducibility, citation accuracy, and effective knowledge management. However, naming ambiguities and incomplete citations make it challenging to identify the correct repository URL for a software reference. This work addresses the problem of software mention disambiguation. It begins with experiments on the metadata available in the context of a software mention, such as the software name, keywords, synonyms, and the authors of the paper containing the mention, and on different similarity measures. It then develops a supervised machine learning approach that leverages metadata extracted from scientific publications and candidate repositories from GitHub, PyPI, and CRAN. The approach involved collecting software mentions from a previous CZI hackathon and from additional sampling of CZI data, extracting related metadata such as paper author names, keywords, synonyms of the software mention, and the programming language from the surrounding paragraph, and retrieving candidate URLs from the relevant repositories. Similarity measures between paper metadata and repository metadata, including string-based and embedding-based techniques, were computed and then used as features in a supervised classification model that predicts whether a candidate URL is the correct match. The model achieved a precision of 82%, a recall of 89%, and an F1 score of 85%, highlighting its effectiveness in linking software mentions to the appropriate repositories. Comparative experiments with large language models (llama-3.1-8b-instant, qwen-qwq-32b, gemma2-9b-it, and deepseek-r1-distill-llama-70b) that used raw metadata without similarity computations yielded lower precision and F1 scores, despite high recall. Specifically, the best-performing LLM (gemma2-9b-it) obtained a precision of 68%, a recall of 97%, and an F1 score of 80%. Thus, the proposed similarity-based supervised approach outperformed all evaluated LLMs, emphasizing the effectiveness of explicit similarity metrics for accurate software disambiguation. These results demonstrate that metadata-based supervised approaches can effectively resolve software name ambiguities, improving scholarly communication and software discoverability in research contexts. The final product of this work is available as a Python package on PyPI; the full source code is available on GitHub and on Zenodo.
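To make the feature-construction step concrete, the following is a minimal sketch of how string-based similarities between a mention's metadata and a candidate repository's metadata could be turned into a feature vector for a classifier. It is not the SONAD implementation: the dictionary keys (`name`, `authors`, `maintainers`, `keywords`, `topics`, `language`), the function names, and the choice of `difflib` and Jaccard overlap are all assumptions made for illustration, and the embedding-based measures mentioned in the abstract are omitted. A small helper also recomputes F1 from the reported precision and recall.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Jaccard overlap between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def feature_vector(mention, candidate):
    """Build similarity features for one (mention, candidate URL) pair.

    The field names used here are hypothetical, not SONAD's schema.
    """
    # Best pairwise match between paper authors and repository maintainers.
    author_sim = max(
        (name_similarity(x, y)
         for x in mention["authors"]
         for y in candidate["maintainers"]),
        default=0.0,
    )
    keyword_sim = jaccard(
        {k.lower() for k in mention["keywords"]},
        {t.lower() for t in candidate["topics"]},
    )
    same_language = float(
        mention["language"].lower() == candidate["language"].lower()
    )
    return [
        name_similarity(mention["name"], candidate["name"]),
        author_sim,
        keyword_sim,
        same_language,
    ]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical example pair, invented for illustration only.
mention = {
    "name": "scikit-learn",
    "authors": ["A. Researcher"],
    "keywords": ["machine learning", "classification"],
    "language": "Python",
}
candidate = {
    "name": "scikit-learn",
    "maintainers": ["A. Researcher", "B. Developer"],
    "topics": ["machine learning", "python"],
    "language": "python",
}
features = feature_vector(mention, candidate)
```

Vectors like `features` would be stacked into a matrix, one row per candidate URL, and passed to any binary classifier together with match/no-match labels. The helper reproduces the reported scores: `f1_score(0.82, 0.89)` rounds to 0.85 and `f1_score(0.68, 0.97)` rounds to 0.80.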
