Updating a Spring data extraction and analysis pipeline


Understanding human diseases and discovering new therapeutic solutions remain central challenges in biomedical research. Diseases are not caused by isolated factors, but by complex networks of genes, proteins, and phenotypes interacting in dynamic systems. Traditional research methods, which often analyse isolated elements, are insufficient for capturing this complexity. In response, the field of network medicine has emerged, using systems-based models to explore disease relationships and support drug repurposing (DR), a promising strategy that identifies new uses for existing drugs. Within this context, the DISNET (DISease understanding and drug repurposing through complex NETworks) platform was developed to construct a multilayer biomedical knowledge base. It integrates data from structured databases and unstructured online sources, including Wikipedia and Mayo Clinic, using Natural Language Processing (NLP) techniques (specifically, the MetaMap tool) to extract and validate phenotypic information. This phenotypic layer is then connected with biological and pharmacological layers, enabling researchers to explore disease relationships and DR hypotheses through a public API and web interface. However, despite its scientific relevance, the DISNET system faced technical issues by early 2023. Its core medical text extraction pipelines had stopped functioning, and the platform relied on outdated technologies, including Java 8 and early Spring Boot versions. The microservices architecture, with 16 Docker containers requiring independent configuration, had become unmanageable for the current small development team of one or two active developers. Additionally, the web interface was partially non-functional, and the system’s documentation was obsolete. These issues compromised the platform’s reliability and functionality. This Master’s Thesis addresses these problems through the restoration and improvement of the DISNET platform.
The primary goals were to restore the system’s core functionality, reduce technical complexity, and improve maintainability. Specific objectives included reactivating the Medical Term Extraction Process (MTEP) pipelines for Wikipedia and Mayo Clinic, upgrading the software stack, consolidating the system’s databases, and redesigning the web interface. An in-depth analysis of the DISNET architecture and design was conducted to understand the system’s complexity and to develop an improvement plan. The results of this work include a fully operational and simplified DISNET system. The extraction workflows now run reliably on the unified infrastructure, the number of containers has been reduced for ease of deployment, and the web interface has been improved with updated content. These improvements not only ensure that the system can be maintained by a small development team but also restore its role as a valuable, open-access research tool. With accurate and up-to-date phenotypic data, DISNET once again enables researchers to explore disease similarities, track knowledge evolution over time, and generate novel DR hypotheses.
