An Embedding-Based Machine Learning Solution for Medical Concept Mapping
Barros, V., Paradinha, R., Almeida, J. R., & Oliveira, J. L. (2025). An Embedding-Based Machine Learning Solution for Medical Concept Mapping. In Proceedings - IEEE Symposium on Computer-Based Medical Systems.
Abstract
The integration of heterogeneous clinical datasets is a fundamental challenge in contemporary biomedical research, particularly when reconciling multi-language and multi-institution data sources. The main difficulty lies in the effort required to map the original concepts to their standard definitions. Various automated mapping solutions can assist researchers in this process, but the complexity grows when handling multi-language datasets, resulting in substantial manual work for translation and mapping. In this paper, we propose a novel framework for clinical concept harmonisation that leverages vector-based embeddings and semantic search methodologies to enhance interoperability in multi-cohort studies. The methodology incorporates comprehensive data profiling, ontology-driven concept alignment, and machine learning-based vector search within a unified architecture. We demonstrate the efficacy of this approach through practical application to Alzheimer's disease (AD) research datasets from distinct institutions with different languages, achieving effective cross-lingual concept mapping while maintaining compatibility with established standardisation frameworks. © 2025 IEEE
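As a rough illustration of the general idea of embedding-based concept mapping, and not the authors' implementation, the sketch below embeds source terms and standard concept labels as vectors and picks the nearest concept by cosine similarity. Everything in it is an assumption made for illustration: the function names (`toy_embed`, `map_concepts`), the concept identifiers, and the example vocabulary are hypothetical. The hashed character-trigram embedder is only a dependency-free stand-in; it captures spelling overlap, not meaning, so a cross-lingual setup like the one described in the abstract would presumably pass a multilingual text-embedding model through the `embed` parameter instead.

```python
import math
import zlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 256) -> Vector:
    """Stand-in embedder (assumption): hashed character trigrams, L2-normalised."""
    padded = f"  {text.lower().strip()}  "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def map_concepts(
    source_terms: List[str],
    standard_concepts: Dict[str, str],
    embed: Callable[[str], Vector] = toy_embed,
    top_k: int = 1,
) -> Dict[str, List[Tuple[str, str, float]]]:
    """For each source term, return the top_k (concept_id, label, score) candidates."""
    # Embed the standard vocabulary once and reuse it for every query.
    index = [(cid, label, embed(label)) for cid, label in standard_concepts.items()]
    results: Dict[str, List[Tuple[str, str, float]]] = {}
    for term in source_terms:
        query = embed(term)
        scored = sorted(
            ((cosine(query, vec), cid, label) for cid, label, vec in index),
            reverse=True,
        )
        results[term] = [(cid, label, score) for score, cid, label in scored[:top_k]]
    return results


# Hypothetical vocabulary; the identifiers are made up for this example.
STANDARD = {
    "C001": "Mini-Mental State Examination",
    "C002": "Body mass index",
    "C003": "Systolic blood pressure",
}

if __name__ == "__main__":
    for term, candidates in map_concepts(["mini mental state exam"], STANDARD).items():
        print(term, "->", candidates)
```

Returning ranked candidates with their scores, rather than a single hard assignment, leaves room for a reviewer to confirm or reject low-scoring matches, which seems a sensible default for any mapping that feeds a standardised data model.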