A perspective for combining LLMs, Ontologies, and Knowledge Graphs in the Biomedical Domain
I want to thank Vlasta Kus for the feedback on the initial idea of this article.
Introduction
Named Entity Disambiguation (NED) is an essential task in Natural Language Processing (NLP) for resolving ambiguous mentions of named entities to their corresponding unambiguous entities in a reference knowledge base.
The key idea of NED is to map a continuous span of text, such as “Type 2 diabetes”, to a ground-truth entity, such as “Type 2 Diabetes Mellitus” (CUI C0011860), in a medical knowledge base such as the Unified Medical Language System (UMLS). NED is particularly relevant in critical domains, including the biomedical one, because making the right decision at the right time depends on detecting precise information with high accuracy.
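To make the mapping described above concrete, here is a minimal sketch in Python. It is a toy illustration, not a method proposed in this article: the names `Entity`, `KB`, `candidates`, and `disambiguate` are hypothetical, the alias lists are invented for the example, and the second identifier is a made-up placeholder rather than a real UMLS CUI. Token overlap with the surrounding context stands in for whatever scoring a real NED system would use.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    cui: str                  # identifier in the reference knowledge base
    name: str                 # canonical (unambiguous) name
    aliases: tuple[str, ...]  # normalized surface forms that may refer to it


# Toy knowledge base. Only C0011860 comes from the text above; the second
# identifier is a placeholder and the aliases are illustrative assumptions.
KB = [
    Entity("C0011860", "Type 2 Diabetes Mellitus",
           ("type 2 diabetes", "t2dm", "diabetes")),
    Entity("PLACEHOLDER-0001", "Type 1 Diabetes Mellitus",
           ("type 1 diabetes", "t1dm", "diabetes")),
]


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidates(mention: str) -> list[Entity]:
    """Return every entity whose alias list contains the normalized mention."""
    key = " ".join(re.findall(r"[a-z0-9]+", mention.lower()))
    return [entity for entity in KB if key in entity.aliases]


def disambiguate(mention: str, context: str = "") -> Optional[Entity]:
    """Pick the candidate whose canonical name shares most tokens with the context."""
    found = candidates(mention)
    if not found:
        return None  # the mention is not covered by the knowledge base
    context_tokens = tokens(context)
    # On ties, max() keeps the first candidate in knowledge-base order.
    return max(found, key=lambda e: len(tokens(e.name) & context_tokens))
```

An unambiguous mention such as `disambiguate("Type 2 diabetes")` has a single candidate, while the bare mention "diabetes" matches both entries and is resolved only by its context, which is the part of the task that makes disambiguation hard.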
Large Language Models (LLMs) are machine learning models that learn patterns and relationships from vast amounts of textual data and use this accumulated, compressed knowledge to generate human language text. However, because of their inherent limitations, LLMs struggle with tasks that demand precise and detailed language comprehension, such as disambiguating named entities. Moreover, some of their well-known limitations weigh even more heavily in the scenario we are considering:
- Hallucination: false or misleading information presented as fact by LLMs can have a harmful impact when high accuracy is required, especially in domains like healthcare.
- Sensitivity to perturbations: in multifaceted contexts, small changes in the input can produce different outputs, which makes the results unreliable and unstable.
- Concept drift: the risk of outdated results is high in an ever-evolving scenario like medicine, where new topics and trends constantly emerge.
This article gives an overview of the diverse learning paradigms connected to LLMs, Knowledge Graphs (KGs), and ontologies. It discusses the peculiarities of the life science field, which encompasses large amounts of heterogeneous data. After this introductory part, the article dives into the fundamental principles behind current approaches to NED. Moreover, it proposes potential solutions…