Generated on: June 07, 2026
Comprehensive Literature Review: Ontology Discovery Research
1. Introduction and Foundational Concepts
Ontology discovery represents a critical research area at the intersection of artificial intelligence, knowledge engineering, and semantic web technologies. The fundamental challenge is transforming unstructured or semi-structured data into formal, machine-readable ontological representations that capture domain-specific knowledge, concepts, relationships, and hierarchies [1]. Unlike manual ontology engineering—which remains expensive, time-consuming, and prone to human error—automated ontology discovery leverages computational techniques to extract and organize knowledge at scale [2].
The scope of ontology discovery extends beyond simple concept extraction to encompass multiple interconnected tasks: term extraction and typing, taxonomy discovery through hypernym relationships, non-taxonomic relation extraction, entity linking, and ontology alignment [3]. Research demonstrates that integrating linguistic, statistical, and semantic approaches can significantly enhance the quality and expressiveness of discovered ontologies [1], establishing foundations for knowledge graphs, linked data applications, and semantic web services [4].
2. Evolution of Ontology Discovery Methods (2002-2025)
The methodological landscape of ontology discovery has undergone substantial transformation over two decades. Early approaches relied heavily on linguistic and text mining techniques, with NLP-based methods dominating the field [1]. These foundational systems employed shallow natural language processing, pattern matching, and manual intervention for concept and relation identification, achieving modest automation levels but requiring significant domain expertise.
The transition to statistical and machine learning methods introduced probabilistic topic models and association rule mining, which improved scalability and reduced manual curation effort. These systems pioneered graph-based approaches for taxonomy induction, demonstrating the effectiveness of combining multiple evidence sources. However, these methods struggled with domain-specific terminology and faced challenges in capturing complex semantic relationships beyond taxonomic hierarchies [5].
Deep learning revolutionized ontology discovery beginning around 2015-2018 [3]. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs) enabled automatic feature extraction from large corpora without manual feature engineering. Transformer-based architectures, particularly BERT and its variants, achieved breakthrough performance in named entity recognition, relation extraction, and semantic similarity tasks [3].
The application of large language models (LLMs) to ontology learning, which gained momentum in 2023, represents a paradigm shift in ontology discovery capabilities [6]. Models such as GPT-3 and GPT-4, along with domain-specific fine-tuned variants, demonstrate strong performance in zero-shot and few-shot settings, enabling rapid ontology discovery across diverse domains without extensive training data [7]. Recent research indicates that modern LLMs combined with retrieval-augmented generation (RAG) and fine-tuning techniques achieve F1-scores exceeding 90% on multiple ontology learning benchmarks [8].
3. Technical Approaches and Methodologies
3.1 Linguistic and NLP-Based Approaches
Linguistic methods form the foundation of ontology discovery, employing part-of-speech tagging, syntactic parsing, and lexico-syntactic pattern matching to identify candidate concepts and relations [1]. These approaches leverage the observation that certain linguistic patterns reliably indicate ontological structures; for example, the pattern "X is a type of Y" frequently signals hypernymic relationships. Advanced NLP techniques incorporate word sense disambiguation to resolve polysemy and homonymy, enhancing the semantic precision of extracted concepts.
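As a minimal sketch of lexico-syntactic pattern matching, the snippet below applies a single "X is a type of Y" pattern with a regular expression. The pattern, the function name `extract_is_a`, and the sample sentences are illustrative assumptions rather than part of any surveyed system; noun phrases are approximated as runs of one or two words, where a real pipeline would use a parser or chunker.

```python
import re

# One Hearst-style pattern: "<X> is a type/kind of <Y>".
# Assumption: a noun phrase is one or two word tokens; this is a crude stand-in
# for proper noun-phrase chunking.
IS_A = re.compile(
    r"(?:\b(?:a|an|the)\s+)?(?P<hypo>[\w-]+(?:\s[\w-]+)?)"
    r"\s+is\s+an?\s+(?:type|kind)\s+of\s+"
    r"(?:(?:a|an|the)\s+)?(?P<hyper>[\w-]+(?:\s[\w-]+)?)(?=\s*[.,;]|$)",
    re.IGNORECASE,
)

def extract_is_a(text):
    """Return lowercased (hyponym, hypernym) candidate pairs found in text."""
    return [
        (m.group("hypo").lower(), m.group("hyper").lower())
        for m in IS_A.finditer(text)
    ]

pairs = extract_is_a(
    "Aspirin is a type of analgesic. A stroke is a kind of cerebrovascular disease."
)
print(pairs)
```

Patterns of this kind tend to be precise but sparse, which is one reason the approaches in the following sections combine them with statistical or neural evidence.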
Semantic role labeling (SRL) and frame semantics have enabled sophisticated extraction of n-ary relations beyond simple binary subject-object pairs, particularly valuable for biomedical and scientific domains where relationships involve multiple entities and complex attributes. Integration of coreference resolution with information extraction improves entity linking accuracy by resolving pronoun references and alias mentions across documents.
3.2 Statistical and Data Mining Methods
Statistical approaches to ontology discovery employ clustering algorithms, association rule mining, and probabilistic graphical models to discover patterns in unstructured text corpora. Hierarchical clustering techniques organize extracted concepts into taxonomic hierarchies by leveraging similarity metrics and distance computations, while Latent Dirichlet Allocation (LDA) and other topic models identify semantic relationships between terms based on co-occurrence statistics.
These methods prove particularly effective when combined with domain-specific knowledge and ontology constraints.
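The clustering idea can be sketched with a small single-linkage agglomerative procedure over co-occurrence contexts. Everything here is an illustrative assumption: the toy context sets, the choice of Jaccard similarity, and the function names are not taken from the cited work.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(contexts):
    """Single-linkage agglomerative clustering.

    contexts maps each term to the set of words it co-occurs with.
    Returns the merge history as (cluster_a, cluster_b, similarity) tuples,
    which can be read bottom-up as a candidate hierarchy.
    """
    clusters = [frozenset([term]) for term in sorted(contexts)]
    merges = []

    def link(pair):
        c1, c2 = pair
        return max(jaccard(contexts[x], contexts[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=link)
        merges.append((sorted(c1), sorted(c2), round(link((c1, c2)), 3)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Hypothetical co-occurrence contexts for four terms.
contexts = {
    "aspirin": {"dose", "pain", "tablet"},
    "ibuprofen": {"dose", "pain", "tablet", "inflammation"},
    "mri": {"scan", "image", "contrast"},
    "ct": {"scan", "image", "radiation"},
}
merges = agglomerate(contexts)
for step in merges:
    print(step)
```

The two drug terms merge first and the two imaging terms second; the final merge at similarity 0.0 shows where a cut-off threshold would stop the hierarchy.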
3.3 Machine Learning and Deep Learning Frameworks
Contemporary machine learning approaches employ supervised and semi-supervised learning to transform ontology discovery into classification and sequence labeling problems [9]. BERT-based sequence labeling with BIO (Begin-Inside-Outside) tagging schemes achieves state-of-the-art performance in named entity recognition, while BiLSTM-CRF architectures capture label dependencies and sequential patterns [10].
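To make the BIO scheme concrete, the following sketch decodes a tag sequence into labeled spans. The tags are written by hand here; in practice they would come from a sequence labeler such as the BERT or BiLSTM-CRF models mentioned above. The function name and the label set (Disease, Drug) are assumptions for illustration.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (text, label) spans."""
    spans, current, label = [], [], None

    def flush():
        if current:
            spans.append((" ".join(current), label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts (an orphan I- tag is treated like B-).
            flush()
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open entity
        else:
            flush()                # "O" closes any open entity
            current, label = [], None
    flush()
    return spans

tokens = ["Acute", "myocardial", "infarction", "is", "treated", "with", "aspirin", "."]
tags = ["B-Disease", "I-Disease", "I-Disease", "O", "O", "O", "B-Drug", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```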
Transformer architectures and attention mechanisms enable joint entity-relation extraction through multi-task learning frameworks that optimize entity detection and relation classification simultaneously [11]. Notably, these approaches achieve 85-93% precision on domain-specific corpora, substantially improving upon traditional pipeline methods that suffer from error propagation [10].
3.4 Large Language Model (LLM)-Based Approaches
LLM-based ontology discovery represents the current frontier, leveraging models' contextual understanding and few-shot learning capabilities [6]. Zero-shot prompting methods enable ontology learning across diverse domains without task-specific fine-tuning. Fine-tuning on domain-specific corpora improves LLM performance further, achieving F1-scores exceeding 96% on medical ontology mapping tasks [12].
Prompt engineering techniques—including chain-of-thought reasoning, in-context examples, and multi-stage prompting—significantly enhance LLM performance on ontology learning tasks. Retrieval-augmented generation (RAG) integrates external knowledge bases with LLM reasoning, enabling ontology-aware knowledge extraction that respects domain constraints and existing ontological structures [13], [14].
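A few-shot prompt with retrieved context might be assembled roughly as follows. This is only a sketch of the prompt-construction step: the wording of the instruction, the function `build_typing_prompt`, and the example terms are hypothetical, and no particular LLM API is assumed or called.

```python
def build_typing_prompt(term, examples, retrieved=()):
    """Assemble a few-shot term-typing prompt.

    examples:  (term, type) pairs used as in-context demonstrations.
    retrieved: optional passages from an external knowledge base (the RAG part).
    """
    lines = ["Assign the term one type from the ontology. Answer with the type only.", ""]
    if retrieved:
        lines.append("Context:")
        lines += [f"- {passage}" for passage in retrieved]
        lines.append("")
    for ex_term, ex_type in examples:
        lines += [f"Term: {ex_term}", f"Type: {ex_type}", ""]
    lines += [f"Term: {term}", "Type:"]
    return "\n".join(lines)

prompt = build_typing_prompt(
    "metformin",
    examples=[("aspirin", "Drug"), ("hypertension", "Disease")],
    retrieved=["Metformin is a first-line medication for type 2 diabetes."],
)
print(prompt)
```

The resulting string would be sent to whichever model is in use; keeping the demonstrations and the retrieved passages as separate arguments makes it easy to compare zero-shot, few-shot, and retrieval-augmented variants.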
4. Core Ontology Learning Tasks and Techniques
4.1 Term Extraction and Typing
Term extraction identifies domain-specific terminology from unstructured texts, requiring discrimination between meaningful concepts and irrelevant collocates [15]. Advanced approaches employ multiple strategies: linguistic patterns (compound noun detection, appositive constructions), statistical measures (TF-IDF, C-value, NTF), and neural sequence models with contextual embeddings. Term typing subsequently classifies extracted terms into predefined categories (e.g., process, agent, patient), with BERT-based models achieving 86-90% accuracy on diverse domain corpora.
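As a small illustration of the statistical measures, the sketch below scores tokens with a basic TF-IDF variant (relative term frequency times the natural log of inverse document frequency). The three-document corpus and whitespace tokenization are simplifying assumptions; C-value and multi-word term handling are not covered.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: score} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append(
            {t: (count / len(doc)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return scores

docs = [
    "the kinase binds the receptor",
    "the receptor signals the nucleus",
    "the patient visits the clinic",
]
scores = tfidf(docs)
ranked = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Tokens that occur in every document, such as "the", score zero, while tokens confined to one document rank highest as term candidates.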
Recent research demonstrates that combining word embeddings with domain-specific knowledge substantially improves term extraction quality, particularly for low-frequency technical terminology [16]. LLM-based approaches achieve 88-94% accuracy by leveraging pre-trained linguistic knowledge while adapting to domain-specific terminology through fine-tuning or few-shot demonstrations.
4.2 Taxonomy Discovery and Hierarchy Construction
Taxonomy discovery aims to extract or construct hierarchical is-a relationships between concepts, forming tree or DAG-structured knowledge representations. Modern methods combine embedding-based similarity, graph neural networks, and reinforcement learning. These methods outperform two-phase pipeline approaches by leveraging joint optimization across ontology discovery tasks.
Graph-based algorithms achieve competitive results through optimal branching on weighted hypernym graphs, learning both concepts and relations from scratch. LLM-based taxonomy discovery employs prompt-driven hierarchical reasoning, with fine-tuned models achieving strong performance compared to ground truth taxonomies [8]. Multi-level hierarchy construction benefits significantly from LLM reasoning capabilities, with models capturing semantic nuances that traditional statistical methods miss.
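The graph-based idea can be sketched in simplified form. The code below is not a full optimal-branching (Chu-Liu/Edmonds) implementation: it performs only the greedy first step of picking each concept's heaviest candidate parent, then checks for cycles. When the greedy choice is already acyclic it coincides with the optimal branching; otherwise the full algorithm would need to contract cycles, which is omitted here. Edge weights and concept names are invented for illustration, and every non-root concept is assumed to have at least one candidate parent.

```python
def best_parent_tree(edges, root):
    """Pick the highest-weight parent for every non-root concept.

    edges: {(child, parent): weight} candidate hypernym edges.
    Raises ValueError if the greedy choice contains a cycle.
    """
    best = {}
    for (child, parent), weight in edges.items():
        if child != root and (child not in best or weight > best[child][1]):
            best[child] = (parent, weight)
    parent_of = {child: parent for child, (parent, _) in best.items()}

    # Follow parent links from each node; revisiting a node means a cycle.
    for node in parent_of:
        seen, cur = set(), node
        while cur in parent_of:
            if cur in seen:
                raise ValueError(f"cycle through {cur!r}; cycle contraction needed")
            seen.add(cur)
            cur = parent_of[cur]
    return parent_of

# Hypothetical weighted hypernym candidates (child, parent) -> confidence.
edges = {
    ("dog", "mammal"): 0.90,
    ("dog", "animal"): 0.60,
    ("cat", "mammal"): 0.85,
    ("cat", "dog"): 0.10,
    ("mammal", "animal"): 0.80,
    ("animal", "entity"): 0.70,
}
taxonomy = best_parent_tree(edges, root="entity")
print(taxonomy)
```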
4.3 Non-Taxonomic Relation Extraction
Non-taxonomic relations represent domain-specific associations beyond simple is-a hierarchies—including causality, composition, functionality, and cross-domain relationships [5]. Relation extraction employs sequence labeling, structured prediction, and graph neural networks to identify and classify relations between entity pairs [11].
Joint entity-relation extraction frameworks leverage shared representations and multi-task learning, achieving 15-20% performance improvements compared to pipeline methods [17]. Attention mechanisms facilitate fine-grained alignment between relation patterns and semantic structures, while ontology constraints enforce domain-specific relation validity. These methods achieve 85-93% F1-scores on domain-specific relation extraction benchmarks [10].
4.4 Entity Linking and Disambiguation
Entity linking connects textual mentions to canonical entities in knowledge bases or ontologies, addressing semantic ambiguity through context-aware disambiguation. Deep learning approaches employ neural encoders to represent mentions and entities in shared embedding spaces, using attention mechanisms to resolve ambiguity based on document context.
Personalized PageRank (PPR) combined with semantic similarity measures improves entity linking accuracy by leveraging ontology structure and domain knowledge. Multi-modal entity linking approaches integrate textual, visual, and relational information, achieving superior disambiguation in multimodal knowledge bases.
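A minimal power-iteration sketch of Personalized PageRank is shown below, applied to a toy disambiguation case. The graph, the entity identifiers, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration; the semantic-similarity component mentioned above is left out.

```python
def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """Power iteration for PPR.

    graph: {node: [neighbors]} with every node present as a key.
    seeds: nodes that receive the restart probability mass.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank

# Toy graph: the mention "jaguar" has two candidate entities; the document
# context mentions "rainforest", which is used as the seed.
graph = {
    "rainforest": ["jaguar_animal", "amazon"],
    "amazon": ["rainforest", "jaguar_animal"],
    "jaguar_animal": ["rainforest", "amazon"],
    "jaguar_car": ["engine"],
    "engine": ["jaguar_car"],
}
rank = personalized_pagerank(graph, seeds={"rainforest"})
best = max(["jaguar_animal", "jaguar_car"], key=rank.get)
print(best)
```

Because the walk restarts at the context entity, candidates connected to that context accumulate rank while unconnected candidates receive none.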
5. Application Domains and Real-World Implementations
5.1 Biomedical and Healthcare Ontologies
The biomedical domain has driven substantial ontology discovery research, with applications spanning genomics, pharmacology, clinical informatics, and disease classification [6]. The Gene Ontology (GO) continues to expand through both manual curation and automated knowledge extraction.
Domain-specific ontologies for drug targets, natural product interactions [18], and clinical conditions exemplify successful knowledge representation systems built through hybrid ontology learning approaches.
5.2 Earth Science and Geospatial Applications
The Earth observation and geospatial domain increasingly leverages ontology discovery for environmental monitoring, climate science, and land management. Remote sensing indices, geospatial concepts, and environmental variables are systematically organized through semantic hierarchies that integrate mathematical semantics with ontological structures [19].
5.3 Cultural Heritage and Digital Humanities
Cultural heritage ontologies integrate CIDOC Conceptual Reference Model (CRM) standards with LLM-driven knowledge extraction, enabling semantic interoperability across heterogeneous museum and archival collections [20]. Knowledge graphs representing cultural properties, historical figures, and temporal relationships demonstrate the application of ontology discovery to digital humanities scholarship.
5.4 Emerging Domains: Cybersecurity, Maritime, and Space
Recent research extends ontology discovery to specialized domains including cybersecurity threat intelligence, marine ranching equipment [10], industrial processes [21], and aerospace systems. These applications demonstrate ontology discovery's adaptability to domain-specific terminology, complex relationships, and evolving threat landscapes.
The space domain represents an emerging frontier, with comprehensive ontology discovery efforts targeting NASA, ESA, and international space agency resources, addressing cislunar operations, satellite systems, and mission planning [2].
6. Advanced Techniques and Emerging Directions
6.1 Knowledge Graph Construction and Integration
Knowledge graph construction represents the natural application of ontology discovery, combining entity extraction, relation extraction, and ontology alignment into end-to-end pipelines [11]. Semantic web technologies including RDF, OWL, and SPARQL enable interoperable knowledge representation and reasoning [4].
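As a sketch of the final serialization step, discovered is-a pairs can be written as RDF in N-Triples syntax using the standard `rdfs:subClassOf` property. The namespace `http://example.org/onto#` and the concept names are placeholders; a production pipeline would more likely use an RDF library than string formatting.

```python
from urllib.parse import quote

RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
BASE = "http://example.org/onto#"  # hypothetical namespace for illustration

def iri(name):
    """Build an IRI from a concept label (spaces become underscores)."""
    return f"<{BASE}{quote(name.replace(' ', '_'))}>"

def to_ntriples(is_a_pairs):
    """Serialize (child, parent) pairs as rdfs:subClassOf statements."""
    return "\n".join(
        f"{iri(child)} <{RDFS_SUBCLASS_OF}> {iri(parent)} ."
        for child, parent in is_a_pairs
    )

ntriples = to_ntriples([("Stroke", "Cerebrovascular Disease"), ("Aspirin", "Analgesic")])
print(ntriples)
```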
LLM-driven knowledge graph construction achieves unprecedented quality through prompt engineering, ontology alignment verification, and multi-agent consensus validation [14]. Physics-regularized knowledge graphs incorporating domain constraints and causal reasoning represent an emerging direction toward trustworthy, interpretable knowledge representation [22].
6.2 Multi-Modal and Cross-Modal Ontology Discovery
Integration of text, images, audio, and structured data enables richer semantic representations through multi-modal ontology discovery [21]. Ontology-driven frameworks for wildfire knowledge graphs, cultural heritage records, and industrial monitoring demonstrate practical benefits of multi-modal approaches [20], [23].
6.3 Ontology Alignment and Interoperability
Ontology alignment addresses the challenge of integrating heterogeneous ontologies developed independently across organizations and communities [24]. Semantic similarity measures, entity embedding methods, and graph neural networks enable automatic identification of equivalent concepts and relationships [12].
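A purely lexical baseline for alignment can be sketched as follows, matching concept labels by token overlap. This stands in for the embedding and graph-based similarity measures described above, which it does not implement; the labels, the 0.5 threshold, and the function names are illustrative assumptions.

```python
def label_tokens(label):
    """Normalize a concept label into a set of lowercase tokens."""
    return set(label.lower().replace("_", " ").split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(source_labels, target_labels, threshold=0.5):
    """Return (source, target, score) for each source label whose best match passes the threshold."""
    matches = []
    for src in source_labels:
        best = max(target_labels, key=lambda tgt: jaccard(label_tokens(src), label_tokens(tgt)))
        score = jaccard(label_tokens(src), label_tokens(best))
        if score >= threshold:
            matches.append((src, best, round(score, 2)))
    return matches

source = ["Heart Attack", "Blood_Pressure", "Kidney"]
target = ["heart attack", "high blood pressure", "liver"]
matches = align(source, target)
print(matches)
```

"Kidney" has no token in common with any target label and is left unaligned, which illustrates why lexical matching alone misses synonyms and motivates embedding-based measures.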
Ontology fusion techniques merge multiple ontologies while preserving semantic integrity and resolving conflicts through logical reasoning and constraint satisfaction.
6.4 Continuous and Dynamic Ontology Evolution
Temporal knowledge graphs tracking ontology evolution enable analyzing how domain knowledge changes over time [25], supporting applications in policy analysis, scientific discovery tracking, and organizational knowledge management.
7. Challenges and Limitations
7.1 Semantic Ambiguity and Polysemy
Semantic ambiguity—where terms have multiple meanings depending on context—remains a fundamental challenge in ontology discovery [1]. Word sense disambiguation techniques using knowledge bases and contextual embeddings partially address this problem, but domain-specific polysemy continues to cause errors.
7.2 Data Scarcity and Domain Specificity
Limited training data in specialized domains constrains the performance of supervised learning approaches [15]. Transfer learning and few-shot learning partially address this challenge, enabling knowledge transfer from high-resource to low-resource domains [26]. However, domain-specific terminology and semantic structures often require fine-tuning or adaptation strategies.
7.3 Evaluation Metrics and Benchmarking
Standardized evaluation metrics for ontology quality remain underdeveloped, complicating comparative analysis across methods and domains [1]. While precision, recall, and F1-score measure task performance, they do not fully capture ontology utility, coherence, or downstream application impact. The LLMs4OL Challenge and similar initiatives work toward establishing standardized benchmarks and evaluation frameworks [8], [27].
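For reference, the task-level metrics can be computed over sets of extracted triples as in the sketch below. The gold and predicted triples are invented, and exact-match comparison is an assumption; benchmarks may apply their own matching rules.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = {
    ("aspirin", "is_a", "analgesic"),
    ("aspirin", "treats", "pain"),
    ("stroke", "is_a", "cerebrovascular disease"),
    ("mri", "is_a", "imaging procedure"),
}
predicted = {
    ("aspirin", "is_a", "analgesic"),
    ("aspirin", "treats", "pain"),
    ("aspirin", "causes", "pain"),  # a spurious extraction
}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here two of three predictions are correct and two of four gold triples are recovered, giving precision 2/3, recall 1/2, and F1 4/7. None of these numbers says whether the resulting ontology is coherent or useful downstream, which is the gap noted above.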
7.4 Error Propagation in Multi-Stage Pipelines
Traditional ontology discovery pipelines combine multiple stages (NER → relation extraction → entity linking), where errors in early stages propagate to downstream components. End-to-end joint learning approaches mitigate this problem, but challenges remain in balancing optimization across interconnected tasks [17].
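A back-of-the-envelope calculation shows why propagation matters. Assuming, purely for illustration, that each stage is 90% accurate and that stage errors are independent, the fraction of items that pass through all three stages correctly is the product of the stage accuracies.

```python
from math import prod

# Illustrative per-stage accuracies; not measurements from any cited system.
stage_accuracy = {
    "ner": 0.90,
    "relation_extraction": 0.90,
    "entity_linking": 0.90,
}

# Under the independence assumption, an item is fully correct only if
# every stage handles it correctly.
end_to_end = prod(stage_accuracy.values())
print(f"end-to-end accuracy: {end_to_end:.3f}")
```

Three stages at 0.90 each yield roughly 0.729 end to end, so a pipeline can fall well below the accuracy of any single component.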
7.5 Computational Efficiency and Scalability
While LLMs demonstrate superior performance, their computational requirements (particularly for inference) limit practical deployment at web scale. Efficient model design, distillation techniques, and edge deployment strategies represent active research areas.
7.6 Cross-Lingual and Multilingual Ontology Discovery
Most ontology discovery research focuses on English, limiting applicability to multilingual contexts. Cross-lingual transfer learning and multilingual LLMs partially address this limitation, but language-specific linguistic phenomena and cultural context variations require specialized approaches.
8. Evaluation Frameworks and Benchmarks
The field has developed comprehensive evaluation frameworks including the LLMs4OL Challenge datasets covering term typing, taxonomy discovery, and non-taxonomic relation extraction [8], [27]. Domain-specific benchmarks provide curated evaluation resources for biomedical, mobility, and general knowledge base population tasks. These resources enable fair comparison across methods and support systematic error analysis.
9. Future Research Directions
9.1 Neuro-Symbolic Integration
Combining neural deep learning with symbolic reasoning and logical constraints represents a promising direction toward more interpretable and trustworthy ontology discovery systems. Ontology-aware neural architectures that explicitly incorporate domain constraints could improve both performance and adherence to domain semantics.
9.2 Agentic and Reasoning-Enhanced Approaches
Agent-based ontology curation systems that coordinate multiple specialized models for different tasks demonstrate emerging capabilities in end-to-end scientific knowledge extraction. Reasoning-enhanced LLMs incorporating chain-of-thought, planning, and verification steps could improve reliability for critical applications.
9.3 Domain-Specific Foundation Models
Fine-tuning foundation models on domain-specific corpora shows promise for improving performance on specialized ontology learning tasks [12]. This direction balances the generalization benefits of large-scale pre-training with domain-specific adaptation.
9.4 Robust Evaluation and Explainability
Developing comprehensive evaluation frameworks that measure not just performance but also ontology coherence, explainability, and downstream utility remains an important challenge [1]. Explainable AI techniques for understanding how ontology discovery methods make decisions would improve trust and adoption.
10. Conclusion
Ontology discovery has evolved from manual, expert-driven processes to sophisticated automated systems leveraging deep learning and large language models. The field demonstrates substantial progress in automating knowledge extraction, with modern systems reporting scores of 88-96% (accuracy or F1, depending on the task) on multiple benchmarks [8], [12]. However, significant challenges remain in handling semantic ambiguity, managing domain-specific terminology, evaluating ontology quality, and deploying systems at scale [1].
The integration of LLMs with knowledge graphs, retrieval-augmented generation, and domain-specific reasoning represents the current frontier, enabling unprecedented capabilities in knowledge representation and discovery [14], [24]. Future work must focus on enhancing explainability, reducing computational requirements, extending to multilingual contexts, and developing unified evaluation frameworks that capture both technical performance and practical utility.
Ontology discovery continues to prove essential across diverse domains—from biomedical research to space exploration—demonstrating its critical role in knowledge engineering and AI-native systems. As data volumes continue to grow exponentially, automated ontology discovery capabilities become increasingly indispensable for maintaining up-to-date, interoperable semantic representations of domain knowledge [2].
References
[1] 	M. Asim, M. Wasim, M. U. G. Khan, W. Mahmood, and H. M. Abbasi, “A survey of ontology learning techniques and applications,” Database J. Biol. Databases Curation, Oct. 2018, doi: 10.1093/database/bay101.
[2] 	“Text block 1,” Unknown Year, Available: None
[3] 	T. Zengeya and J. V. Fonou-Dombeu, “A review of state of the art deep learning models for ontology construction,” IEEE Access, 2024, doi: 10.1109/ACCESS.2024.3406426.
[4] 	O. Martinez and D. Sharma, “Semantic web technologies for knowledge graph construction and querying,” International Journal on Advanced Computer Theory and Engineering, Apr. 2025, doi: 10.65521/ijacte.v13i1.94.
[5] 	M. Ali, S. Fathalla, M. Kholief, and Y. F. Hassan, “The problem learning non-taxonomic relationships of ontologies from unstructured data sources,” Sept. 2017, doi: 10.23919/iconac.2017.8082083.
[6] 	H. B. Giglou, J. D’Souza, and S. Auer, “LLMs4OL: Large language models for ontology learning,” International Workshop on the Semantic Web, July 2023, doi: 10.48550/arXiv.2307.16648.
[7] 	O. Perera and J. Liu, “Exploring large language models for ontology learning,” Issues in Information Systems, 2024, doi: 10.48009/4_iis_2024_124.
[8] 	H. B. Giglou, J. D’Souza, S. Sadruddin, and S. Auer, “LLMs4OL 2024 datasets: Toward ontology learning with large language models,” LLMs4OL@ISWC, Oct. 2024, doi: 10.52825/ocp.v4i.2480.
[9] 	M. Yin, L. C. M. Tang, C. Webster, X. Yi, H. Ying, and Y. Wen, “A deep natural language processing‐based method for ontology learning of project‐specific properties from building information models,” Comput. Aided Civ. Infrastructure Eng., Apr. 2023, doi: 10.1111/mice.13013.
[10] 	D. Chen et al., “Joint entity–relation extraction for knowledge graph construction in marine ranching equipment,” Applied Sciences, July 2025, doi: 10.3390/app15137611.
[11] 	Z. Zhao, X. Luo, M. Chen, and L. Ma, “A survey of knowledge graph construction using machine learning,” Computer Modeling in Engineering & Sciences, 2024, doi: 10.32604/cmes.2023.031513.
[12] 	A. Mavridis, S. Tegos, C. Anastasiou, M. Papoutsoglou, and G. Meditskos, “Large language models for intelligent RDF knowledge graph construction: Results from medical ontology mapping,” Frontiers Artif. Intell., Apr. 2025, doi: 10.3389/frai.2025.1546179.
[13] 	N. Kokash et al., “Ontology- and LLM-based data harmonization for federated learning in healthcare,” Frontiers in Digital Health, May 2025, doi: 10.3389/fdgth.2026.1756555.
[14] 	U. Das, K. Atmakuri, D. H. Ho, C. Lee, and Y. Lee, “Clinical knowledge graph construction and evaluation with multi-LLMs via retrieval-augmented generation,” arXiv.org, Jan. 2026, doi: 10.48550/arXiv.2601.01844.
[15] 	Y. Xu, D. G. Rajpathak, I. Gibbs, and D. Klabjan, “Automatic ontology learning from domain-specific short unstructured text data,” International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Mar. 2019, doi: 10.5220/0009980100290039.
[16] 	N. Mahmoud, H. Elbeh, and H. M. Abdlkader, “Ontology learning based on word embeddings for text big data extraction,” International Computer Engineering Conference, Dec. 2018, doi: 10.1109/ICENCO.2018.8636154.
[17] 	M. J. Sarol, G. Hong, E. Guerra, and H. Kilicoglu, “Integrating deep learning architectures for enhanced biomedical relation extraction: A pipeline approach,” Database J. Biol. Databases Curation, Aug. 2024, doi: 10.1093/database/baae079.
[18] 	S. B. Taneja et al., “Developing a knowledge graph for pharmacokinetic natural product-drug interactions,” Journal of Biomedical Informatics, Mar. 2023, doi: 10.1016/j.jbi.2023.104341.
[19] 	C. Wang, W. Shi, and H. Lv, “Construction of remote sensing indices knowledge graph (RSIKG) based on semantic hierarchical graph,” Remote Sensing, Dec. 2023, doi: 10.3390/rs16010158.
[20] 	Y. Wang and M. Zhang, “CIDOC CRM-based knowledge graph construction for cultural heritage using large language models,” Applied Sciences, Nov. 2025, doi: 10.3390/app152212063.
[21] 	S. Xie, T. Yang, Y. Xie, H. Ying, and Z. Wu, “LLM-driven multimodal knowledge graph construction for industrial process with prompt optimization and fuzzy RAG,” IEEE transactions on fuzzy systems, Apr. 2026, doi: 10.1109/TFUZZ.2026.3665172.
[22] 	T. Jin, X. Chen, D. Zhang, and B. Zeng, “PhyGeo-KG: Physics-regularized distant supervision for multimodal geometric knowledge graph construction in catenary maintenance,” Sensors, Mar. 2026, doi: 10.3390/s26072155.
[23] 	C. K. Jayaweeera and D. Meedeniya, “Ontology-driven framework for knowledge graph construction from multimodal wildfire data,” International Conference on Soft Computing and Software Engineering, Mar. 2026, doi: 10.1109/SCSE70081.2026.11499760.
[24] 	J. Bai, H. Zhang, and H. Zhao, “A survey on knowledge graph construction from multi-source heterogeneous data,” 2025 5th International Symposium on Artificial Intelligence and Big Data (AIBDF), Dec. 2025, doi: 10.1109/AIBDF67964.2025.11440788.
[25] 	W. Liu et al., “Construction and dynamic evolution of an urban renewal policy knowledge graph integrating multi-source data,” International Conference on Advances in Computing and Artificial Intelligence, Dec. 2025, doi: 10.1109/ACAI68217.2025.11406380.
[26] 	H. T. Mai, C. X. Chu, and H. Paulheim, “Do LLMs really adapt to domains? An ontology learning perspective,” International Workshop on the Semantic Web, July 2024, doi: 10.48550/arXiv.2407.19998.
[27] 	H. B. Giglou, J. D’Souza, and S. Auer, “Preface for LLMs4OL 2024: The 1st large language models for ontology learning challenge at the 23rd ISWC,” LLMs4OL@ISWC, Oct. 2024, doi: 10.52825/ocp.v4i.2472.