Named entity annotation schema for geological literature mining in the domain of porphyry copper deposits
2022
ORE GEOLOGY REVIEWS
DOI: 10.1016/j.oregeorev.2022.105243
Owing to advances in natural language processing and deep learning, geological text data have become a vital resource for knowledge discovery and have attracted the attention of publishers, academic organizations, and domain scientists. However, extracting information from unstructured literature remains a challenge, and a fundamental issue is defining the categories and types of discipline-specific information to be extracted. This paper presents an effective workflow for building and applying ontologies in geoscience text mining. The workflow comprises a use case-driven method for building an ontology model of porphyry copper deposits, an entity annotation schema for text mining, and the application of both to real-world data. First, the Dexing porphyry copper deposit was selected as a case study to guide the construction of the ontology model. The text data in this study provided a series of entity instances, and by analyzing both domain knowledge of mineral deposit models and these instance data, we built the classes of the ontology. Second, using the established ontology, a named entity annotation schema comprising 21 entity tokens was designed to scale up the text mining tasks. Third, based on the annotation schema, a draft corpus of more than 200,000 words and a finely corrected corpus of 53,339 words were built for training a geological entity recognizer for porphyry copper deposits. The performance of the geological entity recognizer and the statistical distribution of entities in the corpus demonstrate that the workflow presented in this study is effective for designing entity annotation schemas and facilitating large-scale text data mining in geoscience.
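
As a rough illustration of what an entity annotation schema implies for a training corpus, the sketch below converts span-level annotations into per-token BIO tags, a common input format for entity recognizers. It is a minimal sketch under stated assumptions: the abstract does not list the 21 entity types or say which tagging format was used, so the labels `DEPOSIT`, `ROCK`, and `MINERAL`, the example sentence, and the function `spans_to_bio` are all hypothetical.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label)
# Labels are hypothetical placeholders, not the paper's 21 entity types.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-index spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Made-up example sentence, tokenized by hand for illustration.
tokens = ["The", "Dexing", "deposit", "hosts", "granodiorite",
          "porphyry", "with", "chalcopyrite", "."]
spans = [(1, 3, "DEPOSIT"), (4, 6, "ROCK"), (7, 8, "MINERAL")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Rejecting overlapping spans keeps the example simple; a schema that allows nested entities would need a different encoding.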