KIND: A project for the automatic induction of lexical taxonomies
Keywords:
automatic induction of lexical taxonomies, hypernym relations, hypernymy discovery, computational lexicography, lexical semantics
Abstract
This paper describes the Kind Project, an algorithm for the automatic induction of lexical taxonomies from corpora. Taxonomy induction consists of discovering hypernymy relations between pairs of single-word or multiword nouns and integrating these pairs into larger structures. The proposed methodology is fundamentally statistical and requires minimal linguistic resources, which makes the experiments easier to reproduce in different languages. Results have been obtained so far for Spanish, English and French. The implementation of the algorithm is available as open source on the project's website, together with an online demo.
License
Copyright (c) 2021 Anales de Lingüística
This work is licensed under a Creative Commons Attribution 2.5 Argentina License.
Authors who publish in this journal agree to the following terms:
1. Authors retain copyright and grant the journal the right of first publication of the work under a Creative Commons Attribution 2.5 Argentina license (CC BY 2.5 AR). They may therefore share the work with explicit reference to its original publication in this journal.
2. Anales de Lingüística permits and encourages authors to disseminate the published work electronically, through its link and/or the postprint version of the file downloaded independently.
3. You are free to:
Share: copy and redistribute the material in any medium or format.
Adapt: remix, transform, and build upon the material for any purpose, even commercially.