Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian

Items

Type
Paper in proceedings
Paper version
published
Language
English
Creator
Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić
Source
Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France
Publisher
European Language Resources Association
Date of publication
2020
Abstract
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between the Serbian morphological dictionaries, MULTEXT-East, and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.
Start page
3954
End page
3962
Subject
Part-of-Speech tagging, lemmatization, corpus, evaluation, Serbian, morphological dictionary
Broader paper category
М30
Narrower paper category
М33
Rights
Open access
License
All rights reserved
Format
.pdf
Media
2020.lrec-1.487.pdf

Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)

This item was submitted on 25 March 2021 by [anonymous user] using the form “Рад у зборнику радова” (Paper in proceedings) on the site “Радови” (Papers): http://gabp-dl.rgf.rs/s/repo
