Search
109 items
-
Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy and the Lexicon-Corpus Interface
Verginica Barbu Mititelu, Voula Giouli, Kilian Evang, Daniel Zeman, Petya Osenova, Carole Tiberius, Simon Krek, Stella Markantonatou, Ivelina Stoyanova, Ranka Stankovic, Christian Chiarcos (2024)
We present ongoing work on defining a lexicon-corpus interface that will serve as a reference for representing multiword expressions (of various types: nominal, verbal, etc.) in dedicated lexicons and for linking these entries to their occurrences in corpora. The final aim is to use such resources for the automatic identification of multiword expressions in text. Involving several natural languages aims at a universal solution that is not centered on a particular language, while also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword ...
Verginica Barbu Mititelu, Voula Giouli, Kilian Evang, Daniel Zeman, Petya Osenova, Carole Tiberius, Simon Krek, Stella Markantonatou, Ivelina Stoyanova, Ranka Stankovic, Christian Chiarcos. "Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy and the Lexicon-Corpus Interface" in Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, Turin, May 25, 2024, ELRA and ICCL (2024)
-
Towards the semantic annotation of SR-ELEXIS corpus: Insights into Multiword Expressions and Named Entities
This paper presents work on the development of the ELEXIS-sr corpus, the Serbian addition to the ELEXIS multilingual annotated corpus, which consists of semantic annotations and a word sense repository. ELEXIS is a parallel multilingual annotated corpus in ten European languages that can be used as a multilingual benchmark for evaluating low- and medium-resourced European languages. The focus of this paper is on multiword expressions and named entities, their recognition in the ELEXIS-sr sentence set, and their comparison with annotations in other languages. The first steps are discussed ...
Cvetana Krstev, Ranka Stanković, Aleksandra Marković, Teodora Mihajlov. "Towards the semantic annotation of SR-ELEXIS corpus: Insights into Multiword Expressions and Named Entities" in Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, Turin, May 25, 2024, ELRA and ICCL (2024)
-
Multi-word Expressions for Abusive Speech Detection in Serbian
This paper presents research on refining and improving the Serbian version of the Hurtlex dictionary, a multilingual lexicon of offensive words. We pay special attention to adding multiword expressions that can be considered offensive, since such lexical entries are very important for achieving good results in many abusive language detection tasks. Serbian morphological dictionaries are used as the basis for data cleaning and dictionary creation. The connection with other lexical and semantic resources for Serbian is highlighted, and the construction of a system for ...
... Machinery. Antonio Moreno-Ortiz, Chantal Pérez-Hernández, and Maria Del-Olmo. 2013. Managing multiword expressions in a lexicon-based sentiment analysis system for Spanish. In Proceedings of the 9th Workshop on Multiword Expressions, pages 1–10. Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad ...
... multilingual online lexicon of hate speech available at hatebase.org in their research. (Wiegand et al., 2018; Silva et al., 2016; Nobata et al., 2016). Wiegand et al. (2018) built a lexicon of abusive words using the subjectivity lexicon of Therese Wilson that is in essence a sentiment lexicon. They took words ...
... words with negative polarity as a baseline for creating a basic lexicon of 551 words, which was further enriched via machine learning into a lexicon of 2898 abusive words. Several authors used the Wiegand lexicon as a blacklist in their hate speech and abusive language detection systems (Wiegand et al ...
Ranka Stanković, Jelena Mitrović, Danka Jokić, Cvetana Krstev. "Multi-word Expressions for Abusive Speech Detection in Serbian" in Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, Association for Computational Linguistics (2020)
-
Vebran Web Services for Corpus Query Expansion
Ranka Stanković, Miloš Utvić (2020)
This paper discusses the development of the Vebran web services and their application to improving corpus search. The Vebran web services are used to consult external lexical resources for Serbian (mainly electronic morphological dictionaries and the Serbian WordNet) and to expand user queries in order to obtain more relevant results from Serbian corpora.
... format of a TreeTagger full-form lexicon. Each entry of the TreeTagger full-form lexicon contains one-word form and a sequence of tag-lemma pairs that could correspond to that word form (Schmid, 1997). TreeTagger full-form lexicon does not allow the possibility of a lexicon entry with two or more tag-lemma ...
... have homograph word forms (tati, tatom, tate, tatu, tata) causing that lexicon entries with these forms cannot contain both tag-lemma pairs (N, tat) and (N, tata) where N is PoS tag denoting noun. Thus, creator of full-form lexicon has to choose which tag-lemma pair will keep and the choice is commonly ...
... Inflected forms are stored in LeXimirka database and originate from Unitex DELAF and DELACF dictionaries, described in Section 3.1. The inflection of multiword units is additionally supported by the rule based system. The system supports different alphabets and character encodings (the aurora alphabet and ...
Ranka Stanković, Miloš Utvić. "Vebran Web Services for Corpus Query Expansion" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.5
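The full-form lexicon constraint mentioned in the snippets of this entry (one word form cannot carry both (N, tat) and (N, tata)) can be illustrated with a minimal sketch. The data layout, the function name `build_full_form_lexicon` and the "first analysis wins" policy are assumptions made for illustration only; they are not taken from the paper or from TreeTagger itself.

```python
from collections import defaultdict

# Hypothetical candidate analyses: (word form, PoS tag, lemma).
# Forms and lemmas follow the snippet's example; the layout is assumed.
CANDIDATES = [
    ("tati", "N", "tat"),
    ("tati", "N", "tata"),
    ("tatom", "N", "tat"),
    ("tatom", "N", "tata"),
]


def build_full_form_lexicon(candidates):
    """Keep at most one lemma per (form, tag) pair and report the rest.

    The first analysis seen is kept; this policy is an arbitrary
    placeholder for whatever choice the lexicon creator makes.
    """
    lexicon = defaultdict(dict)  # form -> {tag: lemma}
    dropped = []
    for form, tag, lemma in candidates:
        if tag in lexicon[form]:
            # A second lemma for the same form and tag cannot be stored.
            dropped.append((form, tag, lemma))
        else:
            lexicon[form][tag] = lemma
    return dict(lexicon), dropped


lexicon, dropped = build_full_form_lexicon(CANDIDATES)
```

In this sketch `dropped` lists the analyses that are lost, which is the kind of information a lexicon creator would need when deciding which tag-lemma pair to keep.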
-
Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis
This paper presents a model that enables collection, preparation, metadata description, management and exploitation, including full-text search, of documents from the domain of criminalistics written in Serbian. The proposed approach is applied in a web portal that gathers various texts from the journals of the Academy of Criminalistic and Police Studies, the Criminal Code of Serbia, the "Tara" and "Reiss" conferences, as well as some doctoral dissertations related to this field of research. After text processing, a corpus containing more than 5,500 pages of plain text was created and ...
... SentiWordNet sentiment scores (positive and negative) used for each SWN synset. A sentiment lexicon is produced using word forms defined in SWN that have positive or negative sentiment scores. This kind of a lexicon is applied in sentiment polarity classification tasks on Serbian texts, achieving 97.1% ...
... but analyse different contexts of occurrences, more sophisticated query can be requested. The following expression: is an example of morphological and semantic expression search in Unitex system. This query is retrieving any inflective form of lemma napad (attack), preceded ...
... RudOnto and Librarian dictionary. Apart from the grammars in the form finite state automata and transducers, system is using rules for inflection of multiword units. Among textual resources are most important digital libraries, Unitex corpora and CQP web corpora. Linguistic support is implemented via ...
Dalibor Vorkapić, Aleksandra Tomašević, Miljana Mladenović, Ranka Stanković, Nikola Vulović. "Digital Library From A Domain Of Criminalistics As A Foundation For A Forensic Text Analysis" in International Scientific Conference “Archibald Reiss Days” Thematic Conference Proceedings Of International Significance, Belgrade, 7-9 November 2017, Academy Of Criminalistic And Police Studies Belgrade (2017)
-
Terminology Acquisition and Description Using Lexical Resources and Local Grammars
Acquisition of new terminology from specific domains and its adequate description within terminological dictionaries is a complex task, especially for languages that are morphologically complex such as Serbian. In this paper we present an approach to solving this task semi-automatically on the basis of lexical resources and local grammars developed for Serbian. Special attention is given to automatic inflectional class prediction for simple adjectives and nouns and the use of syntactic graphs for extraction of Multi-Word Unit (MWU) candidates for ...
... & Makowiecki, F (2012). SEJFEK - a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units. Proc. of Cognitive Aspects of the Lexicon (COGALEX-III). (pp. 195-214). Zhang, Y., Kordoni, V., Villavicencio, A., & Idiart, M. (2006). Automated multiword expression prediction for grammar engineering ...
... F., & Rubino, F. (2012). A MWE Acquisition and Lexicon Builder Web Service. Proc. of COLING 2012 (pp. 2291-2306). Ramisch, C., De Araujo, V., & Villavicencio, A. (2012). A broad evaluation of techniques for automatic acquisition of multiword expressions. Proc. of ACL 2012 Student Research ...
... in the domain of economy is presented for Polish. It has two modules: a grammatical lexicon of terminological MWEs and a fully lexicalized shallow grammar, obtained by an automatic conversion of the lexicon. Przepiorkowski and associates (2007) present results of automatic extraction of term ...
Cvetana Krstev, Ranka Stanković, Ivan Obradović, Biljana Lazić. "Terminology Acquisition and Description Using Lexical Resources and Local Grammars" in Proceedings of the 11th Conference on Terminology and Artificial Intelligence, Granada, Spain, 2015, Granada : LexiCon (Universidad de Granada) (2015)
-
FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain
The paper gives a brief overview of the theory of frame semantics, on which the FrameNet lexical database is based. The conception of this network is presented, along with the possibilities for its application. The lexical analysis applied in the FrameNet project is also presented, and the differences between frame-based and word-based analysis are pointed out. Several related frames evoked by words from the risk domain are then shown. The paper also presents the NLTK platform, which can be used to ...
... (2020) presents interesting research done for Serbian and Croatian (viewed as varieties of one language) on lexemes that both enter the general lexicon and form part of a certain professional domain (in this case legal terminology). It focused on whether or not ...
... structures (88–89). The authors of the paper explored the meaning of the word odredba (section of a legal act) within the legal framework and the general lexicon (where it can be used as a synonym for a legal act as a whole) in both Serbian and Croatian corpus data. They used distributional analysis whose main ...
... Charles J, and Sue Atkins. 1994. “Starting where the Dictionaries Stop: The Challenge of Corpus Lexicography.” In Computational Approaches to the Lexicon, edited by Sue Atkins and Antonio Zampolli, 349–393. Oxford: OUP. Fillmore, Charles J, Miriam RL Petruck, Josef Ruppenhofer, and Abby Wright. 2003 ...
Aleksandra Marković, Ranka Stanković, Natalija Tomić, Olivera Kitanović. "FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.1
-
Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++
Branislava Šandrih, Ranka Stanković (2020)
In science, industry and many research fields, terminology is developing rapidly. Most often, the language that serves as the lingua franca for most of these fields is English. As a consequence, domain terms in many fields are coined in English and later translated into other languages. In this paper we present an approach to the automatic extraction of bilingual terminology for the English-Serbian language pair that relies on an aligned bilingual domain corpus, a terminology extractor for the target language, and a tool for chunk alignment. We examine the performance of the method on the domain of ...
... Language Resources Association (ELRA), 2012 Constant, Mathieu, Gülşen Eryiğit, Johanna Monti, Lonneke Van Der Plas, Carlos Ramisch et al. “Multiword Expression Processing: A Survey”. Computational Linguistics Vol. 43, no. 4 (2017): 837–892 Cram, D. and B. Daille. “Terminology Extraction with Term ...
... Translation in a Computer Aided Translation Environment”. Natural Language Engineering Vol. 23, no. 5 (2017): 763–788 Baldwin, Timothy and Su Nam Kim. “Multiword Expressions”. Handbook of Natural Language Processing Vol. 2 (2010): 267–292 Bouamor, Dhouha, Nasredine Semmar and Pierre Zweigenbaum. “Identifying ...
... Engineering (TKE 2012), June, 20–21. 2012 Princeton WordNet, 2010 Semmar, Nasredine. “A Hybrid Approach for Automatic Extraction of Bilingual Multiword Expressions from Parallel Corpora”. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), chair) ...
Branislava Šandrih, Ranka Stanković. "Extraction of Bilingual Terminology Using Graphs, Dictionaries and GIZA++" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.6
-
Electronic Dictionaries - from File System to lemon Based Lexical Database
In this paper we discuss some well-known morphological descriptions used in various projects and applications (most notably MULTEXT-East and Unitex) and illustrate the encountered problems on Serbian. We have spotted four groups of problems: the lack of a value for an existing category, the lack of a category, the interdependence of values and categories lacking some description, and the lack of a support for some types of categories. At the same time, various descriptions often describe exactly the same ...
... –http://unitexgramlab.org/ Figure 1: Data categories (markers) dictionary. The main class of the core of the lexicon model is the class LexicalEntry, representing a unit of analysis of the lexicon, which encompasses a set of inflected forms that are grammatically related, and a set of base meanings that ...
... were automatically improved and enriched by introducing new lexical entries and/or lexical relations, and by checking the existing ones. An NLP lexicon has little in common with human-oriented e-dictionary. Data structures in these two types of e-dictionaries are quite different. However, it proved ...
... implemented, neither for lexical database development nor for further processing (Stanković et al., 2013). Finally we considered the lemon model (Lexicon Model for Ontologies), which was derived from LMF, and has been designed for ontology lexicons on the Semantic Web. It is aimed at enriching the ...
Ranka Stanković, Cvetana Krstev, Biljana Lazić, Mihailo Škorić. "Electronic Dictionaries - from File System to lemon Based Lexical Database" in Proceedings of the 11th International Conference on Language Resources and Evaluation - W23 6th Workshop on Linked Data in Linguistics : Towards Linguistic Data Science (LDL-2018), LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
-
Wordnet Development Using a Multifunctional Tool
Ivan Obradović, Ranka Stanković (2007)
In this paper we present a multifunctional tool for manipulating heterogeneous language resources. The tool handles electronic dictionaries, wordnets and aligned texts, and provides for their synchronous use in various tasks. We focus here on the description of the possibilities this tool offers in the development of wordnets. Besides the wordnet module which enables parallel handling of two wordnets, other modules, such as the module for morphological dictionaries and the module for aligned texts, as well as available finite ...
... string(s) as their part. On the other hand, the user can use an Xpath expression to retrieve synsets on basis of various other criteria, such as the domain synsets belong to. Thus, for instance, by means of the Xpath expression: “//SYNSET[DOMAIN='geology']” the user can retrieve all synsets from ...
... Laboratory. They started to develop PWN as a linguistic database that maps the way the mind stores and uses language, namely as some sort of a mental lexicon to be used in the scope of psycholinguistic research projects [6]. PWN was formalized as a semantic network of concepts, abstract ideas or mental ...
... new entries. A new entry can be generated from scratch or by copying an existing lemma, which in some cases facilitates the work. The regular expression or a FST graph describing the inflectional properties of the selected lemma can be inspected and corrected if found inadequate. An important ...
Ivan Obradović, Ranka Stanković. "Wordnet Development Using a Multifunctional Tool" in Proceedings of the International Workshop Computer Aided Language Processing (CALP) '2007, Borovets, Bulgaria, September 2007, - (2007)
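The XPath query quoted in the snippets of this entry can be tried out with a minimal sketch using Python's standard library. Only the `SYNSET` and `DOMAIN` element names come from the snippet; the sample document, the `WN`, `ID` and `LITERAL` elements, and the function name `synsets_in_domain` are hypothetical and do not reflect the actual wordnet file format used by the tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical wordnet fragment; only SYNSET and DOMAIN appear in the snippet.
SAMPLE = """
<WN>
  <SYNSET><ID>s1</ID><DOMAIN>geology</DOMAIN><LITERAL>rock</LITERAL></SYNSET>
  <SYNSET><ID>s2</ID><DOMAIN>biology</DOMAIN><LITERAL>cell</LITERAL></SYNSET>
  <SYNSET><ID>s3</ID><DOMAIN>geology</DOMAIN><LITERAL>mineral</LITERAL></SYNSET>
</WN>
"""


def synsets_in_domain(xml_text, domain):
    """Return all SYNSET elements whose DOMAIN child equals `domain`."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; ".//" searches all
    # descendants of the root, mirroring "//SYNSET[DOMAIN='...']".
    return root.findall(f".//SYNSET[DOMAIN='{domain}']")


geology_ids = [s.findtext("ID") for s in synsets_in_domain(SAMPLE, "geology")]
```

The same predicate form with a different domain value, such as 'administration', would work identically on a document that contains synsets from that domain.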
-
Sentiment Analysis of Serbian Old Novels
In this paper we present the first study of Sentiment Analysis (SA) of Serbian novels from the 1840-1920 period. The preparation of the sentiment lexicon was based on three existing lexicons: NRC, AFINN and Bing, with additional extensive corrections. The first phase of dataset refinement included filtering out the words that are not found in the Serbian morphological dictionary, and in the second phase the automatic POS tags and lemmas were manually corrected. The polarity lexicon was extracted and transformed into ontolex-lemon and published as initial ...
Ranka Stanković, Miloš Košprdić, Milica Ikonić Nešić, Tijana Radović. "Sentiment Analysis of Serbian Old Novels" in Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data, June 2022, Marseille, France, European Language Resources Association (2022)
-
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian
Abusive speech on social media, including profanity, derogatory speech and hate speech, has reached pandemic levels. A system able to detect such texts could help make the internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on English. This paper presents work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated ...
... 2023-10-14 04:19:42 A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih Digital Repository of the Faculty of Mining and Geology, University of Belgrade [ДР РГФ] A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian ...
... present an abusive speech lexicon structure and its enrichment with abusive triggers extracted from the AbCoSER dataset. 2012 ACM Subject Classification Computing methodologies → Natural language processing Keywords and phrases abusive language, hate speech, Serbian, Twitter, lexicon, corpus Digital Object ...
... offensive word lexicon and then collected Twitter messages that contain at least one word from it. They concluded that the presence of a word in a tweet just indicates the possibility of offensive speech, and manual annotation is necessary to guarantee accurate tweets classification. The same lexicon was used ...
Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih. "A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian" in 3rd Conference on Language, Data and Knowledge (LDK 2021), MDPI AG (2021). https://doi.org/10.4230/OASIcs.LDK.2021.13
-
From DELA Based Dictionary to Leximirka Lexical Database
Biljana Lazić, Mihailo Škorić (2020)
In this paper, we will present an approach in transforming Serbian language Morphological dictionaries from a DELA text format to a lexical database dubbed Leximirka. Considering the benefits of storing data within a database when compared to storing them in textual documents, we will outline some of the functionality that the database has made possible. We will also show how hand-made rules that use category labels lexical entries are marked with can be used to link lexical entries. ...
... language that provides linguistic description. The LMF consists of mandatory Core package and additional packages: Morphology Extension, NLP Multiword Expression Patterns, Machine Readable Dictionary, NLP syntax, NLP Semantic Extension and NLP Multilingual Notations. LMF is suitable for encoding morpho- ...
... information is in the MorfPattern table, while the information about the dictionary to which the lexical entry belongs is in the Lexicon table. For one entry in the Lexicon table, that is one dictionary, one or more records of the LexicalEntry table are connected. This means that one or more lexical entries ...
... Documentaire et Linguistique. Morphological dictionaries consist of both simple and multiword units. The basic components of the simple word morphological vocabulary system are DELAS (fr. DELA de formes simple) and DELAF (fr. DELA de formes ...
Biljana Lazić, Mihailo Škorić. "From DELA Based Dictionary to Leximirka Lexical Database" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.4
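The one-to-many relation between the Lexicon and LexicalEntry tables described in the snippets of this entry can be sketched with an in-memory SQLite database. Only the two table names come from the snippet; every column name, the sample rows and the function name `entries_per_lexicon` are assumptions made for illustration and do not reproduce the actual Leximirka schema.

```python
import sqlite3

# In-memory database; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Lexicon (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE LexicalEntry (
    id         INTEGER PRIMARY KEY,
    lemma      TEXT NOT NULL,
    lexicon_id INTEGER NOT NULL REFERENCES Lexicon(id)
);
INSERT INTO Lexicon (id, name)
    VALUES (1, 'simple-words'), (2, 'multiword-units');
INSERT INTO LexicalEntry (lemma, lexicon_id)
    VALUES ('napad', 1), ('tata', 1), ('lingua franca', 2);
""")


def entries_per_lexicon(connection):
    """Count the lexical entries attached to each dictionary."""
    rows = connection.execute("""
        SELECT l.name, COUNT(e.id)
        FROM Lexicon AS l
        LEFT JOIN LexicalEntry AS e ON e.lexicon_id = l.id
        GROUP BY l.id
        ORDER BY l.id
    """)
    return dict(rows.fetchall())


counts = entries_per_lexicon(conn)
```

Each LexicalEntry row points to exactly one Lexicon row, while one Lexicon row may be referenced by many entries, which is the relation the snippet describes.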
-
Using Lexical Resources for Irony and Sarcasm Classification
The paper presents a language dependent model for classification of statements into ironic and non-ironic. The model uses various language resources: morphological dictionaries, sentiment lexicon, lexicon of markers and a WordNet based ontology. This approach uses various features: antonymous pairs obtained using the reasoning rules over the Serbian WordNet ontology (R), antonymous pairs in which one member has positive sentiment polarity (PPR), polarity of positive sentiment words (PSP), ordered sequence of sentiment tags (OSA), Part-of-Speech tags of words (POS) ...
... that carry positive sentiment polarity. We have used in this research the sentiment lexicon developed for sentiment analysis and described in [24]. The lexicon contains 4,593 entries with sentiment polarity values. Lexicon of irony markers (resource B, Fig. 1) which consists of 62 phrases, whose examples ...
... for classification of statements into ironic and non-ironic. The model uses various language resources: morphological dictionaries, sentiment lexicon, lexicon of markers and a WordNet based ontology. This approach uses various features: antonymous pairs obtained using the reasoning rules over the Serbian ...
... of irony markers in lexicon form (resource B in Fig. 1) is a part of the architecture of the suggested model. Ironic tweet classifier (Fig 1) for the purpose of feature construction uses: (1) a set of antonymous pairs (a, z) obtained from the SWN ontology (resource D) a lexicon of irony markers (resource ...
Miljana Mladenović, Cvetana Krstev, Jelena Mitrović, Ranka Stanković. "Using Lexical Resources for Irony and Sarcasm Classification" in Proceedings of the 8th Balkan Conference in Informatics (BCI '17), New York, NY, USA, : ACM (2017). https://doi.org/
-
Towards Automatic Definition Extraction for Serbian
The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in Serbian, with the aim of speeding up dictionary development. Definitions from the dictionary of the Serbian Academy of Sciences and Arts (SANU) were used to model different types of definitions (descriptive, grammatical, referential and synonymous), which have different syntactic and lexical characteristics. The research corpus consists of 61,213 noun definitions, which were analyzed using morphological e-dictionaries and local grammars implemented as finite-state transducers in the corpus processing suite of open ...
... which can be a modification of the source text by adding tags for types of recognized 1 Unitex/GramLab - Lexicon-Based Corpus Processing Suite (https://unitexgramlab.org/) 2 A part of this lexicon is publicly available for use within the Unitex system words or a recognized syntactic structure (Vitas ...
... According to the standard “ISO 1087:2019 (en) Terminology work and terminology science — Vocabulary”, a definition is “representation of a concept by an expression that describes it and differentiates it from related concepts”. This standard distinguishes intentional definition, that conveys the intention ...
... We will illustrate a few models given in the guidelines: if a noun is to be defined by a relative clause, the definition should begin with the expression “онај који ...” (“one who ...”), as opposed to adjectives where definitions begin with “који...” (“which ...”). For abstract nouns ending with -ост ...
Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
-
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with a rule- and lexicon-based system were evaluated on ...
... spaCy. All obtained models, together with a rule- and lexicon-based system were evaluated on two sample texts: a part of the gold standard and an independent newspaper text of approximately the same size. The results show that rule- and lexicon-based system outperforms trained models in all four ...
... Stanković University of Belgrade Faculty of Mining and Geology Belgrade, Serbia ranka@rgf.bg.ac.rs Abstract In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal ...
... NER), spaCy (Honnibal and Montani, 2017) (module written in Python, used for advanced NLP) and many others. For Serbian, thus far a rule-based and lexicon-based NER system was developed – SRPNER (Krstev et al., 2014). Its development started with the recognition of a NE class present in all NE schemes ...
Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names" in Proceedings - Natural Language Processing in a Deep Learning World, Incoma Ltd., Shoumen, Bulgaria (2019). https://doi.org/10.26615/978-954-452-056-4_122
-
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of a gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ...
... likely Part-of-Speech tag” and “simply concatenates lemma from a full lexicon, which corresponds to the chosen Part-of-Speech. Hence, word forms with the same Part-of-Speech, but different lemma cannot coexist in the full lexicon.” A new TreeTagger was produced for this research – TT19, based on ...
... difference being the set of resources used for training. Both the training corpus and the lexicon were expanded. Several smaller annotated corpora were added to Intera: 1984, Švejk and Floods, and the lexicon was expanded to over 2.1+ million tokens (including punctuation and other non-alphanumeric ...
... ML-tagger based on Hidden Markov Models (HMMs) that uses decision trees for smoothing (Schmid, 1999). A manually annotated training corpus, a full lexicon containing all allowed pairs (PoS, lemma) assigned to particular token and a list of PoS-tags related to open class words are required to automati- ...
Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)
-
The Nooj System as Module within an Integrated Language Processing Environment
... Figure 4. The main form for wordnet management The user can also use an Xpath expression to retrieve synsets on basis of various other criteria, such as the domain synsets belong to. Thus, for instance, with the expression: “//SYNSET[DOMAIN='administration']” the user can retrieve all synsets from ...
... To that end the query has to be in the form of a graph or a regular expression. While regular expressions can be dynamically generated, graphs must be prepared through the NooJ interface. Once the graph or regular expression is ready, NooJ offers two possibilities for their application to a text: ...
... a corresponding wordnet synset and transformed to NooJ regular expression+ . Expanded query is saved in “nox” type of syntactic resource. Figure 12. An example of the search with a NooJ regular expression With regular expressions user can use more general patterns since ...
Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
-
EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School
The first training school organized by the NexusLinguarum COST Action was held from 8 to 12 February 2021, with the aim of teaching students, researchers and practitioners the basics of linguistic data science. During the training, participants were introduced to a wide range of topics: from the Semantic Web, RDF and ontologies to modelling and querying language data using state-of-the-art ontological models and tools. The school was held as part of the EUROLAN series of summer schools and was organized virtually (online) by several institutes; ...
linguistic data science, linked data in linguistics, language data, EUROLAN, NexusLinguarum, COST Action, training school
... 2014)), lexicog – lexicography module (Bosque-Gil, Gracia, and Montiel- 6. Data Catalog Vocabulary (DCAT) - Version 2 7. Lemon - Lexicon Model for Ontologies; Lexicon Model for Ontologies: Community Report, 10 May 2016 8. SKOS Simple Knowledge Organization System - home page 9. Protégé 10. VocBench: ...
... Belgrade for the subjects Knowledge representation and Semantic web. The Lemon-OntoLex Frac module was used for representation of the entries from the lexicon used for abusive speech detection with attestations from the Twitter corpus with annotation of abusive spans (Jokić et al. 2021). 3 Organization ...
... al Semantic Web Conference, 98–113. Springer. Jokić, Danka, Ranka Stanković, Cvetana Krstev, and Branislava Šandrih. 2021. “A Twitter Corpus and lexicon for abusive speech detection in Serbian.” In Proceedings of the 2021 Language, Data and Knowledge (LDK), 1-3 September in Zaragoza, Spain. McCrae ...
Milan Dojchinovski, Julia Bosque Gil, Jorge Gracia, Ranka Stanković. "EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.7
-
Football terminology: compilation and transformation into OntoLex-Lemon resource
This paper presents an ongoing project, the creation of the first digital football dictionary in Serbian, and demonstrates the application of the OntoLex model and its modules. The OntoLex-FrAC module includes information on frequency and on usage examples extracted from a corpus. In this case, a domain-specific corpus named СрФудКо was created, containing football news articles in Serbian. Multiword terms were automatically extracted from the Serbian corpus and then manually evaluated and classified as sports or ...
Jelena Lazarević, Ranka Stanković, Mihailo Škorić, Biljana Rujević. "Football terminology: compilation and transformation into OntoLex-Lemon resource" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisbon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj