92 items
Keyword-Based Search on Bilingual Digital Libraries
This paper outlines the main features of Biblisha, a tool that offers various ways of enhancing queries submitted to large collections of aligned parallel text residing in a bilingual digital library. Biblisha supports keyword queries as an intuitive way of specifying information needs. Keyword queries, initiated in Serbian or English, can be expanded semantically, morphologically, and into the other language, using different supporting monolingual and bilingual resources. Terminological and lexical resources are of various types, such as wordnets, electronic ... Ranka Stanković, Cvetana Krstev, Duško Vitas, Nikola Vulović, Olivera Kitanović. "Keyword-Based Search on Bilingual Digital Libraries" in Semantic Keyword-Based Search on Structured Data Sources - Second COST Action IC1302 International KEYSTONE Conference, IKC 2016, Springer (2017). https://doi.org/10.1007/978-3-319-53640-8_10
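The three kinds of expansion named in the abstract above (semantic, morphological, and cross-language) can be pictured with a minimal sketch. It is not the Biblisha implementation: the tiny lookup tables stand in for the wordnets, morphological e-dictionaries, and bilingual resources the abstract mentions, and the function name `expand_query` and its flags are assumptions made for illustration only.

```python
from typing import Dict, List

# Toy stand-ins for real lexical resources (all entries are illustrative).
INFLECTIONS: Dict[str, List[str]] = {
    "kuća": ["kuća", "kuće", "kući", "kuću"],
    "house": ["house", "houses"],
}
SYNONYMS: Dict[str, List[str]] = {"kuća": ["dom"]}
TRANSLATIONS: Dict[str, List[str]] = {"kuća": ["house"], "dom": ["home"]}


def expand_query(keyword: str,
                 morphological: bool = True,
                 semantic: bool = True,
                 bilingual: bool = True) -> List[str]:
    """Return the keyword plus its expansions, in a stable order."""
    lemmas = [keyword]
    if semantic:
        # Semantic expansion: add synonyms of the keyword.
        lemmas += SYNONYMS.get(keyword, [])
    if bilingual:
        # Cross-language expansion: add translations of every lemma so far.
        for lemma in list(lemmas):
            lemmas += TRANSLATIONS.get(lemma, [])
    terms: List[str] = []
    for lemma in lemmas:
        # Morphological expansion: add the known inflected forms of each lemma.
        forms = INFLECTIONS.get(lemma, [lemma]) if morphological else [lemma]
        for form in forms:
            if form not in terms:
                terms.append(form)
    return terms
```

Calling `expand_query("kuća")` with the toy tables returns the Serbian inflected forms, the synonym, and the English translations with their own forms; switching a flag off removes the corresponding expansion.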
Vebran Web Services for Corpus Query Expansion
Ranka Stanković, Miloš Utvić (2020) This paper discusses the development of the Vebran web services and their application in improving corpus search. The Vebran web services are used to consult external lexical resources for Serbian (mainly electronic morphological dictionaries and the Serbian WordNet) and to expand user queries in order to obtain more relevant results from Serbian corpora. ... paragraph, sentence) are annotated in some particular corpus texts, especially those which are part of aligned corpora. The SrpKor2013 corpus is used by more than 700 users, mostly Slavists. 2.2 RudKor Systematic collection and preparation of texts from the mining domain started with English-Serbian alignment ...
... 122 million corpus words. It includes literary texts of Serbian writers in the XX and XXI centuries, as well as scientific and popular science texts from different domains (natural and social sciences), administrative and general texts. The general texts represent articles from the daily newspapers ...
... subset SrpLemKor; – SrpEngKor, aligned English-Serbian corpus including subcorpus SELFEH (Serbian-English Law Finance Education and Health) with documents on finance, health, law and education; – SrpFranKor, aligned French-Serbian corpus; – SrpNemKor, aligned German-Serbian corpus; – RudKor, a ... Ranka Stanković, Miloš Utvić. "Vebran Web Services for Corpus Query Expansion" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.5
A Lexical Approach to Acronyms and their Definitions
In this paper we present a comprehensive approach to acronyms for Natural-Language Processing (NLP) of Serbian texts. The proposed procedure includes extraction of acronyms and their definitions, which are usually Multi-Word Units (MWUs), shallow parsing of MWUs that enables MWU lemmatization, and production of entries in morphological electronic dictionaries, both for MWUs and acronyms, that are provided with grammatical, syntactic, semantic and domain information. This approach enables a representation that reflects complex relations between acronyms and their definitions. ... contains 70% of newspaper texts (57% daily, 8% weekly and 5% monthly newspapers) and 6% of monographs and textbooks (Krstev and Vitas, 2005), which are types of texts that tend to use acronyms and provide definitions. Besides that we used two more samples of newspaper texts (having 600 thousand and ...
... Sgarbas, and S. Panagiotopoulou, 2014. Acronym identification in Greek legal texts. Literary and Linguistic Computing, 30(3):440–541. Wolinski, F., F. Vichot, and B. Dillet, 1995. Automatic processing of proper names in texts. In Proceedings of the 7th conference on European chapter of the ACL. Morgan ...
... ac.rs, †ranka@rgf.bg.ac.rs Abstract In this paper we present a comprehensive approach to acronyms for Natural-Language Processing (NLP) of Serbian texts. The proposed procedure includes extraction of acronyms and their definitions that are usual Multi-Word Units (MWUs), shallow parsing of MWUs that ...Cvetana Krstev, Duško Vitas, Ranka Stanković. "A Lexical Approach to Acronyms and their Definitions" in Proceedings of the 7th Language & Technology Conference, November 27-29, 2015, Poznań, Poland, Springer (2015)
On the compatibility of lexical resources for NooJ
Lexical resources for many languages are provided for the NooJ linguistic development environment. Meta-data descriptions of morphosyntactic and semantic properties of these languages and their resources are a mandatory part of each language module. In this paper we analyze how well the meta-data actually describe resources for a chosen subset of languages and to what extent they are compatible across languages to support multilingual processing. We show that there is room for improvement in both directions. ... applied to texts of Verne’s novel and a linguistic analysis of the results obtained was performed. The analyzed texts were in XML format in compliance with TEI, and their alignment was performed at the sentence level using the ACIDE system (Obradović et al 2008), which can handle aligned texts in various ...
... http://www.meta-net.eu/projects/cesar/ texts of Jules Verne’s novel “Around the world in eighty days” in the same languages was performed. These seven languages were selected due to the fact that both NooJ resources and aligned versions of this novel were available for them. The resources ...
... Dictionary Properties’ Definition files. The section that follows outlines the results of lexical analysis of the application of NooJ resources to aligned texts. Finally, a section is dedicated to some related issues of compatibility and standardization. The paper ends with concluding remarks. Comparison ...Ranka Stanković, Miloš Utvić, Duško Vitas, Cvetana Krstev, Ivan Obradović. "On the compatibility of lexical resources for NooJ" in Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the 2011 International Nooj Conference, Cambridge Scholars Publishing (2012): 96-108
Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian
Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek (2021) Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek. "Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian" in Language and Text: Data, models, information and applications, John Benjamins Publishing Company (2021). https://doi.org/10.1075/cilt.356.04ruj
New Language Models for South Slavic Languages
Mihailo Škorić (2024) The talk will present the challenges and perspectives of modelling South Slavic languages, with particular reference to general language models built on the transformer architecture (BERT, GPT), to the text datasets available for training those models, and to the quantity and quality of those datasets. The talk will offer an overview of the available datasets and models, with special attention devoted to the newest text corpora. The first corpus, Kišobran, is an umbrella web corpus of South Slavic languages and currently the largest text corpus in the region, comprising over eighteen billion words and including all ... Mihailo Škorić. "New Language Models for South Slavic Languages" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system that would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ... Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentation. Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja. "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking
The paper presents the results of research on the preparation of parallel corpora, focusing on their transformation into RDF graphs using the NLP Interchange Format (NIF) for linguistic annotation. We provide an overview of the parallel corpus used in this case study, as well as the process of part-of-speech tagging, lemmatization, and named entity recognition (NER). We then describe named entity linking (NEL), the conversion of the data into RDF, and the inclusion of NIF annotations. The produced NIF files were evaluated by exploring the triplestore with SPARQL queries. Finally, the linking of Linked ... parallel corpora, named entity linking, named entity recognition, NER, NEL, linked data, NIF, Wikidata. Ranka Stanković, Milica Ikonić Nešić, Olja Perisic, Mihailo Škorić, Olivera Kitanović. "Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking" in Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, Turin, 20-25 May 2024, ELRA and ICCL (2024)
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ... ... queries in form of regular expressions and graphs; text transformations; processing of monolingual and bilingual texts (bitexts in which basic segments are aligned). Unitex is freely distributed under the terms of the Lesser General Public License (LGPL). This means that everyone ...
... language utterances, as well as enabling various forms of human-machine interaction. It becomes very important in view of the rising amount of texts and data on the web. The term NLP is also used to describe the function of software or hardware components in a computer system which analyze ...
... various languages, but mainly in Serbian, both in video and audio format, but also in written form as parallel (multilingual) corpora of lessons and texts, supported by electronic terminological resources[10], services, and functionalities for searching and browsing of terminological resources and ... Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)
A Tel Platform Blending Academic And Entrepreneurial Knowledge
... language support system also handles aligned texts or bitexts, pairs of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.). Aligned texts in BAEKTEL enable better understanding ...
... understanding of OER and follow the standard format for representing aligned texts, the Translation Memory eXchange format (TMX) that is XML-compliant. It should finally be mentioned that due to the complex Serbian grammar the language support system also features grammars implemented ...
... the multilingual approach, the BAEKTEL platform provides electronic terminological resources, parallel (multilingual) corpora of lessons and texts in written form, and functionalities for searching and browsing of terminological resources and using them for text annotation. The contents of ... Ivan Obradović, Ranka Stanković, Jelena Prodanović, Olivera Kitanović. "A Tel Platform Blending Academic And Entrepreneurial Knowledge" in Proceedings of the Fourth International Conference on e-Learning (eLearning-2013), September 2013, Belgrade, Serbia, Belgrade, Serbia : Belgrade Metropolitan University (2013)
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%). ... concepts, not between terms (Clarke and Zeng, 2012). However, information-retrieval thesauri are not intended for use in automatic processing of texts: they should be used in manual indexing by human experts for improvement of information retrieval in physical or digital libraries. Thus, there exist ...
... news flows because they are, in fact, lists of selected keywords, denoting the most significant concepts of the domain, with low coverage of real texts (Mdivani, 2013). There are also several Russian versions of international information-retrieval thesauri or controlled vocabularies (Lipscomb, 2000) ...
... are more similar to information-retrieval thesauri guidelines (NISO, 2005). Each concept is linked with words and phrases conveying the concept in texts (text entries). Detailed description of lexical units (words in specific senses), representation of senses of ambiguous words are closer to wordnets ...Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
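The two-stage idea in the diacritic restoration abstract above (a dictionary proposes candidates, further evidence chooses among them) can be sketched roughly as follows. This is a simplified illustration, not the authors' procedure: the lexicon and its counts are invented, a plain frequency ranking stands in for the SrpKor data and local grammars the abstract describes, and the names `degrade`, `candidates`, `restore_word`, and `restore_text` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Mapping from Serbian Latin letters with diacritics to their degraded spelling.
DEGRADE: Dict[str, str] = {"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"}

# Toy lexicon with invented frequency counts (a stand-in for dictionary and corpus data).
LEXICON: Dict[str, int] = {"kuća": 120, "kuca": 15, "šuma": 80, "suma": 40}


def degrade(word: str) -> str:
    """Lowercase the word and strip diacritics the way degraded text does."""
    return "".join(DEGRADE.get(ch, ch) for ch in word.lower())


# Index every lexicon entry under its degraded spelling.
INDEX: Dict[str, List[str]] = defaultdict(list)
for entry in LEXICON:
    INDEX[degrade(entry)].append(entry)


def candidates(word: str) -> List[str]:
    """All lexicon words that degrade to the same spelling, most frequent first."""
    return sorted(INDEX.get(degrade(word), []), key=lambda w: -LEXICON[w])


def restore_word(word: str) -> str:
    """Pick the top-ranked candidate; leave unknown words unchanged."""
    ranked = candidates(word)
    return ranked[0] if ranked else word


def restore_text(text: str) -> str:
    """Restore each whitespace-separated token independently (no context used)."""
    return " ".join(restore_word(token) for token in text.split())
```

Because this sketch ranks candidates by frequency alone, it always resolves an ambiguous form such as `kuca` the same way; resolving it in context is the step for which the abstract relies on corpus data and local grammars.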
Using technology for knowledge transfer between academia and enterprises
Ivan Obradović, Ranka Stanković (2014) ... In addition to that, textual resources feature aligned texts and corpora. Aligned texts are pairs of texts in different languages, mainly an original and its translation, aligned on some structural level, most often the sentence. Aligned texts in LSS are in the standard, Translation Memory eXchange ...
... eXchange (TMX) format, which is XML-compliant. Corpora are large and structured sets of texts, both monolingual and multilingual, the latter often composed of aligned texts. Finally the web itself represents a textual resource that LSS makes use of. Specific features of Serbian grammar need c ...
... Serbian Wordnet. Romanian Journal of Information Science and Technology, 7(1-2), 147-161. Krstev C., (2008). Processing of Serbian – Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade. Lee, W. O. (2008). The repositioning of high education from ...Ivan Obradović, Ranka Stanković. "Using technology for knowledge transfer between academia and enterprises" in Knowledge and Management Models for Sustainable Growth, Proc. of IFKAD 2014, 9th International Forum on Knowledge Asset Dynamics, 11-13 June 2013, Matera, Italy, Bari : IFKAD (2014)
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely ...
... (MWT) extraction as this problem has been gaining in importance in the field of Natural Language Processing. Initially, MWT extraction from domain texts has been tackled mainly using the statistical approach based on different statistical measures, following the seminal work of Kenneth Church and ...
... documents (Chen et al., 2006). Statistical measures of co-occurrence (MI3 – mutual information) were used for finding MWT candidates in Croatian texts (Tadić & Šojat, 2003). Although the statistical approach has been steadily pursued by a number of researchers, development of lexical resources ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23-28 May 2016, European Language Resources Association (2016)
SrpELTeC: A Serbian Literary Corpus for Distant Reading
The article presents SrpELTeC, a corpus developed within the COST Action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared, and annotated using common principles established for all language collections in the European Literary Text Collection (ELTeC). The challenges and solutions in preparing SrpELTeC from scratch are described. All novels were manually encoded in TEI with rich metadata and structural annotations. Automatic annotation included POS tagging, lemmatization, and named entities, relying on resources for processing ... digital humanities, Serbian literature, text corpora, distant reading, linked data, named entity recognition, text analytics. Ranka Stanković, Cvetana Krstev, Duško Vitas. "SrpELTeC: A Serbian Literary Corpus for Distant Reading" in Primerjalna književnost, Research Centre of the Slovenian Academy of Sciences and Arts (2024). https://doi.org/10.3986/pkn.v47.i2.03
A Mathematical Learning Environment Based on Serbian Language Resources
In recent years, in line with the ever-growing use of information technology, learning environments have been changing. The amount of available learning materials in various forms has increased. These new environments demand comprehensive learning systems, which enable management of the learning corpus with special attention paid to relevant lexical resources. In this paper we present the concept of a Mathematical Learning Environment in Serbian (MLES), which is based on a corpus of mathematical materials and various lexical resources, enabling ... ... challenge to corpus processing results from the use of two alphabets: Latin and Cyrillic, with different coding schemas and formats of source texts, as well as from various ways of expressing mathematical content. In order to resolve the problem of two alphabets, the entire corpus is tran ...
... real life problems from engineering practice based on mathematical concepts (Figure 3). Results of the third component are annotated and linked texts, where every mathematical term in the text is linked to the appropriate dictionary entry or relevant corpus content related to that term. This ...
... GOALS AND CHALLENGES Searching and processing mathematical materials is a complex problem. Standard text processors cannot recognize mathematical texts in a proper way. There is thus a need for developing new and adapting existing processors for that purpose. Processing of mathematical content ... Radojičić Marija, Obradović Ivan, Stanković Ranka, Utvić Miloš, Kaplar Sebastijan. "A Mathematical Learning Environment Based on Serbian Language Resources" in Proceedings of the 7th International Scientific Conference Technics and Informatics in Education, Faculty of Technical Sciences, Čačak (2018)
Integrisano okruženje za pripremu paralelizovanog korpusa
The development of aligned corpora requires the preparation of parallel texts for their integration into an aligned corpus. This is a complex task that can be solved in different ways and that has to be carried out in several steps. This paper first outlines the procedure for preparing parallel texts for an aligned corpus that is used in the Human Language Technology Group at the University of Belgrade. A short overview is then given of the programs (XAlign, Concordancier, WS4LR), that is, the software tools used in the process. The lack of a comfortable environment ... ... the IJS-ELAN Parallel Corpus. Informatica, 26(3), pp. 299-307, 2002. SUMMARY The development of aligned corpora requires a preparation of parallel texts for their integration into aligned corpora. This is a very complex task, which can be solved in different ways, and which has to be realized ...
... steps. At the beginning of this paper we outline the procedure for preparation of parallel texts for aligned corpora which is being used in the Human Language Technology Group at the University of Belgrade. Texts are marked using XML tags, in accordance with the TEI (Text Encoding Initiative) consortium ...
... environment for the preparation of aligned corpora, under the name of ACIDE. For the construction of this environment we chose the C# programming language. Among other things, ACIDE provides a graphical user interface (GUI) for alignment and visualization of aligned texts, their control and correction ...Ivan Obradović, Ranka Stanković, Miloš Utvić. "Integrisano okruženje za pripremu paralelizovanog korpusa" in Zbornik radova međunarodnog simpozijuma Razlike između bosanskog/bošnjačkog, hrvatskog i srpskog jezika, Graz, Austria, April 2007, - (2007)
Building learning capacity by blending different sources of knowledge
... of storing specific textual resources, such as aligned texts and corpora. Aligned texts are pairs of texts in different languages, mainly an original and its translation, aligned on some structural level, most often the sentence. Aligned texts in BMP are in the standard, Translation Memory eXchange ...
... eXchange (TMX) format, which is XML-compliant. Corpora are large and structured sets of texts, both monolingual and multilingual, the latter often composed of aligned texts. Finally the World Wide Web itself represents a textual resource that BMP language support system makes use of. The ...
... In Digital Repositories: Practices and Perspectives, D-Lib Magazine, Volume 16, Number 1/2. Krstev C., (2008). Processing of Serbian – Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade. Lee, W. O. (2008). The repositioning of high education from ...Ivan Obradović, Ranka Stanković, Olivera Kitanović, Dalibor Vorkapić. "Building learning capacity by blending different sources of knowledge" in International Journal of Learning and Intellectual Capital (2016). https://doi.org/10.1504/IJLIC.2016.075698
Towards Automatic Definition Extraction for Serbian
The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in Serbian, with the aim of accelerating dictionary development. Definitions in the dictionary of the Serbian Academy of Sciences and Arts (SASA) were used to model different types of definitions (descriptive, grammatical, referential, and synonym-based), which have different syntactic and lexical characteristics. The research corpus consists of 61,213 noun definitions, which were analyzed using morphological e-dictionaries and local grammars implemented as finite-state transducers in the corpus-processing package of open ... ... should be used for extraction from unstructured texts than are necessary when modelling dictionary definitions. 5 Conclusion The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language, with the aim of a ...
... Serbia Abstract The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language with the aim of accelerating dictionary development. Definitions in the Serbian Academy of Sciences and Arts (SASA) dictionary ...
... (2019) associate a detailed annotation scheme with the corpus in order to explore diverse structures of term definitions in free and semi-structured texts. In addition to the basic concept (Term) and its main definitions (Definition), sentence segments containing pseudonyms or additional names (Alias ... Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković (2018) Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
An Approach to Development of Bilingual Lexical Resources
... [Information Storage and Retrieval]: Digital Libraries – Collection General Terms Documentation, Languages Keywords Digital libraries, aligned parallel texts, TMX document collections, multilingual lexical resources, bilingual search 1. INTRODUCTION Multilingual information exchange is growing ...
... collection was generated from INFOtheca articles using another of our tools, named ACIDE, an integrated development environment for generating aligned parallel texts [Obradović et al., 2008]. As for available lexical resources, we had at our disposal Serbian morphological e-dictionaries [Krstev, 2008] ...
... wordnets connected via the interlingual index, and a bilingual Dictionary of Librarianship, as well as on a TMX document collection generated from aligned Serbian-English journal articles published in INFOtheca, a scientific journal in the area of Library and Information Sciences. The aim of the new ...Stanković Ranka, Obradović Ivan, Trtovac Aleksandra. "An Approach to Development of Bilingual Lexical Resources" in Proceedings of the Fifth Balkan Conference in Informatics BCI 2012, Workshop on Computational Linguistics and Natural Language Processing of Balkan Languages – CLoBL 2012, September 2012, Novi Sad : BCI (2012)