Indexing of textual databases based on lexical resources: A case study for Serbian
In this paper we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied to a database of geological projects financed by the Republic of Serbia in the last half century. Each document within this database is described by metadata, consisting of several fields such as title, domain, keywords, abstract, geographical location and the like. A bag of words was produced from these ...... However, a large number of other forms cannot be found by scanning the text, for example, the form zlata (genitive singular) cannot be aligned with the query keyword zlato (nominative singular). The disadvantage of the system based on text scanning which affects the precision is especially visible when ...
... improved ranking uses tf-idf measure that is based on frequencies of words allocated to the text, text length, and the document frequency [8]. Indexing is performed in the following steps: 1. Generating a Di text from several records and fields in the database related to a particular document or project; ...
... Query Language) form. The query generated in such a way searches the text of the subset of attributes in the database that correspond to the selected criteria of search. 4 The Improved Solution One of the problems of full text search in Serbian is its rich morphology, where the keyword for search ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Indexing of textual databases based on lexical resources: A case study for Serbian" in Semantic Keyword-based Search on Structured Data Sources : First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers, Springer (2015). https://doi.org/10.1007/978-3-319-27932-9_15
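The zlata / zlato mismatch mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a tiny hand-made form-to-lemma table; the table entries and the function names are hypothetical and stand in for the lexical resources the paper relies on, they are not the authors' implementation.

```python
import re

# Hypothetical toy dictionary mapping inflected forms to their lemma.
# A real system would draw these pairs from a morphological e-dictionary.
FORM_TO_LEMMA = {
    "zlato": "zlato",   # nominative singular
    "zlata": "zlato",   # genitive singular
    "zlatu": "zlato",   # dative/locative singular
    "zlatom": "zlato",  # instrumental singular
}


def lemmatize(token):
    """Return the lemma of a token, or the lowercased token if unknown."""
    return FORM_TO_LEMMA.get(token.lower(), token.lower())


def bag_of_lemmas(text):
    """Build a set of lemmas (a simple bag of words) for a text."""
    return {lemmatize(token) for token in re.findall(r"\w+", text)}


def scan_match(keyword, text):
    """Plain text scanning: succeeds only if the exact form occurs."""
    return keyword.lower() in text.lower()


def index_match(keyword, text):
    """Lemma-based matching against the pre-built bag of lemmas."""
    return lemmatize(keyword) in bag_of_lemmas(text)
```

With the text "Istraživanje ležišta zlata", scan_match("zlato", ...) fails because only the genitive form occurs, while index_match("zlato", ...) succeeds because both forms map to the same lemma.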
A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed ...... for each article, links are offered to the full text of the article in .pdf format (residing on the official site of the INFOtheca journal) as well as the entire aligned parallel text of the article in .html format. More powerful is the full-text search (Figure 5). The user initiates this search ...
... describing the aligned texts) and a body, containing a set of translation units (TU) composed of two or more semantically equivalent translation unit variants (TUVs). Each TUV contains the text (one or more sentences or segments) in one of the TMX document languages, where the text in the first ...
... make it more suitable for the search of collections of aligned texts. 5. User interface The user can search the INFOtheca collection in two different ways. A typical search, yielding a set of aligned concordances is a full text search based on a query, and we will discuss this type ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Aleksandra Trtovac, Miloš Utvić. "A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals" in Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, May 2012, Istanbul, Turkey : European Language Resources Association (2012)
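The TMX structure described above (a body of translation units, each holding translation unit variants in different languages) can be sketched as follows. This is a minimal illustration using Python's standard xml.etree.ElementTree; the sample document is invented and simplified (a real TMX header carries more attributes), and the function names are hypothetical rather than part of Bibliša.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented, simplified sample; not taken from the INFOtheca collection.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="sr"/>
  <body>
    <tu>
      <tuv xml:lang="sr"><seg>Biblioteka je otvorena.</seg></tuv>
      <tuv xml:lang="en"><seg>The library is open.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per TU, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            unit[tuv.get(XML_LANG)] = text
        units.append(unit)
    return units


def search(units, keyword, lang):
    """Return the aligned units whose variant in `lang` contains the keyword."""
    return [u for u in units if keyword.lower() in u.get(lang, "").lower()]
```

Searching the English variants for "library" returns the whole unit, so the Serbian counterpart is available as an aligned concordance line.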
The Effects of Multi-Word Tagging on Text Disambiguation
Utvić Miloš, Obradović Ivan, Krstev Cvetana, Vitas Duško. "The Effects of Multi-Word Tagging on Text Disambiguation" in Proceedings of the 29th International Conference on Lexis and Grammar, LGC 2010, September 2010, Belgrade, Serbia, D. Vitas and C. Krstev (eds.), Belgrade:Faculty of Mathematics, University of Belgrade (2010): 333-342
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with a rule- and lexicon-based system were evaluated on ...... al., 2012) is a web-based tool for text annotation, i.e., for adding notes to existing text documents. It is designed for structured annotation, allowing embedded annotations, which are especially convenient for NER. Annotations are external, so for each text file, an additional annotation file ...
... Petrović After running STANFORDNER on a text, an output is provided in already mentioned CoNLL02 format. We used CoNLL02 → BRAT converter available within NER&BEYOND online tool. Finally, for both SPACY NER and STANFORDNER output files, we applied ANN + TEXT → XML converter offered by Gemini, also ...
... models, together with a rule- and lexicon-based system were evaluated on two sample texts: a part of the gold standard and an independent newspaper text of approximately the same size. The results show that rule- and lexicon-based system outperforms trained models in all four scenarios (measured ... Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names" in Proceedings - Natural Language Processing in a Deep Learning World, Incoma Ltd., Shoumen, Bulgaria (2019). https://doi.org/10.26615/978-954-452-056-4_122
It-Sr-NER: CLARIN Compatible NER and Geoparsing Web Services for Italian and Serbian Parallel Text
Olja Perišić, Ranka Stanković, Milica Ikonić Nešić, Mihailo Škorić. "It-Sr-NER: CLARIN Compatible NER and Geoparsing Web Services for Italian and Serbian Parallel Text" in Linköping Electronic Conference Proceedings, Linköping University Electronic Press (2023). https://doi.org/10.3384/ecp198010
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ... Keywords: Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentation. Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja. "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
A Mathematical Learning Environment Based on Serbian Language Resources
In recent years, in line with the ever growing usage of information technology, learning environments are changing. The amount of available learning materials in various forms has increased. These new environments demand comprehensive learning systems, which enable management of the learning corpus with special attention paid to relevant lexical resources. In this paper we present the concept of a Mathematical Learning Environment in Serbian (MLES), which is based on a corpus of mathematical materials and various lexical resources, enabling ...... indexed. User requests are also converted to LaTeX, and the search proceeds as with ordinary text. Another relevant project is MathGo! that provides search and presentation of mathematically encoded text [8]. The software solution is based on the concepts of math block identification and vector ...
... mathematical support in solving real life problems from engineering practice. To that end complex issues had to be resolved, such as mathematical text analysis, processing of mathematical content in different formats, search of mathematical materials, indexing of mathematical content using Serbian ...
... outlines the structure and solutions for MLES, as well as the main features of its already developed components. Keywords: mathematical content; text processing; mathematical formulae 1. INTRODUCTION Rapid development of information technology, resulting in a growing number and availability ... Radojičić Marija, Obradović Ivan, Stanković Ranka, Utvić Miloš, Kaplar Sebastijan. "A Mathematical Learning Environment Based on Serbian Language Resources" in Proceedings of the 7th International Scientific Conference Technics and Informatics in Education, Faculty of Technical Sciences, Čačak (2018)
Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking
This paper presents the results of research related to the preparation of parallel corpora, focusing on their transformation into RDF graphs using the NLP Interchange Format (NIF) for linguistic annotation. We give an overview of the parallel corpus used in this case study, as well as the process of part-of-speech tagging, lemmatization and named entity recognition (NER). We then describe named entity linking (NEL), the conversion of the data into RDF, and the incorporation of NIF annotations. The produced NIF files were evaluated by exploring the triplestore with SPARQL queries. Finally, the linking of Linked ... Keywords: parallel corpora, named entity linking, named entity recognition, NER, NEL, linked data, NIF, Wikidata. Ranka Stanković, Milica Ikonić Nešić, Olja Perišić, Mihailo Škorić, Olivera Kitanović. "Towards Semantic Interoperability: Parallel Corpora as Linked Data Incorporating Named Entity Linking" in Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, Turin, 20-25 May 2024, ELRA and ICCL (2024)
The Nooj System as Module within an Integrated Language Processing Environment
... multilingual texts. WS4LR handles aligned texts as well. A pair of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.) is known as an aligned text or bitext. One of the supported ...
... resources management 4.1. Parallel Text Management The WS4LR module for management of aligned parallel texts uses texts which have previously been aligned using Xalign as an alignment tool (Bonhomme 2001). Parallel texts which usually originate from a text in one language and its translation ...
... 7 depicts the form with different possibilities for TMX document management. Aligned texts can be visualized in various ways by choosing the appropriate XSLT stylesheet. Namely, the user can obtain the aligned text in HTML format, but also in textual, XML, tabular or TMX format. Figure 7. The ... Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published ... Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić. "Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection" in Proceedings of the Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
Softverski alati za korišćenje resursa za srpski jezik
... original text and its translations in two or more languages, and are called multilingual parallel texts. In the majority of cases, parallel texts are being aligned, which turns a parallel text into an aligned text. Sometimes, it is even considered that parallel texts are the same as aligned texts ...
... case, since non-aligned parallel texts are also sometimes being used (Ohmori and Higashida, 1999). The procedure of transforming a parallel text into an aligned text consists of two basic steps. In the first step parallel texts are split into segments, that is, basic units of text. Usually, sentences ...
... developed by (Gale and Church, 1993). Figure 2 depicts an example of an aligned text represented in the WS4LR tool. It is a legal text in English and Serbian, aligned at the sentence level. Figure 2. Example of an aligned text Parallel corpora are very useful in the research pertaining to bilingual ... Ivan Obradović, Ranka Stanković. "Softverski alati za korišćenje resursa za srpski jezik" in INFOteka: časopis za informatiku i bibliotekarstvo, Belgrade, Serbia : Zajednica biblioteka univerziteta u Srbiji (2008)
SrpELTeC: A Serbian Literary Corpus for Distant Reading
The article presents SrpELTeC, a corpus developed within the COST Action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared and annotated following the common principles established for all language collections in the European Literary Text Collection (ELTeC). The challenges and solutions in preparing SrpELTeC from scratch are described. All novels were manually encoded in TEI, with rich metadata and structural annotation. Automatic annotation included POS tagging, lemmatization and named entities, relying on resources for processing ... Keywords: digital humanities, Serbian literature, text corpora, distant reading, linked data, named entity recognition, text analytics. Ranka Stanković, Cvetana Krstev, Duško Vitas. "SrpELTeC: A Serbian Literary Corpus for Distant Reading" in Primerjalna književnost, Research Centre of the Slovenian Academy of Sciences and Arts (2024). https://doi.org/10.3986/pkn.v47.i2.03
Transformer-Based Composite Language Models for Text Evaluation and Classification
Parallel natural language processing systems were previously successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modeling, for which they achieved significantly better results than independent methods in the cases of seven European languages. The aim of this paper is to present the advantages of using composite language models in the processing and evaluation of texts written in arbitrary highly inflective and morphology-rich natural language, particularly Serbian. A perplexity-based dataset, the main asset for the ...Mihailo Škorić, Miloš Utvić, Ranka Stanković. "Transformer-Based Composite Language Models for Text Evaluation and Classification" in Mathematics, MDPI AG (2023). https://doi.org/10.3390/math11224660
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources
Large collections of textual documents represent an example of big data that requires the solution of three basic problems: the representation of documents, the representation of information needs and the matching of the two representations. This paper outlines the introduction of document indexing as a possible solution to document representation. Documents within a large textual database developed for geological projects in the Republic of Serbia for many years were indexed using methods developed within digital humanities: bag-of-words and named ...... Surrogates can also contain an abstract and/or a snippet, a relevant text fragment. The content of a document surrogate, or its part, can be generated automatically by extracting and selecting specific terms (words) from the document text. Language processing methods and techniques developed within the ...
... textual content of the geological project. Future plans include digitalization and full text archiving of the project content, followed by the implementation of the approach described in this paper to this future full text database. 2.2 The Initial Solution for Document Retrieval The initial solution for ...
... normalizing length [8]. The improved system ranking uses several measures, starting with tf-idf measure based on frequencies of words allocated to the text, text length, and the document frequency [14]. Further development included modification of tf-idf with cosine normalization (tfc tfc), tfc nfc term weighting ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources" in Trans. Computational Collective Intelligence - Lecture Notes in Computer Science 26, Springer (2017). https://doi.org/10.1007/978-3-319-59268-8_8
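The ranking idea mentioned above (tf-idf weights followed by cosine normalization) can be sketched as below. This is a generic textbook formulation under stated assumptions: raw term frequency, idf = log(N / df), and a query scored by summing the normalized weights of its terms. The exact weighting variants used in the paper are elided in the excerpt and are not reproduced here; the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute cosine-normalized tf-idf vectors for tokenized documents.

    docs is a list of token lists. Assumed weighting: tf * log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        weights = {t: term_freq[t] * math.log(n_docs / doc_freq[t])
                   for t in term_freq}
        # Cosine normalization: divide by the Euclidean length of the vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def score(query_terms, vector):
    """Score a document by summing its weights for the distinct query terms."""
    return sum(vector.get(t, 0.0) for t in set(query_terms))
```

Because every document vector has unit length, long documents do not gain an advantage simply by repeating terms, which is the purpose of length normalization in ranking.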
WS4LR - a Workstation for Lexical Resources
... in Appendix B. 2.3 Aligned Texts A pair of semantically equivalent texts in different languages, such as an original text and its translation, that are aligned on a structural level (paragraph, sentence, phrase, etc.) is known as an aligned text or bitext. Aligned texts are usually constructed ...
... production of Intex/Unitex graphs that locate all literals from a chosen synset in a text, with or without synset hypernyms. 3.4 Working with Aligned Texts The module uses texts which have previously been aligned using Xalign as an alignment tool and converts them to TMX format, or texts that are ...
... 5-nholo_member ENG20-07295527-n C. Format of Aligned Text Serbian Originalholo_part Sportska prognoza je igra u kojoj učesnik, popunjavanjem listića koji izdaje priređivač igre na kojem su ... Cvetana Krstev, Ranka Stanković, Duško Vitas, Ivan Obradović. "WS4LR - a Workstation for Lexical Resources" in Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006, ELRA - European Language Resources Association (2006)
A Lexical Approach to Acronyms and their Definitions
In this paper we present a comprehensive approach to acronyms for Natural-Language Processing (NLP) of Serbian texts. The proposed procedure includes extraction of acronyms and their definitions that are usually Multi-Word Units (MWUs), shallow parsing of MWUs that enables MWU lemmatization and production of entries in morphological electronic dictionaries, both for MWU and acronyms, that are provided with grammatical, syntactic, semantic and domain information. This approach enables representation that reflects complex relations between acronyms and their definitions. ... biomedical text. In Pacific Symposium on Biocomputing, volume 8. World Scientific. Spasic, I., S. Ananiadou, J. McNaught, and A. Kumar, 2005. Text mining and ontologies in biomedicine: making sense of raw text. Briefings in bioinformatics, 6(3):239–251. Taylor, Paul, 2009. Text-to-speech synthesis ...
... lexicons. However, their adequate treatment is crucial for many applications, e.g. text-to-speech systems (Taylor, 2009), machine translation (Wolinski et al., 1995), indexing for information retrieval and text classification. In order to adequately treat acronyms a link between them and a name ...
... tual Incompletness. In Proc. of the Corpus Linguistics Conference, Birmingham. Liberman, Mark Y and Kenneth W Church, 1992. Text analysis and word pronunciation in text-to-speech synthesis. Advances in speech signal processing:791–831. Moon, S., S. Pakhomov, and G. B. Melton, 2012. Automated ... Cvetana Krstev, Duško Vitas, Ranka Stanković. "A Lexical Approach to Acronyms and their Definitions" in Proceedings of the 7th Language & Technology Conference, November 27-29, 2015, Poznań, Poland, Springer (2015)
Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian
Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek. "Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian" in Language and Text: Data, models, information and applications, John Benjamins Publishing Company (2021). https://doi.org/10.1075/cilt.356.04ruj
Creation of a Training Dataset for Question-Answering Models in Serbian
The development and application of artificial intelligence in language technologies have advanced significantly in recent years, particularly in the domain of question answering (QA). While existing resources for QA tasks have been developed for the world's major languages, Serbian has been relatively neglected in this area. This paper presents an initiative to create a large and diverse dataset for training question-answering models in Serbian, which will contribute to the advancement of language technologies for Serbian. In addition to numerous studies on language models ... Keywords: artificial intelligence, natural language processing, language resources, annotated datasets, information extraction, question answering. Ranka Stanković, Jovana Rađenović, Maja Ristić, Dragan Stankov. "Creation of a Training Dataset for Question-Answering Models in Serbian" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)