Search
82 items
-
The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic ...LR web services, MultiWord Expressions & Collocations, Information Extraction, Information Retrieval... čaša soka od paradajza (The ingredients for 10 portions: 3 onions, 1 cup of oil, ½ glass of white wine, 1 glass of tomato juice.) This false retrieval occurs because two constituents of the multi-word term are treated separately, and neither nearness conditions nor grammatical agreement conditions ...
... compounds words, but on the inflectional transducers as well. This enables a more elaborate query expansion that can significantly improve retrieval performances. For instance, if a query is performed with the keyword beli luk, three inflectional transducers are used: one for inflection of ...
... this query expands into only 12 combinations of an adjective form and a noun form, instead of 216 possible combinations, thus disabling false retrieval such as: Tako, posmatrano sa dna vidika, izgleda kao da iz širokih lukova belog mosta teče i razliva se ne samo zelena Drina… ‘Thus, from a bottom ...Krstev Cvetana, Stanković Ranka, Vitas Duško, Obradović Ivan. "The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines" in LREC 2008: Conference on Language Resources and Evaluation, Marrakesh, Morocco, May 2008, European Language Resources Association (ELRA) (2008)
-
Managing mining project documentation using human language technology
Purpose: This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach: The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings: The ...Digital libraries, Information retrieval, Data mining, Human language technologies, Project documentationAleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović, Božo Kolonja . "Managing mining project documentation using human language technology" in The Electronic Library (2018). https://doi.org/10.1108/EL-11-2017-0239
-
Medical Domain Document Classification via Extraction of Taxonomy Concepts from MeSH Ontology
Mihailo Škorić, Mauro Dragoni (2019) This paper is a result of a task that was presented to attendants of Keyword Search in Big Linked Data summer school, that was organized by Vienna University of Technology, under the Keystone COST action in the summer of 2017. It presents a specific approach to the classification via creation of minimal document surrogates based on the US National medical library’s MeSH ontology, which is derived from the Medical Subject Headings thesaurus. In a series of previously classified medically ...... concurs with current standards in the field of classification of medical records (Caĺi et al., 2017). A major problem with concept-oriented information retrieval in the biomedical sphere is the large number of misclassified documents, leading to a very low response rate. Low precision is thus acceptable ...
... use, where a number of terms is associated with English and Latin equivalents, allowing for the extension of the search for concept names and their retrieval in documents. However, rich Serbian language morphology should be taken into account 68 Infotheca Vol. 19, No. 1, September 2019 Scientific paper ...
... Trieschnigg, Dolf, Piotr Pezik, Viv Lee, Franciska de Jong, Wessel Kraaij et al. “MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval”. Bioinformatics (Oxford, England) Vol. 25 (2009): 1412–8 ...Mihailo Škorić, Mauro Dragoni. "Medical Domain Document Classification via Extraction of Taxonomy Concepts from MeSH Ontology" in Infotheca, Faculty of Philology, University of Belgrade (2019). https://doi.org/10.18485/infotheca.2019.19.1.3
-
Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons
Mihailo Škorić (2017) The goal of this paper is to draw attention to the possibility of using emoticon-riddled text on the web in language-neutral sentiment analysis. It introduces several innovations in the existing framework of research and tests their effectiveness. It also presents a software tool especially made for that purpose, explains how it builds a database with sentimental value of terms and offers the user manual. Finally, it presents a software tool that tests the new database and gives some examples ...... research to quickly and efficiently collect large amounts of information. Developing intelligent systems that work with information: – Information retrieval: retrieval of specific information in the text, as well as finding information that can not be precisely defined. Classification of texts according ...
... good idea to ignore any available resources and that could potentially be useful. The main idea of this research is to use data mining methods for retrieval of metadata – in shape of determiners, that users of social networks inadvertently use in their messages (in the form of emoticons or language-universal ...
... it presents a software tool that tests the new database and gives some examples of the analysis of the obtained results. KEYWORDS: data mining, information extraction, emotions, text on the web. PAPER SUBMITTED: 24 January 2017 PAPER ACCEPTED: 25 March 2017 Mihailo Škorić miks@tesla.rcub.bg.ac.rs ...Mihailo Škorić. "Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons" in Infotheca, Faculty of Philology, University of Belgrade (2017). https://doi.org/10.18485/infotheca.2017.17.1.4
-
Bilingual lexical extraction based on word alignment for improving corpus search
Jelena Andonovski, Branislava Šandrih, Olivera Kitanović. "Bilingual lexical extraction based on word alignment for improving corpus search" in The Electronic Library, Emerald (2019). https://doi.org/10.1108/EL-03-2019-0056
-
Creation of a Training Dataset for Question-Answering Models in Serbian
The development and application of artificial intelligence in language technologies have advanced significantly in recent years, especially in the domain of the question answering (QA) task. While existing resources for QA tasks have been developed for the major world languages, Serbian has been relatively neglected in this area. This paper presents an initiative to create a large and diverse dataset for training question-answering models in Serbian, which will contribute to the advancement of language technologies for the Serbian language. In addition to numerous studies on language models ...artificial intelligence, natural language processing, language resources, annotated datasets, information extraction, question answeringRanka Stanković, Jovana Rađenović, Maja Ristić, Dragan Stankov. "Creation of a Training Dataset for Question-Answering Models in Serbian" in South Slavic Languages in the Digital Environment JuDig Book of Abstracts, University of Belgrade - Faculty of Philology, Serbia, November 21-23, 2024, University of Belgrade - Faculty of Philology (2024)
-
Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources
Large collections of textual documents represent an example of big data that requires the solution of three basic problems: the representation of documents, the representation of information needs and the matching of the two representations. This paper outlines the introduction of document indexing as a possible solution to document representation. Documents within a large textual database developed for geological projects in the Republic of Serbia for many years were indexed using methods developed within digital humanities: bag-of-words and named ...... selecting and ranking of retrieval results of indexed documents for a specific query and the results were compared with the initial retrieval system that was already in place. In general, a significant improvement has been achieved according to the standard information retrieval performance measures, where ...
... combining information that is relevant to a user query [1]. Large textual databases, that is, large collections of textual documents are an example of big data, which pose three basic problems to Information Retrieval (IR): the representation of document content, the representation of user information needs ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Improving Document Retrieval in Large Domain Specific Textual Databases Using Lexical Resources" in Trans. Computational Collective Intelligence - Lecture Notes in Computer Science 26, Springer (2017). https://doi.org/10.1007/978-3-319-59268-8_8
-
Knowledge and Rule-Based Diacritic Restoration in Serbian
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).... documents in information-analytical systems and natural language processing. These resources are linguistic ontologies uniting some principles of their organization from WordNet, information-retrieval thesauri and formal ontologies. They were utilized in various information-retrieval and NLP applications ...
... express it (NISO, 2005). Contemporary standards for developing of information-retrieval thesauri stress that thesaurus relations are established between concepts, not between terms (Clarke and Zeng, 2012). However, information-retrieval thesauri are not intended for use in automatic processing of texts: ...
... . Each concept has a unique, unambiguous name. In this, RuThes is similar to information-retrieval thesauri and formal ontologies. Rules for inclusion of phrases in the thesaurus are more similar to information-retrieval thesauri guidelines (NISO, 2005). Each concept is linked with words and phrases ...Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51
-
Indexing of textual databases based on lexical resources: A case study for Serbian
In this paper we describe an approach to improvement of information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied on a database of geological projects financed by the Republic of Serbia in the last half century. Each document within this database is described by metadata, consisting of several fields such as title, domain, keywords, abstract, geographical location and the like. A bag of words was produced from these ...... terms of Mean Average Precision measure (MAP). 1 Introduction Three basic problems related to Information Retrieval (IR) are the presentation of document content, the presentation of information needs and the comparison of these two representations. Presentation of documents as a rule contains metadata ...
... org/10.1007/3-540-51465-1 3 3. Hiemstra, D.: Using language models for information retrieval. Taaluitgeverij Neslia Paniculata (2001) 4. Jackson, P., Moulinier, I.: Natural language processing for online applications: Text retrieval, extraction and categorization, vol. 5. John Benjamins Publishing (2007) ...
... rs 2 University of Belgrade, Faculty of Philology, cvetana@matf.bg.ac.rs Abstract. In this paper we describe an approach to improvement of information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied on ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Indexing of textual databases based on lexical resources: A case study for Serbian" in Semantic Keyword-based Search on Structured Data Sources : First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers, Springer (2015). https://doi.org/10.1007/978-3-319-27932-9_15
-
From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action ``Distant Reading for European Literary History'' (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840—1920, built to apply distant reading methods and tools to explore the European literary history. We present the pipeline that led to the production of the linked dataset, the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, ...Milica Ikonić Nešić, Ranka Stanković, Christof Schöch and Mihailo Škorić. "From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)" in Proceedings of The 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
The Nooj System as Module within an Integrated Language Processing Environment
... same for all literals in one synset. Enriching the synsets with information from the morphological dictionary makes the usage of a wordnet in an information retrieval task more efficient. In a number of cases this additional information disambiguates the otherwise homonymous literals. This additional ...
... transformed in some way in order to improve the performance of document retrieval. Namely, a search by a concept instead of a search by a single word form is recognized as a very important new direction in information retrieval and related areas. If query is further combined with ILI, a multilingual ...
... perform the actual retrieval, or import them in NooJ and convert to syntactic grammars in order to perform the same task. Figure 5. The edit view, the hypernym/hyponym and graph view of a synset 3.3. The exchange of information 3.3.1. Usually, the only grammatical information accompanying the ...Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
-
Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction
Velislava Stoykova, Ranka Stanković (2018) Velislava Stoykova, Ranka Stanković. "Using Query Expansion for Cross-Lingual Mathematical Terminology Extraction" in Advances in Intelligent Systems and Computing, Springer International Publishing (2018). https://doi.org/10.1007/978-3-319-91189-2_16
-
GIS Application Improvement with Multilingual Lexical and Terminological Resources
... annotation is the text or graphics on a map that provide substantial information for the map reader. Annotation may identify or describe a specific map entity, provide general information about an area on the map, or supply information about the map itself. In general, the placement of descriptive ...
... play an important role in the query expansion application WS4QE, by enabling a more elaborate query expansion that can significantly improve retrieval performances. The use of transducers is especially important in the case of compounds. For instance, if a query is performed with the compound ...
... AND kvarcna steno AND kvarcnom stenom AND kvarcnoj steni AND kvarcnih stena AND kvarcnim stenama AND kvarcnima stenama thus disabling false retrieval. Due to the abundance of compounds in Serbian, the development of a comprehensive dictionary of Serbian compounds is a tedious task. In the ...Ranka Stanković, Ivan Obradović, Olivera Kitanović. "GIS Application Improvement with Multilingual Lexical and Terminological Resources" in Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2010, Valetta, Malta, May 2010, Valetta, Malta : European Language Resources Association (2010)
-
Terminological and lexical resources used to provide open multilingual educational resources
Open educational resources (OER) within BAEKTEL (Blending Academic and Entrepreneurial Knowledge in Technology enhanced learning) network will be available in different languages, mostly in the languages of Western Balkans, Russian and English. University of Belgrade (UB) hosts a central repository based on: BAEKTEL Metadata Portal (BMP), terminological web application for management, browse and search of terminological resources, web services for linguistic support (query expansion, information retrieval, OER indexing, etc.), annotation of selected resources and OER repository on local edX ...... cal web application for management, browse and search of terminological resources, web services for linguistic support (query expansion, information retrieval, OER indexing, etc.), annotation of selected resources and OER repository on local edX platform. In order to successfully cope with multi ...
... web application for management, browse and search of terminological resources, • web services for linguistic support (query expansion, information retrieval, OER indexing, etc.), • annotation of selected resources, • OER repository on local edX platform. The BAEKTEL language support system consists ...
... extraction, which will be further discussed, can be found in machine translation, automatic indexing, building lexical knowledge bases and information retrieval [12]. Once they are extracted, completed ontologies represent an important education resource. In order for these ontologies to be as efficient ...Biljana Lazić, Danica Seničić, Aleksandra Tomašević, Bojan Zlatić. "Terminological and lexical resources used to provide open multilingual educational resources" in The Seventh International Conference on eLearning (eLearning-2016), 29-30 September 2016, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2016)
-
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... can be applied to the extraction of MWUs belonging to general lexica. Expanding the e-dictionaries will further improve systems for information retrieval, information extraction, query expansion and the like. One useful application can also be the creation of bilingual and multilingual terminological ...
... reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 ...
... solution depicted in Figure 2 is based on web services, thus enabling other applications to use some of them, such as indexing or document information retrieval, for term extraction. The current application is developed and tested within a Windows environment, while a corresponding web application ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23--28 May 2016, European Language Resources Association (2016)
-
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies for its work on several sophisticated textual and lexical resources that were developed for most of Balkan languages. These resources are based on several de facto standards in natural language processing.... equivalent for кораб was almost unmistakably brod, as suggested by both wordnets. Figure 7. A few examples of a partial retrieval Figure 7 shows some examples of a partial retrieval. First (n1616) and third (n2286) segments in this sample occur due to the fact that the reference to a ‘boat’ is missing ...
... more specific type of a vessel referred to in Serbian as kuter ‘cutter’. Figure 8. All occurrences of a full retrieval Figure 8 shows eight examples of the full retrieval. In one of these examples (n1972) for the Serbian čamac the near synonym in Bulgarian корабчето is used (as determined ...
... difference is that expanded queries are not applied to an aligned text but are rather forwarded to the search engine. Figure 10 shows such a retrieval that starts with the Serbian keyword barka ‘boat’ and is further expanded by the Serbian synset {barka:1, čamac:1, čun:1} and Greek corresponding ...Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
-
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić (2022) In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published ...Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, Duško Vitas, Mihailo Škorić, Milica Ikonić Nešić. "Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection" in Proceedings of the Language Resources and Evaluation Conference, June 2022, Marseille, France, European Language Resources Association (2022)
-
Serbian ELTeC Sub-Collection in Wikidata
This paper presents an example of integration of Wikidata with digital libraries and external systems, as well as some best practices for speeding up the process of data preparation and import to Wikidata, on the use case of SrpELTeC, Serbian subcollection of the ELTeC multilingual collection (European Literary Text Collection). After preliminary work on the manual Wikidata population with SrpELTeC novels, the goal was to automate the process of preparing and importing information, so different solutions were analysed and ...Milica Ikonić Nešić, Ranka Stanković, Biljana Rujević. "Serbian ELTeC Sub-Collection in Wikidata" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.2.4
-
Improvement of geodatabase queries within GeolISS
Ranka Stanković (2008) ... (ILI) enables the connection of the same concepts in different languages, a feature that can be used, among others, for cross-language information retrieval. For expansion of queries with proper names WS4LR is using Prolex, a multilingual database of proper names which represents the implementation ...
... objects such as distances, disjoints, intersects, touches, crosses, overlaps, contains, as well as the computation of lengths and areas. Data retrieval from the GeolISS geodatabase is provided on several levels: − Searching on the level of interface forms − Spatial object search using GeolISS ...
... language (presently only English equivalents are in the database). Ranka Stanković 72 For illustration purposes, the query for geological unit retrieval with the term clay in the description field was submitted twice: once without and once with morphological expansion. The word clay, in Serbian ...Ranka Stanković. "Improvement of geodatabase queries within GeolISS" in Review of the National Center for Digitization, Beograd : Faculty of Mathematics, Belgrade (2008)
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ...... syntax and parsing, language modelling and word sense disambiguation, part of speech tagging and information extraction, question answering, text summarization, collocations and information retrieval, sentiment analysis and semantics, discourse, machine translation, regular expressions, language ...
... management, browse and search of terminological resources (work in progress). Web services for linguistic support (query expansion, information retrieval, OER indexing, etc.) (http://hlt.rgf.bg.ac.rs/) Annotation of selected resources 4. UNITEX - OPEN ACCESS, OPEN SOURCE, OPEN E ...
... in other languages. We also plan to add more courses in CL and NLP at the edX BAEKTEL platform (Basics of Theory of Formal Languages, Information Retrieval, etc.). LITERATURE [1] Carlucci, D., et al., A platform for management of academic and entrepreneurial knowledge, in IFKAD 2015 - ...Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)