Romanian Language Technology — a view from an academic perspective

Dan Tufiș

doi:10.15837/ijccc.2022.1.4641

Authors

Dan Tufiș, Research Institute for Artificial Intelligence, Romanian Academy

Keywords:

Romanian language resources and technologies, Meta-Net White papers, European Language Equality

Abstract

The article reports on the research and development carried out by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy to narrow the gaps identified by the in-depth analysis of European languages presented in the Meta-Net white papers, published by Springer in 2012. With the exception of English, all European languages needed significant research and development to reach an adequate technological level, in line with the expectations and requirements of the knowledge society.

Author Biography

Dan Tufiș, Research Institute for Artificial Intelligence, Romanian Academy

Member of the Romanian Academy, Professor, Ph.D., Senior Scientist grade I.

1992 Ph.D. in Computer Science, Polytechnic Institute of Bucharest, Romania
1991 Postgraduate studies in Computational Linguistics, Linguistic Institute, University of California at Santa Cruz, USA
1979 B.Sc. and M.Sc. in Computer Science, Department of Computer Science, Polytechnic Institute of Bucharest, Romania
