An Extension of the VSM Documents Representation

Authors

  • Lucian Nicolae Vintan "Lucian Blaga" University Sibiu
  • Daniel Ionel Morariu "Lucian Blaga" University Sibiu
  • Radu George Cretulescu "Lucian Blaga" University Sibiu
  • Maria Vintan "Lucian Blaga" University Sibiu

Keywords:

document representation, vector space model, hyper-vectors, document similarity, classification, clustering

Abstract

In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation and augment each word with its corresponding part of speech. This leads to a new concept, hyper-vectors, in which each document is represented in a hyper-space whose dimensions are the different part-of-speech components. Along each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective and others. Each dimension of the hyper-space has a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Classification experiments are presented as validation cases.
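The idea of one VSM vector per part of speech, combined through per-dimension weights, can be illustrated with a minimal sketch. The abstract does not state the hyper-cosine formula, so the combination used here (a weighted average of per-dimension cosine similarities) is only an assumption made for illustration, not the authors' formula. The names `build_hyper_vector`, `cosine`, and `weighted_similarity`, the raw term-frequency weighting, and the example weights are likewise hypothetical.

```python
import math
from collections import Counter

# The five part-of-speech dimensions named in the abstract.
POS_DIMENSIONS = ("noun", "verb", "adverb", "adjective", "other")


def build_hyper_vector(tagged_tokens):
    """Group (word, pos) pairs into one term-frequency vector per dimension.

    Any tag outside the first four categories is counted under "other".
    Raw term frequency is an assumption; the paper may weight terms differently.
    """
    hyper = {pos: Counter() for pos in POS_DIMENSIONS}
    for word, pos in tagged_tokens:
        key = pos if pos in hyper else "other"
        hyper[key][word.lower()] += 1
    return hyper


def cosine(a, b):
    """Classical VSM cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dimension contributes no similarity
    return dot / (norm_a * norm_b)


def weighted_similarity(h1, h2, weights):
    """ASSUMED combination: weighted average of per-dimension cosines.

    This stands in for the hyper-cosine formula, which the abstract
    mentions but does not define.
    """
    total = sum(weights.values())
    score = sum(weights[pos] * cosine(h1[pos], h2[pos]) for pos in POS_DIMENSIONS)
    return score / total


if __name__ == "__main__":
    doc_a = build_hyper_vector(
        [("Market", "noun"), ("rises", "verb"), ("quickly", "adverb")]
    )
    doc_b = build_hyper_vector(
        [("market", "noun"), ("falls", "verb"), ("the", "determiner")]
    )
    # Hypothetical weights that favour nouns over the other dimensions.
    weights = {"noun": 2.0, "verb": 1.0, "adverb": 1.0, "adjective": 1.0, "other": 1.0}
    print(weighted_similarity(doc_a, doc_b, weights))
```

In this sketch two documents that share nouns but differ in verbs still score above zero, and raising the noun weight raises that score. A dimension that is empty in either document contributes zero, so the result depends on how the weights are spread across dimensions.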

Published

2017-04-23
