An Extension of the VSM Documents Representation

Lucian Nicolae Vintan; Daniel Ionel Morariu; Radu George Cretulescu; Maria Vintan

Authors

Lucian Nicolae Vintan "Lucian Blaga" University Sibiu
Daniel Ionel Morariu "Lucian Blaga" University Sibiu
Radu George Cretulescu "Lucian Blaga" University Sibiu
Maria Vintan "Lucian Blaga" University Sibiu

Keywords:

documents representation, vector space model, hyper-vectors, documents similarity, classification, clustering

Abstract

In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation and augment each word with its corresponding part of speech. This leads to a new concept, hyper-vectors: each document is represented in a hyper-space in which each dimension corresponds to a different part of speech, and within each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective and others. Each dimension of the hyper-space has a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. We present several classification experiments as validation cases.
