An Extension of the VSM Documents Representation

  • Lucian Nicolae Vintan "Lucian Blaga" University Sibiu
  • Daniel Ionel Morariu "Lucian Blaga" University Sibiu
  • Radu George Cretulescu "Lucian Blaga" University Sibiu
  • Maria Vintan "Lucian Blaga" University Sibiu

Abstract

In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation and augment each word with its corresponding part of speech. This leads to a new concept, hyper-vectors: each document is represented in a hyper-space in which each dimension corresponds to a different part-of-speech component, and within each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective and others. Each dimension of the hyper-space has its own weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Classification experiments are presented as validation cases.
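The abstract does not state the hyper-cosine formula, so the following Python sketch shows only one plausible reading of the idea: build a separate term-frequency vector for each of the five part-of-speech dimensions, compute an ordinary cosine similarity within each dimension, and combine the results as a weighted average. The function names, the weighted-average combination rule and the example weights are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

# The five part-of-speech dimensions named in the abstract.
POS_DIMENSIONS = ("noun", "verb", "adverb", "adjective", "other")


def build_hyper_vector(tagged_tokens):
    """Group (word, pos) pairs into one term-frequency vector per dimension.

    Any tag outside the first four categories falls into "other".
    """
    hyper = {pos: Counter() for pos in POS_DIMENSIONS}
    for word, pos in tagged_tokens:
        key = pos if pos in hyper else "other"
        hyper[key][word.lower()] += 1
    return hyper


def cosine(a, b):
    """Standard VSM cosine similarity between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dimension contributes no similarity
    return dot / (norm_a * norm_b)


def weighted_similarity(h1, h2, weights):
    """ASSUMPTION: combine per-dimension cosines as a weighted average.

    This stands in for the paper's hyper-cosine, whose exact form is
    not given in the abstract.
    """
    total = sum(weights[pos] for pos in POS_DIMENSIONS)
    score = sum(weights[pos] * cosine(h1[pos], h2[pos]) for pos in POS_DIMENSIONS)
    return score / total


# Hypothetical weights and toy documents, for illustration only.
example_weights = {"noun": 2.0, "verb": 1.0, "adverb": 1.0, "adjective": 1.0, "other": 1.0}
doc_a = build_hyper_vector([("market", "noun"), ("rises", "verb")])
doc_b = build_hyper_vector([("market", "noun"), ("falls", "verb")])
similarity = weighted_similarity(doc_a, doc_b, example_weights)
```

In this toy case the two documents agree only in the noun dimension, so the score depends entirely on how heavily nouns are weighted relative to the other dimensions, which is the effect the per-dimension weights are meant to capture.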

Published
2017-04-23
How to Cite
VINTAN, Lucian Nicolae et al. An Extension of the VSM Documents Representation. INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, [S.l.], v. 12, n. 3, p. 402-414, apr. 2017. ISSN 1841-9844. Available at: <http://univagora.ro/jour/index.php/ijccc/article/view/2889>. Date accessed: 27 sep. 2020. doi: https://doi.org/10.15837/ijccc.2017.3.2889.

Keywords

documents representation, vector space model, hyper-vectors, documents similarity, classification, clustering