An Extension of the VSM Documents Representation
Keywords:
documents representation, vector space model, hyper-vectors, documents similarity, classification, clustering
Abstract
In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation and augment each word with its corresponding part of speech. This leads to a new concept called hyper-vectors: each document is represented in a hyper-space in which each dimension corresponds to a different part of speech, and within each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective and others. Each dimension of the hyper-space has a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Some classification experiments are presented as validation cases.
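The idea of a hyper-vector and a weighted similarity over its dimensions can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not state the hyper-cosine formula, so the weighted average of per-dimension cosines used here is an assumption, as are the function names (`to_hyper_vector`, `cosine`, `hyper_cosine`), the tag labels, the raw term-frequency weighting, and the choice to skip dimensions that are empty in both documents.

```python
import math
from collections import Counter

# The five part-of-speech categories named in the abstract (labels are illustrative).
POS_TAGS = ("noun", "verb", "adverb", "adjective", "other")


def to_hyper_vector(tagged_tokens):
    """Group (word, pos) pairs into one bag-of-words Counter per part of speech."""
    hyper = {pos: Counter() for pos in POS_TAGS}
    for word, pos in tagged_tokens:
        key = pos if pos in hyper else "other"  # unknown tags fall into "other"
        hyper[key][word.lower()] += 1
    return hyper


def cosine(a, b):
    """Classical VSM cosine between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hyper_cosine(h1, h2, weights):
    """ASSUMED form: weighted average of per-dimension cosines.

    Dimensions that are empty in both documents are ignored, so that two
    identical documents still score 1.0. The published formula may differ.
    """
    active = [pos for pos in POS_TAGS if h1[pos] or h2[pos]]
    total = sum(weights[pos] for pos in active)
    if total == 0:
        return 0.0
    return sum(weights[pos] * cosine(h1[pos], h2[pos]) for pos in active) / total


# Hypothetical weights: nouns count twice as much as every other dimension.
weights = {"noun": 2.0, "verb": 1.0, "adverb": 1.0, "adjective": 1.0, "other": 1.0}
doc_a = to_hyper_vector([("dog", "noun"), ("runs", "verb")])
doc_b = to_hyper_vector([("dog", "noun"), ("sleeps", "verb")])
similarity = hyper_cosine(doc_a, doc_b, weights)
```

In this sketch the two documents share their only noun but not their verb, so the noun weight determines how much of the similarity survives; raising or lowering a dimension's weight changes how strongly that part of speech influences the comparison.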
License
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.