An Extension of the VSM Documents Representation

Lucian Nicolae Vintan, Daniel Ionel Morariu, Radu George Cretulescu, Maria Vintan

Abstract


In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation and augment each word with its corresponding part of speech. This leads to a new concept, called hyper-vectors, in which each document is represented in a hyper-space where each dimension is a different part-of-speech component. Along each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective and others. Each dimension of the hyper-space has a different weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Several classification experiments are presented as validation cases.
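The idea of a hyper-vector can be illustrated with a minimal Python sketch. The abstract does not state the hyper-cosine formula, so the similarity below is only an assumption: a weighted average of ordinary per-part-of-speech cosine similarities. The function names (`hyper_vector`, `cosine`, `weighted_similarity`), the use of raw term frequencies, and the weight values are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

# The five part-of-speech dimensions named in the abstract.
POS_TAGS = ("noun", "verb", "adverb", "adjective", "other")


def hyper_vector(tagged_tokens):
    """Build one term-frequency (bag-of-words) vector per part of speech.

    tagged_tokens is an iterable of (word, pos) pairs; any tag outside
    POS_TAGS is folded into the "other" dimension.
    """
    vectors = {pos: Counter() for pos in POS_TAGS}
    for word, pos in tagged_tokens:
        key = pos if pos in vectors else "other"
        vectors[key][word.lower()] += 1
    return vectors


def cosine(a, b):
    """Classical cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dimension contributes no similarity
    return dot / (norm_a * norm_b)


def weighted_similarity(doc_a, doc_b, weights):
    """ASSUMED stand-in for the hyper-cosine: a weighted average of the
    per-part-of-speech cosines. The paper's actual formula may differ."""
    total = sum(weights[pos] for pos in POS_TAGS)
    score = sum(weights[pos] * cosine(doc_a[pos], doc_b[pos]) for pos in POS_TAGS)
    return score / total


# Hypothetical weights: nouns count twice as much as the other dimensions.
weights = {"noun": 2.0, "verb": 1.0, "adverb": 1.0, "adjective": 1.0, "other": 1.0}
doc_a = hyper_vector([("Dogs", "noun"), ("bark", "verb")])
doc_b = hyper_vector([("dogs", "noun"), ("run", "verb")])
similarity = weighted_similarity(doc_a, doc_b, weights)
```

In this sketch two documents that share their nouns but not their verbs still receive a non-zero score, and the per-dimension weights control how much each part of speech contributes.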

Keywords


documents representation, vector space model, hyper-vectors, documents similarity, classification, clustering






DOI: http://dx.doi.org/10.15837/ijccc.2017.3.2889



Copyright (c) 2017 Vintan Nicolae Lucian, Morariu Ionel Daniel, Cretulescu George Radu, Vintan Maria



CC-BY-NC  License for Website User

Articles published under the IJCCC user license are protected by copyright.

Users can access, download, copy and translate IJCCC articles for non-commercial purposes, but cannot redistribute, display or adapt them, provided that users:

  • Cite the article using an appropriate bibliographic citation: author(s), article title, journal, volume, issue, page numbers, year of publication, DOI, and the link to the definitive published version on IJCCC website;
  • Maintain the integrity of the IJCCC article;
  • Retain the copyright notices and links to these terms and conditions so it is clear to other users what can and what cannot be done with the  article;
  • Ensure that, for any content in the IJCCC article that is identified as belonging to a third party, any re-use complies with the copyright policies of that third party;
  • Any translations must prominently display the statement: "This is an unofficial translation of an article that appeared in IJCCC. Agora University  has not endorsed this translation."

This is a non-commercial license: the use of published articles for commercial purposes is forbidden.

Commercial purposes include: 

  • Copying or downloading IJCCC articles, or linking to such postings, for further redistribution, sale or licensing, for a fee;
  • Copying, downloading or posting by a site or service that incorporates advertising with such content;
  • The inclusion or incorporation of article content in other works or services (other than normal quotations with an appropriate citation) that is then available for sale or licensing, for a fee;
  • Use of IJCCC articles or article content (other than normal quotations with appropriate citation) by for-profit organizations for promotional purposes, whether for a fee or otherwise;
  • Use for the purposes of monetary reward by means of sale, resale, license, loan, transfer or other form of commercial exploitation;

    The licensor cannot revoke these freedoms as long as you follow the license terms.

[End of CC-BY-NC  License for Website User]


INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL (IJCCC), With Emphasis on the Integration of Three Technologies (C & C & C),  ISSN 1841-9836.
