An Extension of the VSM Documents Representation

Lucian Nicolae Vintan, Daniel Ionel Morariu, Radu George Cretulescu, Maria Vintan

Abstract


In this paper we present a new approach to document representation for use in classification and/or clustering algorithms. We start from the classical "bag-of-words" representation but augment each word with its corresponding part of speech. We thus introduce a new concept, called hyper-vectors, in which each document is represented in a hyper-space where each dimension corresponds to a different part-of-speech component. Along each dimension the document is represented using the Vector Space Model (VSM). In this work we use only five part-of-speech categories: noun, verb, adverb, adjective, and others. Each dimension of the hyper-space has its own weight. To compute the similarity between two documents we have developed a new hyper-cosine formula. Several classification experiments are presented as validation cases.
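The abstract does not state the hyper-cosine formula itself, so the following Python sketch is only an illustration of the general idea under explicit assumptions: each document is stored as one term-frequency vector per part-of-speech category, and the similarity is taken to be a weighted average of the per-category cosine similarities. The function names (`build_hyper_vector`, `cosine`, `hyper_cosine`), the raw term-frequency weighting, and the weighted-average form are all hypothetical choices made here and may differ from the formula developed in the paper.

```python
import math
from collections import Counter

# The five part-of-speech categories named in the abstract.
POS_TAGS = ("noun", "verb", "adverb", "adjective", "other")


def build_hyper_vector(tagged_tokens):
    """Build one bag-of-words (term-frequency) vector per POS category.

    tagged_tokens is an iterable of (word, pos) pairs. Any tag outside
    the first four categories is counted under "other" (an assumption).
    """
    hyper_vector = {pos: Counter() for pos in POS_TAGS}
    for word, pos in tagged_tokens:
        key = pos if pos in hyper_vector else "other"
        hyper_vector[key][word.lower()] += 1
    return hyper_vector


def cosine(a, b):
    """Classical cosine similarity between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dimension contributes no similarity
    return dot / (norm_a * norm_b)


def hyper_cosine(hv1, hv2, weights):
    """Assumed form: weighted average of the per-POS cosine similarities."""
    total_weight = sum(weights[pos] for pos in POS_TAGS)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        weights[pos] * cosine(hv1[pos], hv2[pos]) for pos in POS_TAGS
    )
    return weighted_sum / total_weight


if __name__ == "__main__":
    doc_a = [("dog", "noun"), ("runs", "verb"), ("fast", "adverb"),
             ("brown", "adjective"), ("the", "other")]
    doc_b = [("dog", "noun"), ("sleeps", "verb")]
    # Illustrative weights only; the paper's weights are not given here.
    weights = {"noun": 2.0, "verb": 1.0, "adverb": 1.0,
               "adjective": 1.0, "other": 1.0}
    print(hyper_cosine(build_hyper_vector(doc_a),
                       build_hyper_vector(doc_b), weights))
```

In this sketch, giving nouns a larger weight makes shared nouns count more towards the overall similarity than shared words in the other categories, which is one plausible reading of "each dimension has its own weight".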

Keywords


document representation, vector space model, hyper-vectors, document similarity, classification, clustering


DOI: http://dx.doi.org/10.15837/ijccc.2017.3.2889





INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL (IJCCC), With Emphasis on the Integration of Three Technologies (C & C & C),  ISSN 1841-9836.
