An Application of Latent Semantic Analysis for Text Categorization

  • Gang Kou, School of Business Administration, Southwestern University of Finance and Economics, No. 555, Liutai Ave, Wenjiang Zone, Chengdu, 611130, China
  • Yi Peng, School of Management and Economics, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu, 611731, China

Abstract

Discovering the major topics in a body of text is a challenging task. It provides a better understanding of the whole corpus and can be regarded as a text categorization problem. The goal of this paper is to apply the latent semantic analysis (LSA) approach to extract common factors that represent concepts hidden in a large group of texts. LSA involves three steps: the first step is to set up a term-document matrix; the second step is to transform the term frequencies into a weighted term-document matrix using various weighting schemes; the third step performs singular value decomposition (SVD) on the weighted matrix to reduce the dimensionality. The reduced-order SVD is the best rank-k approximation to the original matrix in the least-squares sense. The experiment uses more than fifteen hundred research paper abstracts from a specific field. Because different factor solutions of the LSA suggest different levels of aggregation, this work examines thirteen solutions in the experiment. The results show that LSA is able to identify not only principal categories, but also major themes contained in the text.
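
The three LSA steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the whitespace tokenizer, the choice of TF-IDF as the weighting scheme, the value k = 2, and the helper names `term_document_matrix`, `tfidf_weight`, and `truncated_svd` are all assumptions made for illustration. Only NumPy's `numpy.linalg.svd` is used for the decomposition.

```python
import numpy as np


def term_document_matrix(docs):
    """Step 1: raw term counts, one row per term and one column per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[index[w], j] += 1
    return vocab, counts


def tfidf_weight(counts):
    """Step 2: one possible weighting scheme (TF-IDF, assumed here)."""
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)  # at least 1 for every term
    idf = np.log(n_docs / doc_freq)
    return counts * idf[:, None]


def truncated_svd(matrix, k):
    """Step 3: SVD, keeping only the k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :], s


# Hypothetical toy corpus; the paper uses research paper abstracts instead.
docs = [
    "decision making under multiple criteria",
    "multiple criteria decision analysis",
    "neural network for text mining",
    "text mining with latent semantic analysis",
]

vocab, counts = term_document_matrix(docs)
weighted = tfidf_weight(counts)

k = 2  # number of factors (assumed for this sketch)
u_k, s_k, vt_k, s_all = truncated_svd(weighted, k)

# Rank-k reconstruction of the weighted matrix.
approx = u_k @ np.diag(s_k) @ vt_k

# Approximation error in the Frobenius norm, and the value predicted
# from the discarded singular values.
error = np.linalg.norm(weighted - approx)
predicted = np.sqrt(np.sum(s_all[k:] ** 2))
```

In this sketch the columns of `vt_k` give each document's coordinates in the k-factor space, and the rows of `u_k` do the same for terms. The two quantities `error` and `predicted` should agree up to rounding, which is the sense in which the reduced-order SVD is the best k-dimensional approximation.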

Published
2015-06-01
How to Cite
KOU, Gang; PENG, Yi. An Application of Latent Semantic Analysis for Text Categorization. INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, [S.l.], v. 10, n. 3, p. 357-369, june 2015. ISSN 1841-9844. Available at: <http://univagora.ro/jour/index.php/ijccc/article/view/1923>. Date accessed: 13 july 2020. doi: https://doi.org/10.15837/ijccc.2015.3.1923.

Keywords

Latent Semantic Analysis, Topic extraction, Text Mining, Information Retrieval