An Application of Latent Semantic Analysis for Text Categorization

Authors

  • Gang Kou, School of Business Administration, Southwestern University of Finance and Economics, No. 555, Liutai Ave, Wenjiang Zone, Chengdu, 611130, China
  • Yi Peng, School of Management and Economics, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu, 611731, China

Keywords:

Latent Semantic Analysis, Topic extraction, Text Mining, Information Retrieval

Abstract

Discovering the major topics in a body of text is a challenging task. The topics provide a better understanding of the whole corpus, and finding them can be regarded as a text categorization problem. The goal of this paper is to apply the latent semantic analysis (LSA) approach to extract common factors that represent concepts hidden in a large group of texts. LSA involves three steps: the first sets up a term-document matrix; the second transforms the raw term frequencies in that matrix using various weighting schemes; the third performs singular value decomposition (SVD) on the weighted matrix to reduce its dimensionality. The reduced-order SVD is the best rank-k (k-dimensional) approximation to the original matrix in the least-squares sense. The experiment uses more than fifteen hundred research paper abstracts from a specific field. Because different factor solutions of the LSA suggest different levels of aggregation, this work examines thirteen solutions in the experiment. The results show that LSA is able to identify not only principal categories, but also major themes contained in the text.
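
The three LSA steps named in the abstract can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the four-sentence toy corpus, the whitespace tokenizer, the choice of TF-IDF as the weighting scheme (the abstract mentions "various" schemes without naming one here), the value k = 2, and all function names are hypothetical.

```python
import numpy as np


def term_document_matrix(docs):
    """Step 1: raw term counts, one row per term and one column per document."""
    tokenized = [d.lower().split()for d in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    row = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(tokenized):
        for w in tokens:
            counts[row[w], j] += 1
    return vocab, counts


def tfidf_weight(counts):
    """Step 2: reweight the counts. TF-IDF is an assumed choice of scheme."""
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)  # documents containing each term
    idf = np.log(n_docs / doc_freq) + 1.0
    return counts * idf[:, None]


def truncated_svd(matrix, k):
    """Step 3: keep only the k largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


# Hypothetical toy corpus standing in for the paper abstracts.
docs = [
    "fuzzy decision making with multiple criteria",
    "multiple criteria decision support systems",
    "neural network models for text retrieval",
    "text retrieval with latent semantic models",
]

vocab, counts = term_document_matrix(docs)
weighted = tfidf_weight(counts)
u, s, vt = truncated_svd(weighted, k=2)

# Rank-k reconstruction of the weighted matrix.
approx = u @ np.diag(s) @ vt

# Each column of doc_coords places one document in the k-dimensional factor space.
doc_coords = np.diag(s) @ vt
```

Here the rows of `u` give term loadings on each factor and the columns of `doc_coords` give document coordinates; changing `k` corresponds to examining a different factor solution, in the sense the abstract describes.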

Published

2015-06-01
