A Latent Dirichlet Allocation and Fuzzy Clustering Based Machine Learning Model for Text Thesaurus


  • Jia Luo
  • Dongwen Yu NEWHUADU Business School Minjiang, Fujian, China
  • Zong Dai Hunan Zhaoshan Investment & Holdings Co. Ltd. Xiangtan Hunan, China


text, LDA, fuzzy clustering, thesaurus, Word2vec, machine learning


Manual methods cannot practically process the huge volume of structured and semi-structured data now available. This study aims to solve the problem of processing such data with machine learning algorithms. We collected public opinion text data about the company through crawlers, used the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and applied fuzzy clustering to group the keywords into topics. The topic keywords then serve as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-Gram, PMI, and Word2vec were compared on the new word discovery task. The experimental results show that the Word2vec algorithm, which is based on a machine learning model, achieves the highest accuracy, recall, and F-value.
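As a rough illustration of one of the compared baselines, the sketch below scores adjacent token pairs by PMI (pointwise mutual information, the usual expansion of the abbreviation) so that pairs which co-occur more often than chance rank as new word candidates. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pmi_scores`, the `min_count` threshold, and the toy corpus are all hypothetical, and the abstract does not say how the study tokenized text or set thresholds.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent token pairs by pointwise mutual information.

    `sentences` is a list of token lists. Pairs seen fewer than
    `min_count` times are skipped (an assumed threshold, chosen only
    to keep rare pairs from dominating the ranking).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi          # joint probability of the pair
        p_a = unigrams[a] / n_uni    # marginal probability of each token
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus invented for illustration; it is not the study's data.
corpus = [
    ["share", "price", "fell", "after", "the", "report"],
    ["the", "share", "price", "rose", "after", "the", "news"],
    ["analysts", "expect", "the", "share", "price", "to", "recover"],
]

scores = pmi_scores(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
for pair in ranked:
    print(pair, round(scores[pair], 3))
```

On this toy corpus the pair ("share", "price") ranks first, which is the kind of multi-token candidate a new word discovery step would pass on for review. In a pipeline like the one described, the candidates might additionally be filtered against the seed dictionary of topic keywords, though how the study combined these steps is not stated in the abstract.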




