Data Dimensionality Reduction for Data Mining: A Combined Filter-Wrapper Framework

Authors

  • Mirela Danubianu, "Stefan cel Mare" University of Suceava
  • Stefan Gheorghe Pentiuc, "Stefan cel Mare" University of Suceava
  • Dragos Mircea Danubianu, "Stefan cel Mare" University of Suceava

Keywords

data mining, feature selection, filters, wrappers

Abstract

Knowledge Discovery in Databases aims to extract new, interesting and potentially useful patterns from large amounts of data. It is a complex process whose central step is data mining, which effectively builds models from data. Data type, quality and dimensionality are some of the factors that affect the performance of a data mining task. Since the high dimensionality of data can cause problems such as data overload, a possible solution is to reduce it. Sampling and filtering reduce the number of cases in a dataset, whereas the number of features can be reduced by feature selection. This paper presents a combined method for feature selection: a correlation-based filter is first applied to the whole feature set to find the relevant features, and a wrapper is then applied to these features in order to find the best feature subset for a specified predictor. A case study is also presented for a data set provided by TERAPERS, a personalized speech therapy system.
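The filter-then-wrapper idea from the abstract can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' WEKA-based pipeline: the filter stage ranks features by absolute Pearson correlation with the class (a stand-in for Hall's correlation-based feature selection), and the wrapper stage greedily adds filtered features as long as the holdout accuracy of a simple nearest-centroid predictor improves. All function names and the choice of predictor are illustrative assumptions.

```python
import numpy as np

def correlation_filter(X, y, keep=5):
    """Filter stage (simplified): keep the `keep` features whose absolute
    Pearson correlation with the class labels y is highest."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranked = np.argsort(corrs)[::-1]          # indices, most correlated first
    return sorted(ranked[:keep].tolist())

def holdout_accuracy(X, y, features, split=0.7):
    """Accuracy of a nearest-centroid predictor (the 'specified predictor'
    of the wrapper) on a simple holdout split, using only `features`."""
    cut = int(len(y) * split)
    Xtr, ytr = X[:cut, features], y[:cut]
    Xte, yte = X[cut:, features], y[cut:]
    centroids = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
    preds = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
             for x in Xte]
    return float(np.mean(np.array(preds) == yte))

def wrapper_select(X, y, candidates):
    """Wrapper stage: greedy forward selection over the filtered candidates,
    adding a feature only while the predictor's holdout accuracy improves."""
    selected, best, improved = [], 0.0, True
    while improved:
        improved = False
        for f in candidates:
            if f in selected:
                continue
            acc = holdout_accuracy(X, y, selected + [f])
            if acc > best:
                best, selected, improved = acc, selected + [f], True
    return selected, best
```

Because the wrapper only searches the candidates that survived the filter, the expensive predictor is retrained on far fewer subsets than a wrapper run over the full feature set would require, which is the point of combining the two approaches.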

Author Biography

Mirela Danubianu, "Stefan cel Mare" University of Suceava

Department of Mathematics and Computer Science

References

Danubianu M., Pentiuc S.G., Tobolcea I., Schipor O.A., Advanced Information Technology - Support of Improved Personalized Therapy of Speech Disorders, INT J COMPUT COMMUN, ISSN 1841-9836, 5(5): 684-692, 2010.

Kohavi R., John G., Wrappers for feature subset selection, Artificial Intelligence, Special issue on relevance, 97(1-2):273-324, 1997.

Hall, M., Correlation-based feature selection for discrete and numeric class machine learning, Proc. of International Conference on Machine Learning, 359-365, Morgan Kaufmann, 2000.

Douik A., Abdellaoui M., Cereal Grain Classification by Optimal Features and Intelligent Classifiers, INT J COMPUT COMMUN, ISSN 1841-9836, 5(4):506-516, 2010.

Peng H., Long F., Ding C., Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance and Min-Redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005. http://dx.doi.org/10.1109/TPAMI.2005.159

John G.H., Kohavi R., Pfleger K., Irrelevant features and the subset selection problem, Machine Learning: Proceedings of the Eleventh International Conference, 121-129, Morgan Kaufmann, 1994.

Gennari J.H., Langley P., Fisher D., Models of incremental concept formation, Artificial Intelligence, (40):11-16, 1989. http://dx.doi.org/10.1016/0004-3702(89)90046-5

Quinlan J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

Yu L., Liu H., Efficient Feature Selection via Analysis of Relevance and Redundancy, Journal of Machine Learning Research, 5:1205-1224, 2004.

Danubianu M., Pentiuc St. Gh., Socaciu T., Towards the Optimized Personalized Therapy of Speech Disorders by Data Mining Techniques, The Fourth International Multi-Conference on Computing in the Global Information Technology ICCGI 2009, 23-29 August, Cannes - La Bocca, France, 2009.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I., The WEKA Data Mining Software: An Update, SIGKDD Explorations, 11(1):10-18, 2009. http://dx.doi.org/10.1145/1656274.1656278

Published

2014-09-13
