Data Dimensionality Reduction for Data Mining: A Combined Filter-Wrapper Framework
Keywords: data mining, feature selection, filters, wrappers

Abstract
Knowledge Discovery in Databases aims to extract new, interesting, and potentially useful patterns from large amounts of data. It is a complex process whose central step is data mining, which builds models from data. Data type, quality, and dimensionality are some of the factors that affect the performance of a data mining task. Since high-dimensional data can cause problems such as data overload, a possible solution is to reduce its dimensionality. Sampling and filtering reduce the number of cases in a dataset, whereas the number of features can be reduced by feature selection. This paper presents a combined method for feature selection: a correlation-based filter is first applied to the whole feature set to find the relevant features, and a wrapper is then applied to these features to find the best feature subset for a specified predictor. A case study is also presented for a data set provided by TERAPERS, a personalized speech therapy system.
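The combined approach described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact implementation: the relevance threshold, the 1-nearest-neighbour predictor, and the exhaustive subset search are all assumptions chosen to keep the example self-contained.

```python
# Hedged sketch of a combined filter-wrapper feature selector:
# (1) a correlation-based filter keeps features relevant to the class,
# (2) a wrapper searches subsets of the surviving features for the one
#     that maximizes a predictor's accuracy.
# The threshold, predictor, and search strategy are illustrative assumptions.
import numpy as np
from itertools import combinations

def correlation_filter(X, y, threshold=0.2):
    """Filter step: keep features whose absolute Pearson correlation
    with the class label exceeds a relevance threshold."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return [j for j, s in enumerate(scores) if s > threshold]

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour predictor
    restricted to the given feature subset (the wrapped predictor)."""
    correct = 0
    for i in range(len(y)):
        d = np.sum((X[:, features] - X[i, features]) ** 2, axis=1)
        d[i] = np.inf  # exclude the held-out case itself
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)

def wrapper_search(X, y, candidates):
    """Wrapper step: exhaustively evaluate subsets of the filtered
    features and keep the subset with the best predictor accuracy."""
    best, best_acc = [], 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            acc = loo_accuracy(X, y, list(subset))
            if acc > best_acc:
                best, best_acc = list(subset), acc
    return best, best_acc

# Usage on synthetic data: feature 0 carries the class signal,
# features 1 and 2 are noise.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = np.column_stack([y + 0.1 * rng.standard_normal(20),
                     rng.standard_normal(20),
                     rng.standard_normal(20)])
relevant = correlation_filter(X, y)
selected, acc = wrapper_search(X, y, relevant)
```

Exhaustive search is tractable here only because the filter has already discarded most features; on the full feature set a greedy forward or backward search is the usual wrapper strategy.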
License
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free of charge under the Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.