Enhanced Dark Block Extraction Method Performed Automatically to Determine the Number of Clusters in Unlabeled Data Sets
Keywords: Enhanced DBE, automatic clustering, cluster tendency, visual assessment, reordered dissimilarity image.
Abstract: One of the major issues in cluster analysis is deciding the number of clusters (groups) in a set of unlabeled data. In addition, the resulting clusters should be examined to assess how accurately the objects are grouped. This paper proposes a new method called Enhanced Dark Block Extraction (E-DBE), which automatically identifies the number of object groups in unlabeled datasets. The proposed algorithm builds on an existing algorithm for visual assessment of the cluster tendency of a dataset and uses several common signal and image processing techniques. The method includes the following steps: 1. Generating an Enhanced Visual Assessment of Tendency (E-VAT) image from a dissimilarity matrix, which is the input to the E-DBE algorithm. 2. Segmenting the E-VAT image to obtain a binary image and then applying filtering techniques. 3. Applying a distance transform to the filtered binary image and projecting the pixels onto the main diagonal axis of the image to form a projection signal. 4. Smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to obtain the number of clusters. E-DBE is a parameter-free algorithm for cluster analysis. Experimental results are presented on several UCI, synthetic, and real-world datasets.
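To make the idea of dark blocks along the diagonal concrete, the following is a minimal sketch in Python and not the E-DBE algorithm itself. It assumes the standard VAT reordering (a Prim-like ordering of the dissimilarity matrix) in place of E-VAT, an Otsu-style threshold computed directly on the dissimilarity values in place of image segmentation, and a simple scan along the diagonal as a stand-in for the filtering, distance transform, projection, and peak detection steps. All function names and the toy data are hypothetical.

```python
import math

def pairwise_distances(points):
    """Euclidean dissimilarity matrix for a list of coordinate tuples."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return [[dist(p, q) for q in points] for p in points]

def vat_order(D):
    """Standard VAT ordering (assumption: plain VAT, not the enhanced E-VAT)."""
    n = len(D)
    # Start from one end of the most dissimilar pair of objects.
    start, _ = max(((i, j) for i in range(n) for j in range(n)),
                   key=lambda ij: D[ij[0]][ij[1]])
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Append the unselected object closest to any already selected object.
        nxt = min(remaining, key=lambda j: min(D[s][j] for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def reorder(D, order):
    """Permute rows and columns of D according to the given ordering."""
    return [[D[a][b] for b in order] for a in order]

def otsu_threshold(values):
    """Threshold maximizing between-class variance (exhaustive search)."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    best_t, best_var, cum = vals[0], -1.0, 0.0
    for k in range(1, n):
        cum += vals[k - 1]
        if vals[k] == vals[k - 1]:
            continue  # no valid cut between equal values
        w0 = k / n
        m0, m1 = cum / k, (total - cum) / (n - k)
        var = w0 * (1 - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, (vals[k - 1] + vals[k]) / 2
    return best_t

def count_dark_blocks(R):
    """Simplified stand-in for steps 2-4: count diagonal blocks of low dissimilarity."""
    n = len(R)
    if n < 2:
        return n
    t = otsu_threshold([R[i][j] for i in range(n) for j in range(i + 1, n)])
    # A new block starts wherever two consecutive reordered objects are not "dark".
    return 1 + sum(1 for i in range(1, n) if R[i][i - 1] > t)

# Toy data (hypothetical): three compact, well-separated groups, interleaved.
points = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11),
          (20, 1), (1, 0), (11, 10), (21, 0)]
D = pairwise_distances(points)
order = vat_order(D)
R = reorder(D, order)
k = count_dark_blocks(R)
print(k)  # 3 for this toy data
```

The diagonal scan only works when clusters are compact and well separated, because consecutive objects in a VAT ordering need not be close to each other inside an elongated cluster. That limitation is one reason a method of this kind processes the whole reordered image rather than a single off-diagonal line.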
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.