Heterogeneous Data Clustering Considering Multiple User-provided Constraints

Yue Huang

Abstract


Clustering on heterogeneous networks, which consist of multi-typed objects and links, has proved to be a useful technique in many scenarios. Although numerous clustering methods have achieved remarkable success, current clustering methods for heterogeneous networks tend to consider only the internal information of the dataset. To utilize background domain knowledge, we propose a general framework for clustering heterogeneous data that considers multiple user-provided constraints. Specifically, we identify three types of manual constraints on objects that can be used to guide the clustering process. We then propose the User-HeteClus algorithm to solve the key issues in the case of star-structured heterogeneous data; it incorporates the user constraints into the similarity measurement between central objects. Experiments on a real-world dataset show the effectiveness of the proposed algorithm.
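The idea of folding user constraints into the similarity between central objects can be illustrated with a small sketch. It is not the User-HeteClus algorithm: the abstract does not specify the three constraint types or the similarity measure, so the sketch assumes pairwise must-link and cannot-link constraints, cosine similarity averaged over attribute types, and a simple override rule. The function names `cosine` and `constrained_similarity` and the toy data are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def constrained_similarity(links, must_link=(), cannot_link=()):
    """Similarity matrix over central objects in star-structured data.

    links: dict mapping an attribute type to a list of link vectors,
           one vector per central object (assumed representation).
    must_link / cannot_link: pairs of central-object indices supplied
           by the user (assumed constraint form, not taken from the paper).
    """
    n = len(next(iter(links.values())))
    sim = [[0.0] * n for _ in range(n)]
    # Base similarity: average the per-type cosine similarities.
    for i in range(n):
        for j in range(n):
            total = sum(cosine(vecs[i], vecs[j]) for vecs in links.values())
            sim[i][j] = total / len(links)
    # Assumed override rule: constraints replace the data-driven value.
    for i, j in must_link:
        sim[i][j] = sim[j][i] = 1.0
    for i, j in cannot_link:
        sim[i][j] = sim[j][i] = 0.0
    return sim


# Toy example: three papers (central objects) linked to authors and terms.
links = {
    "author": [[1, 0], [1, 0], [0, 1]],
    "term": [[1, 1, 0], [0, 1, 1], [0, 1, 1]],
}
sim = constrained_similarity(links, must_link=[(0, 1)], cannot_link=[(1, 2)])
```

The resulting matrix could then be passed to any similarity-based clustering method. Overriding entries is the bluntest way to apply a constraint; a softer variant would add or subtract a penalty instead.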

Keywords


clustering, heterogeneous networks, relational data, multi-typed objects, user constraints






DOI: https://doi.org/10.15837/ijccc.2019.2.3419



Copyright (c) 2019 Yue Huang

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL (IJCCC), With Emphasis on the Integration of Three Technologies (C & C & C),  ISSN 1841-9836.
