Performing MapReduce on Data Centers with Hierarchical Structures

Authors

  • Zeliu Ding, Key Lab of Information System Engineering, School of Information Systems and Management, National University of Defense Technology, Changsha 410073, China
  • Deke Guo, Key Lab of Information System Engineering, School of Information Systems and Management, National University of Defense Technology, Changsha 410073, China
  • Xueshan Luo, Key Lab of Information System Engineering, School of Information Systems and Management, National University of Defense Technology, Changsha 410073, China
  • Xi Chen, School of Computer Science, McGill University, Montreal H3A 2A7, Canada

Keywords:

MapReduce, Data Center, distributed hash table (DHT)

Abstract

Data centers are distributed information systems built for massive data storage and processing. The structure of a data center determines how its servers, links and switches are interconnected. Several hierarchical structures have been proposed to improve the topological performance of data centers. By using recursively defined topologies, these structures support general applications and services with high scalability and reliability. However, they ignore the details of specific applications that run on data centers, such as MapReduce, a well-known distributed data processing application. The communication and control mechanisms used to perform MapReduce on the traditional structure cannot be employed on the hierarchical structures. In this paper, we propose a methodology for performing MapReduce on data centers with hierarchical structures. Our methodology is based on the distributed hash table (DHT), an efficient approach to data retrieval in distributed systems. We exploit the advantages of DHT, including decentralization, fault tolerance and scalability, to address the main problems that hierarchical data centers face in supporting MapReduce. Comprehensive evaluation demonstrates the feasibility and excellent performance of our methodology.
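As a rough illustration of why a DHT offers decentralization and fault tolerance, the sketch below uses consistent hashing, a common building block of DHTs, to assign keys to servers. It is not the methodology of the paper: the `HashRing` class, the server names, the key names and the choice of SHA-1 are all assumptions made for this example. It shows that when one server leaves the ring, only the keys that server held are reassigned.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a point on the hash ring (SHA-1 is an assumption)."""
    return int(hashlib.sha1(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring; not taken from the paper."""

    def __init__(self, servers):
        # Each server owns the arc of the ring that ends at its own position.
        self._points = sorted((ring_position(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        """Return the first server clockwise from the key's position."""
        positions = [p for p, _ in self._points]
        i = bisect.bisect_right(positions, ring_position(key)) % len(self._points)
        return self._points[i][1]

    def remove(self, server: str) -> None:
        """Drop a server, for example after a failure."""
        self._points = [(p, s) for p, s in self._points if s != server]


# Illustrative names only.
servers = [f"server-{i}" for i in range(8)]
keys = [f"intermediate-key-{i}" for i in range(1000)]

ring = HashRing(servers)
before = {k: ring.lookup(k) for k in keys}

ring.remove("server-3")
after = {k: ring.lookup(k) for k in keys}

# Keys whose owner changed after the removal.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys were reassigned")
```

Any node can compute `lookup` locally from the list of servers, so no central directory is needed, and a failure disturbs only the failed server's share of the keys.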

Author Biography

Zeliu Ding, Key Lab of Information System Engineering, School of Information Systems and Management, National University of Defense Technology, Changsha 410073, China

Department of Mathematics and Computer Science

Published

2014-09-18
