High Performance Computing Systems with Various Checkpointing Schemes

Nichamon Naksinehaboon, Mihaela Paun, Raja Nassar, Chokchai Box Leangsuksun, Stephen Scott

Abstract


Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To cope with failures, a fault-tolerant mechanism called the checkpoint/restart technique was introduced. However, this mechanism incurs additional costs. We therefore propose two models, one for the full checkpoint scheme and one for the incremental checkpoint scheme. The models, which are based on the reliability of the system, are used to determine checkpoint placements. Both models balance the checkpoint overhead against the re-computing time. Because each incremental checkpoint adds cost during the recovery period, we also give a method for finding the number of incremental checkpoints between two consecutive full checkpoints. Our simulation suggests that in most cases the incremental checkpoint model reduces the waste time more than the full checkpoint model does. The waste times produced by both models range from 2% to 28% of the application completion time, depending on the checkpoint overheads.
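The trade-off between checkpoint overhead and re-computing time can be illustrated with a minimal sketch. It does not implement the models proposed in the paper, whose details the abstract does not give. Instead it assumes the classic first-order approximation attributed to Young (optimal interval of about sqrt(2 * C * MTBF)) for periodic full checkpoints. The function names, the simple waste model, and the numbers in the example are assumptions made for illustration only.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """First-order estimate of the checkpoint interval (Young's approximation).

    checkpoint_cost and mtbf must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time wasted under periodic full checkpoints.

    Assumed first-order model: one checkpoint is paid per interval, and a
    failure loses on average half an interval of work.
    """
    overhead = checkpoint_cost / interval   # time spent writing checkpoints
    rework = interval / (2.0 * mtbf)        # expected re-computing time
    return overhead + rework


if __name__ == "__main__":
    cost, mtbf = 5.0, 1000.0                # minutes, hypothetical values
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, waste_fraction(interval, cost, mtbf))
```

Checkpointing more often than the estimated interval raises the overhead term, while checkpointing less often raises the re-computing term, so the waste grows in either direction away from the estimate.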

Keywords


Large-scale distributed system, reliability, fault-tolerance, checkpoint/restart model, HPC






DOI: https://doi.org/10.15837/ijccc.2009.4.2455



Copyright (c) 2017 Nichamon Naksinehaboon, Mihaela Paun, Raja Nassar, Chokchai Box Leangsuksun, Stephen Scott

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL (IJCCC), With Emphasis on the Integration of Three Technologies (C & C & C),  ISSN 1841-9836.
