October 2018, 14(4): 1423-1442. doi: 10.3934/jimo.2018014

Performance optimization of parallel-distributed processing with checkpointing for cloud environment

1. Graduate School of Informatics, Kyoto University, Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan

2. Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan

Received: December 2016. Revised: August 2017. Published: January 2018.

In cloud computing, the most successful application framework is parallel-distributed processing, in which an enormous task is split into a number of subtasks that are processed independently on a cluster of machines referred to as workers. Owing to the huge system scale, worker failures occur frequently in cloud environments, and failed subtasks cause a large processing delay for the task. One scheme to alleviate the impact of failures is checkpointing, in which the progress of a subtask is periodically recorded as a checkpoint, so that a failed subtask can be resumed by another worker from the latest checkpoint rather than from scratch. This method can reduce the processing delay of the task. However, frequent checkpointing incurs system overhead, and hence the checkpoint interval must be set properly. In this paper, we consider the optimal number of checkpoints that minimizes the task-processing time. We construct a stochastic model of parallel-distributed processing with checkpointing and approximately derive explicit expressions for the mean task-processing time and the optimal number of checkpoints. Numerical experiments reveal that the proposed approximations are sufficiently accurate for typical cloud-computing environments. Furthermore, the derived optimal number of checkpoints outperforms the result of a previous study in minimizing the task-processing time of parallel-distributed processing.
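The paper's own explicit expressions are not reproduced in this abstract. As a point of comparison, the classical first-order result of Young [16] gives an optimal checkpoint interval of roughly sqrt(2cf) for checkpoint cost c and mean time between failures f; the sketch below (an illustration of that baseline, not the authors' derivation) converts it into a number of checkpoints for a subtask of length b, using the parameter names of Table 1:

```python
import math

def young_optimal_checkpoints(b_hours, c_sec, f_days):
    """Illustrative baseline only (Young's first-order approximation,
    not this paper's formula): optimal checkpoint interval
    tau* ~ sqrt(2 * c * f), so a subtask of length b is cut into
    about b / tau* pieces, i.e. about b / tau* - 1 checkpoints."""
    b = b_hours * 3600.0          # subtask-processing time [sec]
    f = f_days * 86400.0          # mean time between failures [sec]
    tau = math.sqrt(2.0 * c_sec * f)
    return max(0, round(b / tau) - 1)
```

For the paper's baseline setting (b = 24 hours, c = 300 sec, f = 30 days) this yields a small number of checkpoints, consistent with the scale of the K values explored in the figures.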

Citation: Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance optimization of parallel-distributed processing with checkpointing for cloud environment. Journal of Industrial & Management Optimization, 2018, 14 (4) : 1423-1442. doi: 10.3934/jimo.2018014
References:

[1] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006.
[2] T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation, in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185.
[3] C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347. doi: 10.1016/j.ins.2014.01.015.
[4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010).
[5] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312. doi: 10.1016/j.future.2004.11.016.
[6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113. doi: 10.1145/1327452.1327492.
[7] J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009).
[8] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77. doi: 10.1145/1629175.1629198.
[9] S. Di, Y. Robert, F. Vivien, D. Kondo, C.-L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism, in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217.
[10] L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, in Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332. doi: 10.1109/ICDCS.2011.12.
[11] B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223. doi: 10.1016/j.jpdc.2013.04.002.
[12] H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, in Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534. doi: 10.1109/ICPP.2010.80.
[13] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, in Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699. doi: 10.1109/ICDCS.2011.29.
[14] P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145, 2011. doi: 10.6028/NIST.SP.800-145.
[15] M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, in Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120. doi: 10.1007/978-3-642-24669-2_11.
[16] J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531. doi: 10.1145/361147.361115.
[17] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, in Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438. doi: 10.1145/2517349.2522737.

Figure 1.  Processing of a subtask with the checkpointing method
Figure 2.  Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 3.  Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 4.  Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 5.  Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 6.  Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation
Figure 7.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among the previous analysis, the proposed analysis, and simulation
Figure 8.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among the previous analysis, the proposed analysis, and simulation
Figure 9.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among the previous analysis, the proposed analysis, and simulation
Figure 10.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among the previous analysis, the proposed analysis, and simulation
Figure 11.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among the previous analysis, the proposed analysis, and simulation
Figure 12.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 13.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 14.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 15.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 16.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures
Figure 17.  Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Table 1.  Parameter set.

Parameter | Description                        | Value
$M$       | Number of workers                  | $10$ to $1,000$
$b$       | Subtask-processing time            | $6$ to $120$ [hour]
$c$       | Time to make a checkpoint          | $30$ to $3,000$ [sec]
$f$       | Mean time between worker failures  | $7$ to $180$ [day]
$r$       | Time to resume a failed subtask    | $30$ to $3,000$ [sec]
$K$       | Number of checkpoints              | $0$ to $30$
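The simulation results referenced in the figures can be mimicked with a simple Monte Carlo sketch of the model described in the abstract: each of $M$ workers processes a subtask of $b$ seconds with $K$ equally spaced checkpoints of cost $c$, failures arrive with mean interval $f$, a failure rolls work back to the last checkpoint at resume cost $r$, and the task finishes when the slowest subtask does. This is a simplified illustration under those stated assumptions, not the paper's stochastic analysis:

```python
import random

def subtask_time(b, c, f, r, K, rng):
    """One subtask: b [sec] of work, K equally spaced checkpoints of
    cost c, exponential failures with mean f, resume cost r. A failure
    loses all work since the last checkpoint. Simplified sketch."""
    seg = b / (K + 1)                 # work between consecutive checkpoints
    total, done = 0.0, 0
    while done < K + 1:
        # time to finish this segment, plus a checkpoint except after the last
        need = seg + (c if done < K else 0.0)
        fail_in = rng.expovariate(1.0 / f)
        if fail_in >= need:
            total += need
            done += 1
        else:
            total += fail_in + r      # fail mid-segment: pay resume, redo it
    return total

def mean_task_time(M, b, c, f, r, K, runs=100, seed=1):
    """Task time = max over M parallel subtasks, averaged over runs."""
    rng = random.Random(seed)
    return sum(max(subtask_time(b, c, f, r, K, rng) for _ in range(M))
               for _ in range(runs)) / runs
```

Sweeping K in `mean_task_time` reproduces the qualitative U-shape of Figures 2-6: too few checkpoints lose much work per failure, too many pay excessive checkpointing overhead.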
