# American Institute of Mathematical Sciences

• Previous Article
Optimal liability ratio and dividend payment strategies under catastrophic risk
• JIMO Home
• This Issue
• Next Article
Compensation plan, pricing and production decisions with inventory-dependent salvage value, and asymmetric risk-averse sales agent
October  2018, 14(4): 1423-1442. doi: 10.3934/jimo.2018014

## Performance optimization of parallel-distributed processing with checkpointing for cloud environment

 1 Graduate School of Informatics, Kyoto University, Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan

Received  December 2016 Revised  August 2017 Published  October 2018 Early access  January 2018

Citation: Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance optimization of parallel-distributed processing with checkpointing for cloud environment. Journal of Industrial and Management Optimization, 2018, 14 (4) : 1423-1442. doi: 10.3934/jimo.2018014
##### References:
 [1] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. [2] T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. [3] C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347.  doi: 10.1016/j.ins.2014.01.015. [4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). [5] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312.  doi: 10.1016/j.future.2004.11.016. [6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113.  doi: 10.1145/1327452.1327492. [7] J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). [8] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77.  doi: 10.1145/1629175.1629198. [9] S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. [10] L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332.  doi: 10.1109/ICDCS.2011.12. [11] B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223.  doi: 10.1016/j.jpdc.2013.04.002. [12] H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534.  doi: 10.1109/ICPP.2010.80. [13] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699.  doi: 10.1109/ICDCS.2011.29. [14] P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145. [15] M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120.  doi: 10.1007/978-3-642-24669-2_11. [16] J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531.  doi: 10.1145/361147.361115. [17] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438.  doi: 10.1145/2517349.2522737.

show all references

##### References:
 [1] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006. [2] T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185. [3] C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347.  doi: 10.1016/j.ins.2014.01.015. [4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010). [5] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312.  doi: 10.1016/j.future.2004.11.016. [6] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113.  doi: 10.1145/1327452.1327492. [7] J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009). [8] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77.  doi: 10.1145/1629175.1629198. [9] S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217. [10] L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332.  doi: 10.1109/ICDCS.2011.12. [11] B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223.  doi: 10.1016/j.jpdc.2013.04.002. [12] H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534.  doi: 10.1109/ICPP.2010.80. [13] A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699.  doi: 10.1109/ICDCS.2011.29. [14] P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145. [15] M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120.  doi: 10.1007/978-3-642-24669-2_11. [16] J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531.  doi: 10.1145/361147.361115. [17] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438.  doi: 10.1145/2517349.2522737.
Processing of a subtask with checkpointing method
Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation
Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison between the results of previous and proposal analyses and simulation
Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures
Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures
Parameter set.
 Parameter Description Value $M$ Number of workers $10$ to $1,000$ $b$ Subtask-processing time $6$ to $120$ [hour] $c$ Time to make a checkpoint $30$ to $3,000$ [sec] $f$ Mean time between worker failures $7$ to $180$ [day] $r$ Time to resume a failed subtask $30$ to $3,000$ [sec] $K$ Number of checkpoints $0$ to $30$
 Parameter Description Value $M$ Number of workers $10$ to $1,000$ $b$ Subtask-processing time $6$ to $120$ [hour] $c$ Time to make a checkpoint $30$ to $3,000$ [sec] $f$ Mean time between worker failures $7$ to $180$ [day] $r$ Time to resume a failed subtask $30$ to $3,000$ [sec] $K$ Number of checkpoints $0$ to $30$
 [1] Tsuguhito Hirai, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of large-scale parallel-distributed processing with backup tasks for cloud computing. Journal of Industrial and Management Optimization, 2014, 10 (1) : 113-129. doi: 10.3934/jimo.2014.10.113 [2] Kyosuke Hashimoto, Hiroyuki Masuyama, Shoji Kasahara, Yutaka Takahashi. Performance analysis of backup-task scheduling with deadline time in cloud computing. Journal of Industrial and Management Optimization, 2015, 11 (3) : 867-886. doi: 10.3934/jimo.2015.11.867 [3] Xuena Yan, Shunfu Jin, Wuyi Yue, Yutaka Takahashi. Performance analysis and system optimization of an energy-saving mechanism in cloud computing with correlated traffic. Journal of Industrial and Management Optimization, 2021  doi: 10.3934/jimo.2021106 [4] Ji-Bo Wang, Bo Zhang, Hongyu He. A unified analysis for scheduling problems with variable processing times. Journal of Industrial and Management Optimization, 2022, 18 (2) : 1063-1077. doi: 10.3934/jimo.2021008 [5] Bin Zheng, Min Fan, Mengqi Liu, Shang-Chia Liu, Yunqiang Yin. Parallel-machine scheduling with potential disruption and positional-dependent processing times. Journal of Industrial and Management Optimization, 2017, 13 (2) : 697-711. doi: 10.3934/jimo.2016041 [6] Jiping Tao, Zhijun Chao, Yugeng Xi. A semi-online algorithm and its competitive analysis for a single machine scheduling problem with bounded processing times. Journal of Industrial and Management Optimization, 2010, 6 (2) : 269-282. doi: 10.3934/jimo.2010.6.269 [7] Shunfu Jin, Haixing Wu, Wuyi Yue, Yutaka Takahashi. Performance evaluation and Nash equilibrium of a cloud architecture with a sleeping mechanism and an enrollment service. Journal of Industrial and Management Optimization, 2020, 16 (5) : 2407-2424. doi: 10.3934/jimo.2019060 [8] Jinsong Xu. Reversible hidden data access algorithm in cloud computing environment. Discrete and Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1219-1232. doi: 10.3934/dcdss.2019084 [9] Serap Ergün, Bariş Bülent Kırlar, Sırma Zeynep Alparslan Gök, Gerhard-Wilhelm Weber. An application of crypto cloud computing in social networks by cooperative game theory. Journal of Industrial and Management Optimization, 2020, 16 (4) : 1927-1941. doi: 10.3934/jimo.2019036 [10] Weidong Bao, Haoran Ji, Xiaomin Zhu, Ji Wang, Wenhua Xiao, Jianhong Wu. ACO-based solution for computation offloading in mobile cloud computing. Big Data & Information Analytics, 2016, 1 (1) : 1-13. doi: 10.3934/bdia.2016.1.1 [11] Leiyang Wang, Zhaohui Liu. Heuristics for parallel machine scheduling with batch delivery consideration. Journal of Industrial and Management Optimization, 2014, 10 (1) : 259-273. doi: 10.3934/jimo.2014.10.259 [12] Chengwen Jiao, Qi Feng. Research on the parallel–batch scheduling with linearly lookahead model. Journal of Industrial and Management Optimization, 2021, 17 (6) : 3551-3558. doi: 10.3934/jimo.2020132 [13] Min Ji, Xinna Ye, Fangyao Qian, T.C.E. Cheng, Yiwei Jiang. Parallel-machine scheduling in shared manufacturing. Journal of Industrial and Management Optimization, 2022, 18 (1) : 681-691. doi: 10.3934/jimo.2020174 [14] Min-Fan He, Li-Ning Xing, Wen Li, Shang Xiang, Xu Tan. Double layer programming model to the scheduling of remote sensing data processing tasks. Discrete and Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1515-1526. doi: 10.3934/dcdss.2019104 [15] Yicheng Liu, Yipeng Chen, Jun Wu, Xiao Wang. Periodic consensus in network systems with general distributed processing delays. Networks and Heterogeneous Media, 2021, 16 (1) : 139-153. doi: 10.3934/nhm.2021002 [16] Hongtruong Pham, Xiwen Lu. The inverse parallel machine scheduling problem with minimum total completion time. Journal of Industrial and Management Optimization, 2014, 10 (2) : 613-620. doi: 10.3934/jimo.2014.10.613 [17] Min Zhang, Gang Li. Multi-objective optimization algorithm based on improved particle swarm in cloud computing environment. Discrete and Continuous Dynamical Systems - S, 2019, 12 (4&5) : 1413-1426. doi: 10.3934/dcdss.2019097 [18] Alireza Goli, Taha Keshavarz. Just-in-time scheduling in identical parallel machine sequence-dependent group scheduling problem. Journal of Industrial and Management Optimization, 2021  doi: 10.3934/jimo.2021124 [19] Arminda Moreno-Díaz, Gabriel de Blasio, Moreno-Díaz Jr.. Distributed, layered and reliable computing nets to represent neuronal receptive fields. Mathematical Biosciences & Engineering, 2014, 11 (2) : 343-361. doi: 10.3934/mbe.2014.11.343 [20] Yining Cao, Chuck Jia, Roger Temam, Joseph Tribbia. Mathematical analysis of a cloud resolving model including the ice microphysics. Discrete and Continuous Dynamical Systems, 2021, 41 (1) : 131-167. doi: 10.3934/dcds.2020219

2021 Impact Factor: 1.411