\`x^2+y_1+z_12^34\`
Advanced Search
Article Contents
Article Contents

Performance optimization of parallel-distributed processing with checkpointing for cloud environment

Abstract Full Text(HTML) Figure(17) / Table(1) Related Papers Cited by
  • In cloud computing, the most successful application framework is parallel-distributed processing, in which an enormous task is split into a number of subtasks and those are processed independently on a cluster of machines referred to as workers. Due to its huge system scale, worker failures occur frequently in cloud environment and failed subtasks cause a large processing delay of the task. One of schemes to alleviate the impact of failures is checkpointing method, with which the progress of a subtask is recorded as checkpoint and the failed subtask is resumed by other worker from the latest checkpoint. This method can reduce the processing delay of the task. However, frequent checkpointing is system overhead and hence the checkpoint interval must be set properly. In this paper, we consider the optimal number of checkpoints which minimizes the task-processing time. We construct a stochastic model of parallel-distributed processing with checkpointing and approximately derive explicit expressions for the mean task-processing time and the optimal number of checkpoints. Numerical experiments reveal that the proposed approximations are sufficiently accurate on typical environment of cloud computing. Furthermore, the derived optimal number of checkpoints outperforms the result of previous study for minimizing the task-processing time on parallel-distributed processing.

    Mathematics Subject Classification: Primary: 68M20, 90B22; Secondary: 60K30.

    Citation:

    \begin{equation} \\ \end{equation}
  • 加载中
  • Figure 1.  Processing of a subtask with checkpointing method

    Figure 2.  Mean task-processing time with respect to the number of checkpoints for various $M$ ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

    Figure 3.  Mean task-processing time with respect to the number of checkpoints for various $b$ ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

    Figure 4.  Mean task-processing time with respect to the number of checkpoints for various $c$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of analysis and simulation

    Figure 5.  Mean task-processing time with respect to the number of checkpoints for various $f$ ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of analysis and simulation

    Figure 6.  Mean task-processing time with respect to the number of checkpoints for various $r$ ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $c = 300$ [sec]): Comparison between the results of analysis and simulation

    Figure 7.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

    Figure 8.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

    Figure 9.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

    Figure 10.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison between the results of previous and proposal analyses and simulation

    Figure 11.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison between the results of previous and proposal analyses and simulation

    Figure 12.  Mean task-processing time with respect to $M$ for the optimal number of checkpoints ($b = 24$ [hour], $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

    Figure 13.  Mean task-processing time with respect to $b$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

    Figure 14.  Mean task-processing time with respect to $c$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $f = 30$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

    Figure 15.  Mean task-processing time with respect to $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

    Figure 16.  Mean task-processing time with respect to $r$ for the optimal number of checkpoints ($M = 100$, $c = 300$ [sec], $b = 24$ [hour], $f = 30$ [day]): Comparison among three distributions for the time intervals between consecutive worker failures

    Figure 17.  Mean task-processing time with respect to small $f$ for the optimal number of checkpoints ($M = 100$, $b = 24$ [hour], $c = 300$ [sec], $f = 1$ to $7$ [day], $r = 300$ [sec]): Comparison among three distributions for the time intervals between consecutive worker failures

    Table 1.  Parameter set.

    ParameterDescriptionValue
    $M$Number of workers$10$ to $1,000$
    $b$Subtask-processing time$6$ to $120$ [hour]
    $c$Time to make a checkpoint$30$ to $3,000$ [sec]
    $f$Mean time between worker failures$7$ to $180$ [day]
    $r$Time to resume a failed subtask$30$ to $3,000$ [sec]
    $K$Number of checkpoints$0$ to $30$
     | Show Table
    DownLoad: CSV
  •   L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009. doi: 10.2200/S00193ED1V01Y200905CAC006.
      T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009). doi: 10.1109/CLUSTR.2009.5289185.
      C. L. P. Chen  and  C.-Y. Zhang , Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014) , 314-347.  doi: 10.1016/j.ins.2014.01.015.
      T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010).
      J. T. Daly , A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006) , 303-312.  doi: 10.1016/j.future.2004.11.016.
      J. Dean  and  S. Ghemawat , MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008) , 107-113.  doi: 10.1145/1327452.1327492.
      J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009).
      J. Dean  and  S. Ghemawat , MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010) , 72-77.  doi: 10.1145/1629175.1629198.
      S. Di, Y. Robert, F. Vivien, D. Kondo, C. -L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013). doi: 10.1145/2503210.2503217.
      L. Fialho , D. Rexachs  and  E. Luque , What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011) , 322-332.  doi: 10.1109/ICDCS.2011.12.
      B. Javadi , D. Kondo , A. Iosup  and  D. Epema , The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013) , 1208-1223.  doi: 10.1016/j.jpdc.2013.04.002.
      H. Jin , Y. Chen , H. Zhu  and  X.-H. Sun , Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010) , 525-534.  doi: 10.1109/ICPP.2010.80.
      A. Martin , T. Knauth , S. Creutz , D. Becker , S. Weigert , C. Fetzer  and  A. Brito , Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011) , 689-699.  doi: 10.1109/ICDCS.2011.29.
      P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145,2011. doi: 10.6028/NIST.SP.800-145.
      M. Taifi , J. Y. Shi  and  A. Khreishah , SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011) , 109-120.  doi: 10.1007/978-3-642-24669-2_11.
      J. W. Young , A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974) , 530-531.  doi: 10.1145/361147.361115.
      M. Zaharia , T. Das , H. Li , T. Hunter , S. Shenker  and  I. Stoica , Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013) , 423-438.  doi: 10.1145/2517349.2522737.
  • 加载中

Figures(17)

Tables(1)

SHARE

Article Metrics

HTML views(1538) PDF downloads(289) Cited by(0)

Access History

Catalog

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return