|
L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, California, 2009.
doi: 10.2200/S00193ED1V01Y200905CAC006.
|
|
T. C. Bressoud and M. A. Kozuch, Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation, in Proc. the IEEE International Conference on Cluster Computing and Workshops (CLUSTER 2009), (2009).
doi: 10.1109/CLUSTR.2009.5289185.
|
|
C. L. P. Chen and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, 275 (2014), 314-347.
doi: 10.1016/j.ins.2014.01.015.
|
|
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, MapReduce online, in Proc. the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2010), (2010).
|
|
J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, 22 (2006), 303-312.
doi: 10.1016/j.future.2004.11.016.
|
|
J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), 107-113.
doi: 10.1145/1327452.1327492.
|
|
J. Dean, Designs, lessons and advice from building large distributed systems, in Keynote Presentation of the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 2009), (2009).
|
|
J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM, 53 (2010), 72-77.
doi: 10.1145/1629175.1629198.
|
|
S. Di, Y. Robert, F. Vivien, D. Kondo, C.-L. Wang and F. Cappello, Optimization of cloud task processing with checkpoint-restart mechanism, in Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 13), (2013).
doi: 10.1145/2503210.2503217.
|
|
L. Fialho, D. Rexachs and E. Luque, What is missing in current checkpoint interval models?, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 322-332.
doi: 10.1109/ICDCS.2011.12.
|
|
B. Javadi, D. Kondo, A. Iosup and D. Epema, The failure trace archive: Enabling the comparison of failure measurements and models of distributed systems, Journal of Parallel and Distributed Computing, 73 (2013), 1208-1223.
doi: 10.1016/j.jpdc.2013.04.002.
|
|
H. Jin, Y. Chen, H. Zhu and X.-H. Sun, Optimizing HPC fault-tolerant environment: An analytical approach, Proc. the 39th International Conference on Parallel Processing (ICPP 2010), (2010), 525-534.
doi: 10.1109/ICPP.2010.80.
|
|
A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert, C. Fetzer and A. Brito, Low-overhead fault tolerance for high-throughput data processing systems, Proc. the 31st International Conference on Distributed Computing Systems (ICDCS 2011), (2011), 689-699.
doi: 10.1109/ICDCS.2011.29.
|
|
P. Mell and T. Grance, The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-145, 2011.
doi: 10.6028/NIST.SP.800-145.
|
|
M. Taifi, J. Y. Shi and A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, Proc. the 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2011), (2011), 109-120.
doi: 10.1007/978-3-642-24669-2_11.
|
|
J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, 17 (1974), 530-531.
doi: 10.1145/361147.361115.
|
|
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker and I. Stoica, Discretized streams: Fault-tolerant streaming computation at scale, Proc. the 24th ACM Symposium on Operating Systems Principles (SOSP 2013), (2013), 423-438.
, 423-438.
doi: 10.1145/2517349.2522737.
|