Availability of the JobTracker Machine in Hadoop/MapReduce Implementations

ABSTRACT

Due to the growing demand for Cloud Computing services, the need for and importance of Distributed Systems cannot be overstated.

However, it is difficult to use the traditional Message Passing Interface (MPI) approach to implement synchronization and coordination, and to prevent deadlocks, in distributed systems.

This difficulty is lessened by the use of Apache's Hadoop/MapReduce and ZooKeeper to provide Fault Tolerance in a homogeneously distributed hardware/software environment.

In this thesis, a mathematical model for the availability of the JobTracker in Hadoop/MapReduce using ZooKeeper's Leader Election Service is examined.
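To make the role of the Leader Election Service concrete, the sketch below follows the standard ZooKeeper election recipe: each candidate creates an ephemeral, sequential znode under an election path, and the candidate holding the lowest sequence number leads. This is a minimal illustration only, not code from the thesis; the connection string, the /election parent path, and the class name are assumptions, and production code would additionally wait for the session to connect and watch its predecessor znode.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class JobTrackerElection {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (connection string assumed;
        // the /election parent znode must already exist).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

        // Each JobTracker candidate registers an ephemeral, sequential znode.
        // ZooKeeper deletes it automatically if the candidate's session dies.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the lowest sequence number leads.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        boolean leader = me.endsWith(candidates.get(0));

        System.out.println(leader ? "Elected: acting as the JobTracker"
                                  : "Not elected: standing by as a backup");
    }
}
```

Because the znodes are ephemeral, a crashed candidate's node disappears on session expiry, and the survivors can re-run the check to elect a replacement JobTracker.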

Though the availability is less than what is expected of a k-Fault-Tolerant system at higher hardware failure rates, this approach makes coordination and synchronization easy, reduces the effect of crash failures, and provides Fault Tolerance for distributed systems.
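For context, ZooKeeper's quorum rule (a standard property of the system, not a result of this thesis) is that an ensemble of N servers remains available as long as a majority survives, so it tolerates at most

\[
f_{\max} = \left\lfloor \frac{N-1}{2} \right\rfloor
\]

server crashes: an ensemble of 3 tolerates 1 failure, and an ensemble of 5 tolerates 2.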

The availability model starts with a Markov state diagram for a general case of N ZooKeeper servers, followed by specific cases of 3, 4, and 5 servers. Both software and hardware faults are considered, in addition to the effect of hardware and software repair rates.
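As a warm-up for the Markov treatment, consider the textbook two-state model of a single repairable host with a lumped failure rate \(\lambda\) and repair rate \(\mu\) (a minimal illustration; the thesis's full N-server model separates hardware and software failure and repair rates). Starting from the up state, the instantaneous availability is

\[
A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t},
\qquad
\lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu},
\]

so availability rises with the repair rate and falls with the failure rate, the same qualitative behaviour the full model exhibits.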

Comparisons show that the system availability changes with the number of ZooKeeper servers, with 3 servers giving the highest availability.

The model presented in this study can be used to decide how many servers are optimal for maximum availability and from which vendor they should be purchased. It can also help determine when a ZooKeeper-coordinated Hadoop cluster should be used to perform critical tasks.

TABLE OF CONTENTS

Declaration

Acknowledgement

List of Tables

List of Figures

Abstract

  • Introduction
    • Problem Statement
    • Objectives
    • Thesis Organization
  • Cloud Computing and Fault Tolerance
    • Cloud Computing
    • Types of Clouds
    • Virtualization in the Cloud
      • Advantages of Virtualization
    • Fault, Error and Failure
    • Fault Tolerance
      • Fault-Tolerance Properties
      • K Fault Tolerant Systems
      • Hardware Fault Tolerance
      • Software Fault Tolerance
    • Properties of a Fault Tolerant Cloud
      • Availability
      • Reliability
      • Scalability
  • Hadoop/MapReduce Architecture
    • Hadoop/MapReduce
    • MapReduce
    • Hadoop/MapReduce versus Other Systems
      • Relational Database Management Systems (RDBMS)
      • Grid Computing
      • Volunteer Computing
    • Features of MapReduce
      • Automatic Parallelization and Distribution of Work
      • Fault Tolerance in Hadoop/MapReduce
      • Cost Efficiency
      • Simplicity
    • Limitations of Hadoop/MapReduce
    • Apache’s ZooKeeper
      • ZooKeeper Data Model
      • ZooKeeper Guarantees
      • ZooKeeper Primitives
      • ZooKeeper Fault Tolerance
    • Related Work
  • Availability Model
    • JobTracker Availability Model
    • Model Assumptions
    • Markov Model for a Multi-Host System
      • The Parameter λs(t)
    • Markov Model for a Three-Host (N = 3) Hadoop/MapReduce Cluster Using ZooKeeper as Coordinating Service
    • Numerical Solution to the System of Differential Equations
    • Interpretation of the Availability Plot of the JobTracker
    • Discussion of Results
      • Sensitivity Analysis
  • Conclusion and Future Work
    • Conclusion
    • Future Work

Appendix

INTRODUCTION

The effectiveness of most modern information (data) processing depends on the ability to process huge datasets in parallel to meet stringent time constraints and organizational needs. A major challenge facing organizations today is the ability to organize and process the large volumes of data generated by their customers.

According to Nielsen Online [1], there are more than 1,733,993,741 internet users. How much data these users generate, and how it is processed, largely determines the success of the organization concerned.

Consider the social networking site Facebook: as of August 2011, it had over 750 million active users [2], who spend 700 billion minutes per month on the network.

They install over 20 million applications every day and interact with 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) each month. Since social plugins were launched in April 2010, an average of 10,000 new websites have integrated with Facebook.

The amount of data generated in Facebook is estimated as follows [3]:

  • 12 TB of compressed data added per day
  • 800 TB of compressed data scanned per day
  • 25,000 MapReduce jobs per day
  • 65 million files in HDFS
  • 30,000 simultaneous clients to the HDFS NameNode

It was a similar demand to process large datasets at Google that inspired its engineers to introduce MapReduce.

At Google, MapReduce is used to build the index for Google Search, cluster articles for Google News, and perform statistical machine translation. At Yahoo!, it is used to build the index for Yahoo! Search and for spam detection.
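To give a flavour of what such a job looks like, here is the canonical word-count example written against Hadoop's org.apache.hadoop.mapreduce API (a standard illustration, not code from the thesis): mappers emit (word, 1) pairs, and reducers sum the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into words and emit (word, 1).
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            // Sum all the counts observed for this word.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles the parallelization: input splits are assigned to mappers across the cluster, intermediate pairs are shuffled by key to the reducers, and failed tasks are re-executed automatically, which is precisely the Fault Tolerance property this thesis studies at the JobTracker level.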

REFERENCES

[1] Nielsen Online, http://www.nielsen.com/us/en.html; internet-user figures summarized at http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html

[2] Facebook Press Room Statistics, http://www.facebook.com/press/info.php?statistics

[3] Facebook has the world's largest Hadoop cluster, http://hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html

[4] Matei Zaharia, A Presentation on Cloud Computing with MapReduce and Hadoop, UC Berkeley AMP Lab, 2010.

[5] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc., 2004.

[6] Grant Mackey, Saba Sehrish, John Bent, Julio Lopez, Salman Habib, and Jun Wang, Introducing Map-Reduce to High End Computing, University of Central Florida, Los Alamos National Lab, Carnegie Mellon University.
