Installing Cloudera on Azure (part 1 of 2)

godatadriven/
02 December, 2015


Recently, GoDataDriven installed a Cloudera cluster on Microsoft Azure. In this article we provide some information about the use case, the implemented design and some basic information about Microsoft Azure. We discuss how you can install Cloudera on Azure and share best-practices about installing a distributed system on Azure.

Introduction

Processing large amounts of unstructured data requires serious computing power and also maintenance effort. As load on computing power typically fluctuates due to time and seasonal influences and/or processes running on certain times, a cloud solution like Microsoft Azure is a good option to be able to scale up easily and pay only for what is actually used.

The use-case: Data Science infrastructure for large European airport

In order to explore and analyze data, one of the largest European airports approached GoDataDriven to set-up a pre-production environment with adequate security measures, robust enough for production load. Cloudera was selected as Hadoop distribution to be installed on MS Azure.

The first use-case to motivate the installation of Cloudera was to be able to leverage the computing mechanisms offered to do predictive maintenance of the airports' assets. By collecting data from all the assets in the Cloudera cluster, the airport was able to develop predictive models for the maintenance based on power usage, visitor flow and sensor data.

Requirements & design

Requirements

The general requirements for our use case were:

And the following technical requirements were added:

Design

Our goal was to build a production ready Hadoop cluster, including provisioning machines. Of course there were also some security related requirements which we had to take into account. For example: Users and administrator had to use a VPN tunnel to be able to access the Azure environment. Single sign-on had to be enable on all machines and Sentry had to be configured.

So we ended up creating a design which is presented in the picture below:

Cloudera Enterprise Data Hub

The GatewaySubnet is needed to be able to set up the Site2Site VPN between the client’s network and the Azure network where the Hadoop cluster resides.

For user management we decided to set up two Active Directory servers, which are also Domain Name Server. These are placed in their own subnet.

Because of the high traffic between all nodes in the cluster, the Hadoop machines are in their own subnet. You might wonder why, but let us explain shortly how HDFS works: When you write a file into HDFS, this file is split into blocks (block size is usually 128 MB) and these blocks are placed on the Datanodes. Each block has a replication factor of 3. Only the master node (Namenode) knows which block belongs to which file. The Namenode does not store blocks, but it does maintain the active replication factor. If a client wants to read a file from HDFS, it will first contact the Namenode, get the location of the blocks and then read the blocks from the Datanodes.
The Datanodes send heartbeats to the Namenode and the when the active Namenode notices that a block hasn’t got the requested replication factor, it instructs another Datanode to copy that given block.

So now you can see that there is a lot of traffic between the Namenode - Datanode and also between Datanode – Datanode. And here we only mentioned the communication between one component, HDFS....

We chose to put the Hadoop cluster in its own subnet because like this we can also restrict better who can access this subnet.

We also have a ClientSubnet for the machines which can access the cluster. Users can connect to the machines in this subnet, do their analysis, but are not able to SSH to the machines in the Hadoop subnet.

By configuring single sign-on using the Active Directory on the Linux level and configuring Kerberos using Active Directory for the Hadoop services, users can use a single password everywhere.

To be able to provide a working environment to the customer, we had to learn a couple of things about Microsoft Azure, the preferred cloud provider of the customer.

Microsoft Azure basics to start a cluster

Microsoft Azure is a cloud service for both infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS), with data centers spanning the globe. We only needed the IaaS part to be able to set up the network and start machines, so that's what we'll explain further.

For Cloudera Enterprise deployments on Azure, the following service offerings are relevant:

Deployment modes

At the moment Azure has two deployment modes available:

  1. ASM (Azure Service Management)
  2. ARM (Azure Resource Manager)

The ASM API is the “old” or “classic” API , and correlates to the web portal. Azure Service Management is an XML-driven REST API, which adds some overhead to API calls, compared to JSON.

The Azure Resource Manager (ARM) API is a JSON-driven REST API to interact with Azure cloud resources. Microsoft recommends deploying in ARM mode.

One of the benefits of using the ARM API is that you can declare cloud resources as part of what’s called an “ARM JSON template.” An ARM JSON template is a specially-crafted JSON file that contains cloud resource definitions. Once the resources have been declared in the JSON template, the template file is deployed into a container called a Resource Group. An ARM Resource Group is a logical set of correlated cloud resources that roughly share a life span. Using the ARM mode you are able to deploy resources in parallel, which was a limitation in ASM.

The new Azure Ibiza Preview Portal is used to provision Azure cloud resources with ARM. If you provision cloud resources using the “old” Azure Portal, they are provisioned through the ASM API. You are not limited to the portal to deploy your templates. You can use the PowerShell or the Azure command-line interface to manage all Azure resources and deploy complete templates. The Azure CLI is based on NodeJs and thereby available on all environments. Both ARM and ASM are modes which can be configured using the CLI. You can check which mode you are working in using the

azure config

command and change the mode using the

azure config mode <arm/asm>
command.

When deciding which deployment model to use, you have to watch out, because resources deployed in the ASM mode cannot be seen by the resources deployed in the ARM mode by default. If you want to achieve this, you would need to create a VPN tunnel between the two VNets.

To understand the differences between ASM and ARM better, read the Azure Compute, Network, and Storage Providers under the Azure Resource Manager article.

How to install Cloudera on Azure?

There are multiple way in which you can install Cloudera on Azure. After some considerations we ended up with the following 2 scenarios:

  1. Install everything from scratch:

    • Provision machines and network using Azure CLI
    • Use a provisioning tool (we prefer Ansible) to do the Linux configuration
    • Install Cloudera Manager
    • Install CDH (Cloudera Distribution for Apache Hadoop) using Cloudera Manager
  2. Cloudera provides an ARM template which installs a Hadoop cluster, including OS and network tuning and Hadoop configuration tuning. The template can be found on GitHub. There is also an Azure VM image built and maintained by Cloudera which is used during deployment. This image is available on the Azure Marketplace. Out-of-the-box features of the template:

    • Create a Resource Group for all the components
    • Create VNet and subnets
    • Create availability sets. Place masters and workers in different availability sets
    • Create security groups
    • Create Masternode and Workernode instances using the Cloudera VM Image (CentOS image built and maintained by Cloudera). The template automatically uses Azure DS14 machines, which are the only machine types recommended and supported by Cloudera for Hadoop installations.
    • For each host a Premium Storage account is created
    • Add disks to the machines, format and mount the disks (10 data disks of 1 TB per node)
    • Set up forward/reverse lookup between hosts using /etc/hosts file
    • Tune Linux OS and network configurations like disable SELinux, disable IPtables, TCP tuning parameters, disable huge pages
    • Set up time synchronization to an external server (NTPD)
    • Set up Cloudera Manager and the database used by the Cloudera Manager
    • Set up Hadoop services using the Cloudera Python API

The template also has a disadvantage: it is meant to start up a cluster, but you cannot create extra data nodes and add them to the cluster. The template does not provision a gateway machine for you.

After analyzing the gaps between the template provided by Cloudera and the client requirements, we decided to chose a golden middle-way:

In out next blog we will discuss what were these gaps in more detail.

Best practices for a manual implementation

We also identified a couple of best practices which we would need to keep in mind if we wouldn’t use the template:

Further information

Earlier, on November 3rd 2015, GoDataDriven already presented a webinar about running Cloudera on MS Azure, which has been recorded and is available for viewing after registration.

To be continued...

In the next post we will elaborate further on the gaps we saw between our client requirements and the Cloudera template and we will show how we modified the Cloudera template to fulfill our requirements.

Subscribe to our newsletter

Stay up to date on the latest insights and best-practices by registering for the GoDataDriven newsletter.