Set up a Hadoop Cluster using Ansible

TAMANNA VERMA
6 min read · Mar 21, 2021

Ansible can be used to automate cloud provisioning, configuration management, deployment, and other IT operations just by writing playbooks. It is an amazing open-source tool that can increase our productivity at scale, saving us a lot of time when we need to perform configuration management on multiple nodes.

We’ll be automating a Hadoop cluster setup with the help of Ansible.

Setting up Ansible

First, run the ansible --version command to check which version of Ansible you have installed. If the command is not found, you can install Ansible using pip (Python needs to be installed for Ansible). To install Ansible with pip, run pip3 install ansible.

Next, we need to create an inventory file that holds the IP addresses of all our managed nodes. The inventory file can live at any location (e.g. vi /root/ipINV.txt), preferably in the same directory where you will later keep your Ansible configuration file.

In the inventory, add your managed node info in the following format:

[namenode]
<IP address of namenode> ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh
[datanode]
<IP address of datanode> ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh

[namenode] and [datanode] are labels that we can use when writing our playbook. You can name your labels as you like.

Next, create a directory for your Ansible configuration file:

[root@localhost ~]# mkdir /etc/ansible

In this directory, create a configuration file, vim ansible.cfg, and add the following content:

[defaults]
inventory = <path to inventory file>
host_key_checking = False

Connecting over SSH with a password depends on a utility called sshpass. To install it on Red Hat 8, enter dnf install sshpass (this package comes from the EPEL repository, so make sure yum is configured with epel-release).

With this, our setup for Ansible is complete. We can now start writing our playbook to configure Hadoop.
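Before moving on, it is worth confirming that Ansible can actually reach both managed nodes. The ping module verifies the SSH login and that Python is available on the managed node (it is not an ICMP ping):

ansible all -m ping

If each host replies with pong, the inventory and SSH credentials are working.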

Creating the Playbook

Create a directory as your workspace, for example mkdir /hadoopws. Inside this workspace, create a playbook (extension .yml), for example vim hadoop.yml.

Our first line will have the hosts keyword, which takes the label of the host group on which you wish to perform the tasks. Let's start by copying the JDK and Hadoop installation files to both the Name Node and the Data Node.
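As a minimal sketch, the opening of the first play looks like this (the tasks themselves are covered in the sections below):

- hosts: namenode, datanode
  tasks:
    # tasks that run on both the name node and the data node go here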

  • Copying Installation files for JDK and Hadoop

In my case, I copied the JDK and Hadoop installation files saved on my Controller Node into the Managed Nodes, but you can also use other Ansible modules like get_url to download the files directly from a given URL.

Here I use loop and the built-in variable item to copy multiple files.
Be sure to download the JDK supported by the Hadoop version you install; refer to the Hadoop documentation to check which versions are compatible.
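This is the copy task from the complete playbook further down; the paths assume the installers sit in /root/Downloads on the Controller Node:

- name: "Copying Installation Files"
  copy:
    src: "{{ item }}"
    dest: "/root/Downloads/"
  loop:
    - /root/Downloads/jdk-8u171-linux-x64.rpm
    - /root/Downloads/hadoop-1.2.1-1.x86_64.rpm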

You can run your playbook using ansible-playbook <playbook name> to see if everything is working. If it is, your Data Node and Name Node should both have the files saved in the destination you gave.

  • Installing Java and Hadoop and stopping Firewall

We could use the yum module to install a package by passing its path as the name attribute, but the Hadoop installation needs the --force option, which is why we use the command module instead.

We’ll also stop the firewalld service on both nodes so that they can connect to each other when we start the namenode and datanode.
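Here are the corresponding tasks from the complete playbook; ignore_errors: yes keeps the play going if, say, a package is already installed or firewalld is not running:

- name: "Installing Java and Hadoop"
  ignore_errors: yes
  command: "rpm -i {{ item }}"
  loop:
    - /root/Downloads/jdk-8u171-linux-x64.rpm
    - /root/Downloads/hadoop-1.2.1-1.x86_64.rpm --force

- name: "Stopping firewalld service"
  ignore_errors: yes
  command: "systemctl stop firewalld"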

  • Configuring Name Node

In the name node play, we take input for our directory name using the vars_prompt keyword. Our tasks consist of:
- the file module, to create the directory,
- the lineinfile module, to insert the configuration lines into the hdfs-site.xml and core-site.xml files under /etc/hadoop/. We use the groups['namenode'][0] variable to fetch the IP of the name node from the inventory file; this works because we labelled our name node IP as 'namenode' (see the example task after this list),
- finally, the command module to format and start the name node, with the commands hadoop namenode -format -force and hadoop-daemon.sh start namenode.
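For example, this task from the playbook below writes the name node's address into core-site.xml using that group variable:

- name: "Configuring core-site.xml in Name Node"
  lineinfile:
    path: "/etc/hadoop/core-site.xml"
    insertafter: "<configuration>"
    line: "<property>\n\t<name>fs.default.name</name>\n\t<value>hdfs://{{ groups['namenode'][0] }}:9001</value>\n</property>"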

  • Configuring Data Node

We perform the same operations on the Data Node as on the Name Node, except that we use a different variable name for the directory, and hdfs-site.xml gets dfs.data.dir instead of dfs.name.dir. After this we can start the datanode directly, with no formatting step.

Complete Playbook

- hosts: namenode, datanode
  tasks:
    - name: "Copying Installation Files"
      copy:
        src: "{{ item }}"
        dest: "/root/Downloads/"
      loop:
        - /root/Downloads/jdk-8u171-linux-x64.rpm
        - /root/Downloads/hadoop-1.2.1-1.x86_64.rpm

    - name: "Installing Java and Hadoop"
      ignore_errors: yes
      command: "rpm -i {{ item }}"
      loop:
        - /root/Downloads/jdk-8u171-linux-x64.rpm
        - /root/Downloads/hadoop-1.2.1-1.x86_64.rpm --force

    - name: "Stopping firewalld service"
      ignore_errors: yes
      command: "systemctl stop firewalld"

- hosts: namenode
  vars_prompt:
    - name: nndir
      private: no
      prompt: "Enter location directory path and name for Name Node"
  tasks:
    - name: "Creating Name Node Directory"
      file:
        state: directory
        path: "{{ nndir }}"

    - name: "Configuring hdfs-site.xml in Name Node"
      lineinfile:
        path: "/etc/hadoop/hdfs-site.xml"
        insertafter: "<configuration>"
        line: "<property>\n\t<name>dfs.name.dir</name>\n\t<value>{{ nndir }}</value>\n</property>"

    - name: "Configuring core-site.xml in Name Node"
      lineinfile:
        path: "/etc/hadoop/core-site.xml"
        insertafter: "<configuration>"
        line: "<property>\n\t<name>fs.default.name</name>\n\t<value>hdfs://{{ groups['namenode'][0] }}:9001</value>\n</property>"

    - name: "Formatting Name Node Directory"
      ignore_errors: yes
      command: "hadoop namenode -format -force"

    - name: "Starting Name Node daemon"
      ignore_errors: yes
      command: "hadoop-daemon.sh start namenode"

- hosts: datanode
  vars_prompt:
    - name: dndir
      private: no
      prompt: "Enter location directory path and name for Data Node"
  tasks:
    - name: "Creating Data Node Directory"
      file:
        state: directory
        path: "{{ dndir }}"

    - name: "Configuring hdfs-site.xml in Data Node"
      lineinfile:
        path: "/etc/hadoop/hdfs-site.xml"
        insertafter: "<configuration>"
        line: "<property>\n\t<name>dfs.data.dir</name>\n\t<value>{{ dndir }}</value>\n</property>"

    - name: "Configuring core-site.xml in Data Node"
      lineinfile:
        path: "/etc/hadoop/core-site.xml"
        insertafter: "<configuration>"
        line: "<property>\n\t<name>fs.default.name</name>\n\t<value>hdfs://{{ groups['namenode'][0] }}:9001</value>\n</property>"

    - name: "Starting Data Node daemon"
      ignore_errors: yes
      command: "hadoop-daemon.sh start datanode"

Run the playbook with the command ansible-playbook <playbook-name>.

To confirm that the data node is contributing storage to the cluster, run hadoop dfsadmin -report on either the name node or the data node.

More data nodes can be added by simply adding their IP addresses to the inventory file under the data node label.
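For example, with two (hypothetical) data node IPs, the inventory section would look like this:

[datanode]
192.168.1.11 ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh
192.168.1.12 ansible_user=root ansible_ssh_pass=<password> ansible_connection=ssh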

Keep in mind that running this playbook repeatedly may not work if leftover Hadoop configuration or cache files are still on the system. So, to re-run the playbook for a new cluster, make sure you delete the /etc/hadoop/ directory first (the name node and data node directories, e.g. /nn and /dn, should be deleted too if you are reusing the same directory names as before).
