HDInsight Creating Local OS and Ambari users via the REST API

HDInsight is a semi-managed Hadoop cluster on the Microsoft Azure cloud. Although the standard version isn't geared towards multiple users from a security perspective I recently had to figure out a way to create local users across the cluster at cluster build time. This blog post describes one way you could do this.

Overview


 The way we will create users on the cluster at boot time is:

  • Create a CSV file containing usernames,user and group ids, shell etc.
  • Store the CSV file on the default Azure Storage account
  • Attach the Storage account to the HDInsight cluster
  • Deploy the cluster with an ARM template that uses a custom script which creates the user accounts
  • The script
    • Determines the cluster name
    • Based on the cluster name it looks for a file named <clustername>-user-list.csv on the storage account
    • Copies the file to the node and iterates through the lines in the file and:
      • Creates local OS users;
      • Create local OS groups for the users;
      • If the users is an admin user it adds them to sudoers;
      • Creates user accounts in Ambari - note that in a HDInsight standard cluster the user accounts in Ambari are separate to the OS level accounts;
      • Creates Pig and Oozie views if they do not already exist;
      • Adds the user to either the
        • clusteruser group which will have read only access to cluster stats/configuration and access to Hive views etc in Ambari
        • clusteradministrator which will have full access to manage everything through Ambari
        • grants access to various Ambari views to the aforementioned groups


The CSV file containing the user details includes the uid, we do this to ensure the users have the same uid across all nodes in the cluster.

The creation of Ambari users, groups, checking membership of the Ambari groups and creating views makes heavy use of the Ambari REST API.

At the moment the script lists all the storage accounts associated with the cluster and finds one that contains the string artifacts and looks on this storage account for the user list CSV file. The script needs to be improved by making the storage account and container names parameters.

I have uploaded the script to github here https://github.com/vijayjt/AzureHDInsight/blob/master/script-actions/create-local-users.sh


Ambari REST API


As mentioned the script makes heavy use of the Ambari REST API.


  • Checking a user exists
    • We do this by calling the REST API endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/users
    • Then iterating to through the users returned to see if the username is in this list
    • Lines 2 - 16 in the gist shown below provides some example code for checking if a user exists in Ambari (this is an extract from the full script mentioned above).
  • Check if a user is a member of an Ambari Group
    • We call the REST API endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups/${GROUP_TO_CHECK}/members
    • To obtain a list of users that are a member of the specified group 
    • Then we iterate through the list to see if the user is in the list
  • Adding a user to Ambari
    • As shown in line 88 of the gist, we make a post request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/users with a JSON body that contains the username and password
  • Adding a group to Ambari
    • As shown in line 91 of the gist, we make a post request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups with a JSON body that contains the group name
  • Adding a user to a group in Ambari
    • As shown in line 94 of the gist we make a post request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups/${ambarigroup}/members with a JSON body that contains the username and group name


Important considerations

PAM Configuration

It should be noted that Microsoft do modify the pam configuration files which can lead to an issue where the passwd <username> command will prompt you twice because they have added kerberos related configuration items to PAM. I suspect they made this change for Azure HDInsight Premium which supports domain joined clusters but accidentally included this on Standard clusters as well. I will write a separate post on how to change the PAM configuration.

Force password change

When the account is created you should force the user to change their password shortly afterwards.

Security of user list file on Azure storage

The user passwords are stored in plaintext on the Azure storage account. This is one of the reasons we delete the file after cluster provisioning. If you were to leave it on the storage account any user on the cluster will be able to view it using hdfs commands.

If you persisted the script and you scale the cluster from the Azure portal and add additional nodes you will need to reinstate the file otherwise the local OS users will not be created.

It goes without saying but you should ensure storage account encryption is enabled and the container is private.

As an alternative you could look to distribute SSH keys instead of using password authentication.

What other options are there for provisioning users on HDInsight Standard?

If you are using a configuration management tool such as Chef, Ansible or Puppet, you could create the accounts via one of these tools and also use it to distribute SSH keys so that no passwords are involved. 

If you do take this approach you need to be careful that there are no scripts that rely upon the chef/ansible code to execute first and create the users otherwise the script actions may fail or you may end up with a race condition.

Comments