This post is for Big Data cluster admins who want to take a stock of which active directory group has access to which hive/impala database in the cluster.
In Hadoop Clusters the most common and recommended way of access control is via groups where in central active directory groups are created (on windows environment) with unix attributes (like group id, home directory and login shell, etc) set properly so that they are relevant on unix environment as well. Once these groups are created the new/existing users who need access to the hadoop environment or hive tables are provided required access by adding these users to the identified groups.
Ofcourse they should also have the corresponding unix attributes set as well like userid, groupid (of which the user is member of), home directory and the login shell. This is required because unix doesn’t work based on names (like cn or common name in active directory) but it is all based on ids in Unix. So even if you define groups and add users to those groups you can’t use them for access control in unix until unix attributes are defined for those groups or users in active directory itself. Then it is the job of unix admins to have SSSD + PAM module installed on the hadoop cluster which help in integrating these two disjoint operating systems.
So far so good? The above description is required to help you understand how and from where these group definition come from in most of the live clusters.
Now with HiveServer2 we can use HQL (Hive SQL or Hive Query Language) to GRANT or REVOKE privileges to a database or a table or even a column of the given table too. Sentry basically ties active directory to a hive database (or table or a column) via a role. You can understand a role in sentry as a bridge between two disjoint entries. Imagine a pub-sub model of message broker where there is a topic to which a publisher publishes the message and is not worried about when and who would read the message (i.e. fire and forget) and at the other end there are subscribers of the same topic which get the message as per their own convenience with an additional benefit of adding more subscribers later or remove the existing subscriptions.
Sentry architecture bears a remarkable similarity (to certain extent) to this pub-sub model where in access levels (all, select, etc) on a database are assigned to a role and the role is assigned to one or more different active directory groups. Obviously you can change the access levels assigned to a role which means all groups to which this role is associated with would automatically have changed permissions. And at the same time you can add/delete groups to this role enabling/disabling all the users in that group. Pictorially it is something like
Some times based on the requirement we multitask a single sentry role using which we provide multiple groups with access on multiple databases. In this the sentry configuration could be as complex as:
The HQL commands for these actions look like:
- Creating a sentry role: CREATE ROLE [role_name];
- Providing database access to a role: GRANT SELECT ON DATABASE database_name TO ROLE role_name;
- Associating a group with the role: GRANT ROLE role_name TO GROUP <groupName>;
The overall summary of the entire process is:
- Have the required groups created in central active directory (with unix attributes)
- Get the appropriate users added to the relevant groups (with unix attributes) like RiskBusinessSelect, RiskBusinessAll
- Create Sentry role(s)
- Provide relevant access to sentry role(s) on database(s) (or its resources)
- Associate a Sentry role with an active directory group (obviously as per the access requirement)
Till this point is seems all streamlined right from creating the group in AD till their association with an appropriate role in Apache Sentry where in you can easily find out which all roles are associated with which databases and groups to roles association as well. The simple commands for these are:
- List all the active roles for current logged in user (only admin can see all the roles): SHOW CURRENT ROLES
- Role to database association: SHOW GRANT ROLE <roleName> on <DATABASE|TABLE><db | table Name>;
- Group to role association: SHOW ROLE GRANT GROUP <groupName>;
In case you have observed you can get database to role mapping and role to group mapping but not something like database to group mapping. Imagine a scenario, may be for audit reports, where you have to get details on which all user groups have active access to hive databases along with the access level. You might think of using sub queries and get this associative data back using relevant joins but unfortunately it doesn’t work that way as sub queries are not select based (see carefully they all start with Show) so trying something like
select rln.role_name from ( SHOW CURRENT ROLES ) rln;
will not work here. So what do we do here? Is there any way out?
Yes we have a back-end based solution where we can get details from the sentry database itself. You can get the details of the database and its credentials from Ambari (if you are using Hortonworks) or Cloudera Manager (if you are using Cloudera Distribution of Hadoop). These are the main Sentry Tables:
- SENTRY_GROUP: Stores the group names with local sequential ids for the groups used in sentry privileges
- SENTRY_ROLE: Stores the name and id of the roles created.
- SENTRY_ROLE_GROUP_MAP: Stores the mapping between Sentry roles and Sentry groups using the ids defined in above two tables
- SENTRY_DB_PRIVILEGE: Stores the hive object details (database or table) with the assigned access level.
- SENTRY_ROLE_DB_PRIVILEGE_MAP: It is a mapping between Sentry role (#2 above) and the the provided access to database objects (#4 above)
You can simply write a Unix script to connect to this database and produce the desired output. The SQL query for the desired information could look like (output would be a single column with comma separated values so that it can be saved in csv directly):
SELECT DB_PRV.DB_NAME || ',' || GRP.GROUP_NAME || ',' || REPLACE(DB_PRV.ACTION,'*','all')
from SENTRY_DB_PRIVILEGE DB_PRV
INNER JOIN SENTRY_ROLE_DB_PRIVILEGE_MAP ROLE_TO_DB_PRV ON
DB_PRV.DB_PRIVILEGE_ID = ROLE_TO_DB_PRV.DB_PRIVILEGE_ID
AND upper(PRIVILEGE_SCOPE) = 'DATABASE'
INNER join SENTRY_ROLE_GROUP_MAP ROLE_TO_GROUP ON
ROLE_TO_GROUP.ROLE_ID = ROLE_TO_DB_PRV.ROLE_ID
INNER join SENTRY_GROUP GRP ON
GRP.GROUP_ID = ROLE_TO_GROUP.GROUP_ID
ORDER BY DB_NAME
A couple of points:
- In case you want to search for only tables then use value for PRIVILEGE_SCOPE column as TABLE (AND upper(PRIVILEGE_SCOPE) = ‘TABLE’ ).
- In case you want to limit the search and run the script with predefined set of databases or tables then you can add a where clause before ORDER BY as: WHERE trim(DB_PRV.DB_NAME) IN (@replacewithvalues@)
I hope you would find this article useful. Do let me know your feedback or comments.