Data masking and granular access control using Amazon Macie and AWS Lake Formation
<p>Companies collect user data to offer new products, recommend options more relevant to a user’s profile, or, in the case of financial institutions, to facilitate access to higher credit lines or lower interest rates. However, personal data is sensitive: it can identify the person using a specific system or application, and in the wrong hands it might be used in unauthorized ways. Governments and organizations have created laws and regulations, such as the General Data Protection Regulation (GDPR) in the EU, the General Data Protection Law (LGPD) in Brazil, and technical guidance such as the Cloud Computing Implementation Guide published by the Association of Banks in Singapore (ABS), that specify what constitutes sensitive data and how companies should manage it. A common requirement is to ensure that consent is obtained for the collection and use of personal data and that any data collected is anonymized to protect consumers from data breach risks.</p>
<p>In this blog post, we walk you through a proposed architecture that implements data anonymization by using granular access controls according to well-defined rules. It covers a scenario where a user might not have read access to data, but an application does. A common use case for this scenario is a data scientist working with sensitive data to train machine learning models. The training algorithm would have access to the data, but the data scientist would not. This approach helps reduce the risk of data leakage while enabling innovation using data.</p>
<h2>Prerequisites</h2>
<p>To implement the proposed solution, you must have an active AWS account and <a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener">AWS Identity and Access Management (IAM)</a> permissions to use the following services:</p>
<ul>
<li>Amazon Macie</li>
<li>AWS Lake Formation</li>
<li>AWS Glue</li>
<li>AWS Database Migration Service (AWS DMS)</li>
<li>Amazon Relational Database Service (Amazon RDS)</li>
<li>Amazon S3</li>
<li>Amazon Athena</li>
<li>AWS Key Management Service (AWS KMS)</li>
<li>Amazon EventBridge</li>
<li>Amazon Kinesis Data Firehose</li>
<li>AWS Lambda</li>
<li>AWS CloudFormation</li>
</ul>
<blockquote>
<p><strong>Note: </strong>If there’s a pre-existing Lake Formation configuration, there might be permission issues when testing this solution. We suggest that you test this solution on a development account that doesn’t yet have Lake Formation active. If you don’t have access to a development account, see more details about the permissions required on your role in the <a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html" target="_blank" rel="noopener">Lake Formation documentation</a>.</p>
</blockquote>
<p>You must give permission for AWS DMS to create the necessary resources, such as the EC2 instance where you will run DMS tasks. If you have ever worked with DMS, this permission should already exist. Otherwise, you can use CloudFormation to create the necessary roles to deploy the solution. To see if the permission already exists, open the AWS Management Console, go to <a href="https://console.aws.amazon.com/iam/" target="_blank" rel="noopener">IAM</a>, select <strong>Roles</strong>, and check whether there is a role called <strong>dms-vpc-role</strong>. If not, <a href="https://docs.aws.amazon.com/dms/latest/userguide/security-iam.html#CHAP_Security.APIRole" target="_blank" rel="noopener">you must create the role</a> during deployment.</p>
<p>We use the Faker library to create dummy data consisting of the following tables:</p>
<ul>
<li>Customer</li>
<li>Bank</li>
<li>Card</li>
</ul>
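<p>As an illustration of how this dummy data might be generated, the following is a minimal sketch using Faker. The column names, output file, and record count are assumptions based on the tables described later in this post; the actual generation script is included in the GitHub repository referenced in the solution implementation.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import csv
from faker import Faker

fake = Faker()

# Generate a small, fictitious customer table. Column names are illustrative.
with open("customer.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "username", "name", "mail", "gender", "birthdate"])
    for i in range(100):
        writer.writerow([
            i,
            fake.user_name(),
            fake.name(),
            fake.free_email(),
            fake.random_element(elements=("F", "M")),
            fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        ])</code></pre>
</div>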
<h2>Solution overview</h2>
<p>This architecture allows multiple data sources to send information to the data lake environment on AWS, where Amazon S3 is the central data store. After the data is stored in an S3 bucket, Macie analyzes the objects and identifies sensitive data using machine learning (ML) and pattern matching. AWS Glue then uses the information to run a workflow to anonymize the data.</p>
<div id="attachment_33201" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33201" src="https://infracom.com.sg/wp-content/uploads/2024/01/img1-1-1.jpg" alt="Figure 1: Solution architecture for data ingestion and identification of PII" width="780" class="size-full wp-image-33201">
<p id="caption-attachment-33201" class="wp-caption-text">Figure 1: Solution architecture for data ingestion and identification of PII</p>
</div>
<p>We will describe two techniques used in the process: data masking and data encryption. After the workflow runs, the data is stored in a separate S3 bucket. This hierarchy of buckets is used to segregate access to data for different user personas.</p>
<p>Figure 1 depicts the solution architecture:</p>
<ol>
<li>The data source in the solution is an Amazon RDS database. Data can be stored in a database on an EC2 instance, in an on-premises server, or even deployed in a different cloud provider.</li>
<li>AWS DMS uses full load, which allows data migration from the source (an Amazon RDS database) into the target S3 bucket — <span>dcp-macie</span> — as a one-time migration. New objects uploaded to the S3 bucket are automatically encrypted using <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/specifying-s3-encryption.html" target="_blank" rel="noopener">server-side encryption (SSE-S3)</a>.</li>
<li>A personally identifiable information (PII) detection pipeline is invoked after the new Amazon S3 objects are uploaded. Macie analyzes the objects and identifies values that are sensitive. Users can manually identify which fields and values within the files should be classified as sensitive or use the Macie automated sensitive data discovery capabilities.</li>
<li>The sensitive values identified by Macie are sent to EventBridge, invoking <a href="https://aws.amazon.com/kinesis/data-firehose/" target="_blank" rel="noopener">Kinesis Data Firehose</a> to store them in the <span>dcp-glue</span> S3 bucket. AWS Glue uses this data to know which fields to mask or encrypt using an encryption key stored in AWS KMS.
<ol>
<li>Using EventBridge enables an event-based architecture. EventBridge is used as a bridge between Macie and Kinesis Data Firehose, integrating these services.</li>
<li>Kinesis Data Firehose supports data buffering, which mitigates the risk of information loss when findings are sent by Macie while reducing the overall cost of storing data in Amazon S3. It also allows data to be sent to other locations, such as <a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener">Amazon Redshift</a> or Splunk, making it available to be analyzed by other products. A sketch of the EventBridge rule that connects Macie to Kinesis Data Firehose follows this list.</li>
</ol> </li>
<li>At the end of this step, Amazon S3 invokes a Lambda function that starts the AWS Glue workflow, which masks and encrypts the identified data.
<ol>
<li>AWS Glue starts a crawler on the S3 bucket <span>dcp-macie</span> (a) and the bucket <span>dcp-glue</span> (b) to populate two tables, respectively, created as part of the AWS Glue service.</li>
<li>After that, a Python script is run (c), querying the data in these tables. It uses this information to mask and encrypt the data and then store it in the prefixes <span>dcp-masked</span> (d) and <span>dcp-encrypted</span> (e) in the bucket <span>dcp-athena</span>.</li>
<li>The last step in the workflow is to run a crawler over each of these prefixes (f) and (g), creating their respective tables in the AWS Glue Data Catalog.</li>
</ol> </li>
<li>To enable fine-grained access to data, Lake Formation maps permissions to the tags you have configured. The implementation of this part is described further in this post.</li>
<li><a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener">Athena</a> can be used to query the data. Other tools, such as Amazon Redshift or <a href="https://aws.amazon.com/quicksight/" target="_blank" rel="noopener">Amazon Quicksight</a> can also be used, as well as third-party tools.</li>
</ol>
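<p>Steps 4a and 4b rely on an EventBridge rule that matches Macie findings and forwards them to the Kinesis Data Firehose delivery stream. The CloudFormation template creates this rule for you; the following is a minimal boto3 sketch of an equivalent rule, with hypothetical names and placeholder ARNs.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import json
import boto3

events = boto3.client("events")

# Match Macie findings published to the default event bus.
events.put_rule(
    Name="macie-findings-to-firehose",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.macie"],
        "detail-type": ["Macie Finding"],
    }),
    State="ENABLED",
)

# Forward matched events to the Firehose delivery stream (placeholder ARNs).
events.put_targets(
    Rule="macie-findings-to-firehose",
    Targets=[{
        "Id": "firehose-target",
        "Arn": "arn:aws:firehose:REGION:ACCOUNT_ID:deliverystream/DELIVERY_STREAM_NAME",
        "RoleArn": "arn:aws:iam::ACCOUNT_ID:role/EVENTBRIDGE_TO_FIREHOSE_ROLE",
    }],
)</code></pre>
</div>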
<p>If a user lacks permission to view sensitive data but needs to access it for machine learning model training purposes, AWS KMS can be used. The AWS KMS service manages the encryption keys that are used for data masking and to give access to the training algorithms. Users can see the masked data, but the algorithms can use the data in its original form to train the machine learning models.</p>
<p>This solution uses three personas:</p>
<ul>
<li><strong>secure-lf-admin:</strong> Data lake administrator. Responsible for configuring the data lake and assigning permissions to data administrators.</li>
<li><strong>secure-lf-business-analyst:</strong> Business analyst. No access to certain confidential information.</li>
<li><strong>secure-lf-data-scientist:</strong> Data scientist. No access to certain confidential information.</li>
</ul>
<h2>Solution implementation</h2>
<p>To facilitate implementation, we created a CloudFormation template. The template and other artifacts produced can be found in this <a href="https://github.com/aws-samples/data-masking-fine-grained-access-using-aws-lake-formation" target="_blank" rel="noopener">GitHub</a> repository. You can use the CloudFormation dashboard to review the output of all the deployed features.</p>
<p>Choose the following <strong>Launch Stack</strong> button to deploy the CloudFormation template.</p>
<p><a href="https://aws.amazon.com/blogs/security/data-masking-and-granular-access-control-using-amazon-macie-and-aws-lake-formation/URL%20for%20link%20goes%20here" rel="noopener noreferrer" target="_blank"><img loading="lazy" src="https://d2908q01vomqb2.cloudfront.net/22d200f8670dbdb3e253a90eee5098477c95c23d/2019/06/05/launch-stack-button.png" alt="Select this image to open a link that starts building the CloudFormation stack" width="190" height="36" class="aligncenter size-full wp-image-10149"></a></p>
<h3 id="deploy_the_cloudformation_template">Deploy the CloudFormation template</h3>
<p>To deploy the CloudFormation template and create the resources in your AWS account, follow the steps below.</p>
<ol>
<li>After signing in to the AWS account, deploy the CloudFormation template. On the <strong>Create stack</strong> window, choose <strong>Next</strong>.
<div id="attachment_33204" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33204" src="https://infracom.com.sg/wp-content/uploads/2024/01/img2.jpg" alt="Figure 2: CloudFormation create stack screen" width="740" class="size-full wp-image-33204">
<p id="caption-attachment-33204" class="wp-caption-text">Figure 2: CloudFormation create stack screen</p>
</div> </li>
<li>In the following section, enter a name for the stack. Enter a password in the <strong>TestUserPassword</strong> field for Lake Formation personas to use to sign in to the console. When finished filling in the fields, choose <strong>Next</strong>.</li>
<li>On the next screen, review the selected options and choose <strong>Next</strong>.</li>
<li>In the last section, review the information and select <strong>I acknowledge that AWS CloudFormation might create IAM resources with custom names</strong>. Choose <strong>Create Stack</strong>.
<div id="attachment_33226" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33226" src="https://infracom.com.sg/wp-content/uploads/2024/01/img3_v2.jpg" alt="Figure 3: List of parameters and values in the CloudFormation stack" width="740" class="size-full wp-image-33226">
<p id="caption-attachment-33226" class="wp-caption-text">Figure 3: List of parameters and values in the CloudFormation stack</p>
</div> </li>
<li>Wait until the stack status changes to <strong>CREATE_COMPLETE</strong>.</li>
</ol>
<p>The deployment process should take approximately 15 minutes to finish.</p>
<h3>Run an AWS DMS task</h3>
<p>To extract the data from the Amazon RDS instance, you must run an AWS DMS task. This makes the data available to Macie in an S3 bucket in Parquet format.</p>
<ol>
<li>Open the <a href="https://console.aws.amazon.com/dms/v2/" target="_blank" rel="noopener">AWS DMS console</a>.</li>
<li>On the navigation bar, for the <strong>Migrate data</strong> option, select <strong>Database migration tasks</strong>.</li>
<li>Select the task with the name <strong>rdstos3task</strong>.</li>
<li>Choose <strong>Actions</strong>.</li>
<li>Choose <strong>Restart/Resume</strong>. The loading process should take around 1 minute.</li>
</ol>
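<p>If you prefer to run the task programmatically instead of using the preceding console steps, the following is a minimal boto3 sketch. The task ARN shown is a placeholder; look it up in the DMS console or with <span>describe_replication_tasks</span>.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import boto3

dms = boto3.client("dms")

# Placeholder ARN for the rdstos3task replication task.
task_arn = "arn:aws:dms:REGION:ACCOUNT_ID:task:TASK_ID"

dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="reload-target",  # equivalent to Restart in the console
)</code></pre>
</div>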
<p>When the status changes to <strong>Load Complete</strong>, you will be able to see the migrated data in the target bucket (<span>dcp-macie-&lt;aws_region&gt;-&lt;account_id&gt;</span>) in the dataset folder. Within each prefix there will be a Parquet file that follows the naming pattern <span>LOAD00000001.parquet</span>. After this step, use Macie to scan the data for sensitive information in the files.</p>
<h3>Run a classification job with Macie </h3>
<p>You must create a data classification job before you can evaluate the contents of the bucket. The job you create will run and evaluate the full contents of your S3 bucket to determine whether the files stored in the bucket contain PII. This job uses the managed identifiers available in Macie and a custom identifier.</p>
<ol>
<li>Open the <a href="https://console.aws.amazon.com/macie/" target="_blank" rel="noopener">Macie Console</a>, on the navigation bar, select <strong>Jobs</strong>.</li>
<li>Choose <strong>Create job</strong>.</li>
<li>Select the S3 bucket <strong>dcp-macie-&lt;aws_region&gt;-&lt;account_id&gt;</strong> containing the output of the AWS DMS task. Choose <strong>Next</strong> to continue.</li>
<li>On the <strong>Review Bucket</strong> page, verify that the selected bucket is <strong>dcp-macie-&lt;aws_region&gt;-&lt;account_id&gt;</strong>, and then choose <strong>Next</strong>.</li>
<li>In <strong>Refine the scope</strong>, create a new job with the following scope:
<ol>
<li><strong>Sensitive data discovery options: One-time job</strong> (for demonstration purposes, this is a single discovery job. For production environments, we recommend selecting the <strong>Scheduled job</strong> option, so Macie can analyze objects on a schedule).</li>
<li><strong>Sampling Depth: 100 percent</strong>.</li>
<li>Leave the other settings at their default values.</li>
</ol> </li>
<li>On <strong>Managed data identifiers options</strong>, select <strong>All</strong> so Macie can use all managed data identifiers. This enables a set of built-in criteria to detect all identified types of sensitive data. Choose <strong>Next</strong>.</li>
<li>On the <a href="https://docs.aws.amazon.com/macie/latest/APIReference/custom-data-identifiers-id.html" target="_blank" rel="noopener">Custom data identifiers</a> option, select <strong>account_number</strong>, and then choose <strong>Next</strong>. With the custom identifier, you can create custom business logic to look for certain patterns in files stored in Amazon S3. In this example, the task generates a discovery job for files that contain data with the following regular expression format <strong>XYZ-</strong> followed by numbers, which is the default format of the false <span>account_number</span> generated in the dataset. The logic used for creating this custom data identifier is included in the CloudFormation template file.</li>
<li>On the <strong>Select allow lists</strong> page, choose <strong>Next</strong> to continue.</li>
<li>Enter a name and description for the job.</li>
<li>Choose <strong>Next</strong> to continue.</li>
<li>On <strong>Review and create step</strong>, check the details of the job you created and choose <strong>Submit</strong>.
<div id="attachment_33206" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33206" src="https://infracom.com.sg/wp-content/uploads/2024/01/img4.jpg" alt="Figure 4: List of Macie findings detected by the solution" width="740" class="size-full wp-image-33206">
<p id="caption-attachment-33206" class="wp-caption-text">Figure 4: List of Macie findings detected by the solution</p>
</div> </li>
</ol>
<p>The amount of data being scanned directly influences how long the job takes to run. You can choose the <strong>Update</strong> button at the top of the screen, as shown in Figure 4, to see the updated status of the job. This job, based on the size of the test dataset, will take about 10 minutes to complete.</p>
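<p>If you prefer to create the classification job programmatically, the following is a minimal boto3 sketch of an equivalent one-time job. The job name is hypothetical, and the bucket name and custom data identifier ID are placeholders that you would look up in your account.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import boto3

macie = boto3.client("macie2")

macie.create_classification_job(
    jobType="ONE_TIME",
    name="dcp-classification-job",        # hypothetical job name
    samplingPercentage=100,
    managedDataIdentifierSelector="ALL",  # use all managed data identifiers
    customDataIdentifierIds=["CUSTOM_DATA_IDENTIFIER_ID"],
    s3JobDefinition={
        "bucketDefinitions": [
            {
                "accountId": "ACCOUNT_ID",
                "buckets": ["dcp-macie-REGION-ACCOUNT_ID"],
            }
        ]
    },
)</code></pre>
</div>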
<h3>Run the AWS Glue data transformation pipeline</h3>
<p>After the Macie job is finished, the discovery results are ingested into the bucket <span>dcp-glue-&lt;aws_region&gt;-&lt;account_id&gt;</span>, invoking the AWS Glue step of the workflow (<span>dcp-workflow</span>), which should take approximately 11 minutes to complete.</p>
<p>To check the workflow progress:</p>
<ol>
<li>Open the AWS Glue console, and on the navigation bar, select <a href="https://us-east-1.console.aws.amazon.com/glue/home?region=us-east-1#/v2/etl-configuration/workflows" target="_blank" rel="noopener">Workflows (orchestration)</a>.</li>
<li>Next, choose <strong>dcp-workflow</strong>.</li>
<li>Next, select <strong>History</strong> to see the past runs of the dcp-workflow.</li>
</ol>
<p>The AWS Glue job, which is launched as part of the workflow (<span>dcp-workflow</span>), reads the Macie findings to know the exact location of sensitive data. For example, in the <strong>customer</strong> table are <em>name</em> and <em>birthdate</em>. In the <strong>bank</strong> table are <em>account_number</em>, <em>iban</em>, and <em>bban</em>. And in the <strong>card</strong> table are <em>card_number, card_expiration</em>, and <em>card_security_code</em>. After this data is found, the job masks and encrypts the information.</p>
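<p>The masking step replaces each sensitive value with a fixed placeholder string, so the shape of the data is preserved without exposing the original values. The following is a minimal sketch of what such a function might look like; the placeholder string, the <span>columns_to_be_masked_and_encrypted</span> list, and the row format are assumptions, and the actual implementation is in the GitHub repository.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">MASK_VALUE = "####Masked####"  # illustrative placeholder string

def mask_rows(r):
    # Replace every column flagged by Macie with the fixed mask value.
    for entity in columns_to_be_masked_and_encrypted:
        if entity in r:
            r[entity] = MASK_VALUE
    return r</code></pre>
</div>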
<p>Text encryption is done using an AWS KMS key. Here is the code snippet that provides this functionality:</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">def encrypt_rows(r):
encrypted_entities = columns_to_be_masked_and_encrypted
try:
for entity in encrypted_entities:
if entity in table_columns:
encrypted_entity = get_kms_encryption(r[entity])
r[entity + '_encrypted'] = encrypted_entity.decode("utf-8")
del r[entity]
except:
print ("DEBUG:",sys.exc_info())
return r
def get_kms_encryption(row):
# Create a KMS client
session = boto3.session.Session()
client = session.client(service_name=’kms’,region_name=region_name)
try:
encryption_result = client.encrypt(KeyId=key_id, Plaintext=row)
blob = encryption_result['CiphertextBlob']
encrypted_row = base64.b64encode(blob)
return encrypted_row
except:
return 'Error on get_kms_encryption function'</code></pre>
</div>
<p>If your application requires access to the unencrypted text and has permission to use the AWS KMS encryption key, you can use the following example excerpt to decrypt the information:</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">client.decrypt(<em>CiphertextBlob</em>=base64.b64decode(data_encrypted))
print(decrypted[‘Plaintext’])
<p>After performing all the above steps, the datasets are fully anonymized with tables created in Data Catalog and data stored in the respective S3 buckets. These are the buckets where fine-grained access controls are applied through Lake Formation:</p>
<ul>
<li>Masked data — <code>s3://dcp-athena-&lt;aws_region&gt;-&lt;account_id&gt;/masked/</code></li>
<li>Encrypted data — <code>s3://dcp-athena-&lt;aws_region&gt;-&lt;account_id&gt;/encrypted/</code></li>
</ul>
<p>Now that the tables are defined, you refine the permissions using Lake Formation.</p>
<h3>Enable Lake Formation fine-grained access</h3>
<p>After the data is processed and stored, you use Lake Formation to define and enforce fine-grained access permissions and provide secure access to data analysts and data scientists.</p>
<h4>To enable fine-grained access, you first add a user (<span>secure-lf-admin</span>) to Lake Formation:</h4>
<ol>
<li>In the Lake Formation console, clear <strong>Add myself</strong> and select <strong>Add other AWS users or roles</strong>.</li>
<li>From the drop-down menu, select <strong>secure-lf-admin</strong>.</li>
<li>Choose <strong>Get started</strong>.
<div id="attachment_33212" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33212" src="https://infracom.com.sg/wp-content/uploads/2024/01/img5.jpg" alt="Figure 5: Lake Formation deployment process" width="740" class="size-full wp-image-33212" />
<p id="caption-attachment-33212" class="wp-caption-text">Figure 5: Lake Formation deployment process</p>
</div> </li>
</ol>
<h3>Grant access to different personas</h3>
<p>Before you grant permissions to different user personas, you must register Amazon S3 locations in Lake Formation so that the personas can access the data. All buckets have been created with the pattern <span>&lt;prefix&gt;-&lt;aws_region&gt;-&lt;account_id&gt;</span>, where <span>&lt;prefix&gt;</span> matches the prefix you selected when you deployed the CloudFormation template, <span>&lt;aws_region&gt;</span> corresponds to the selected AWS Region (for example, ap-southeast-1), and <span>&lt;account_id&gt;</span> is the 12-digit number that matches your AWS account (for example, 123456789012). For ease of reading, we left only the initial part of the bucket name in the following instructions.</p>
<ol>
<li>In the Lake Formation console, on the navigation bar, on the <strong>Register and ingest</strong> option, select <strong>Data Lake locations</strong>.</li>
<li>Choose <strong>Register location</strong>.</li>
<li>Select the <strong>dcp-glue</strong> bucket and choose <strong>Register Location</strong>.</li>
<li>Repeat for the <strong>dcp-macie/dataset</strong>, <strong>dcp-athena/masked</strong>, and <strong>dcp-athena/encrypted</strong> prefixes.
<div id="attachment_33213" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33213" loading="lazy" src="https://infracom.com.sg/wp-content/uploads/2024/01/img6.jpg" alt="Figure 6: Amazon S3 locations registered in the solution" width="471" height="352" class="size-full wp-image-33213" />
<p id="caption-attachment-33213" class="wp-caption-text">Figure 6: Amazon S3 locations registered in the solution</p>
</div> </li>
</ol>
<p>You’re now ready to grant access to different users.</p>
<h2>Granting per-user granular access</h2>
<p>After successfully deploying the AWS services described in the CloudFormation template, you must configure access to resources that are part of the proposed solution.</p>
<h3>Grant read-only access to all tables for secure-lf-admin</h3>
<p>Before proceeding you must sign in as the secure-lf-admin user. To do this, sign out from the AWS console and sign in again using the secure-lf-admin credential and password that you set in the CloudFormation template.</p>
<p>Now that you’re signed in as the user who administers the data lake, you can grant read-only access to all tables in the dataset database to the secure-lf-admin user.</p>
<ol>
<li>In the <strong>Permissions</strong> section, select <strong>Data Lake permissions</strong>, and then choose <strong>Grant</strong>.</li>
<li>Select <strong>IAM users and roles</strong>.</li>
<li>Select the <strong>secure-lf-admin</strong> user.</li>
<li>Under <strong>LF-Tags or catalog resources</strong>, select <strong>Named data catalog resources</strong>.</li>
<li>Select the database <strong>dataset</strong>.</li>
<li>For <strong>Tables</strong>, select <strong>All tables</strong>.</li>
<li>In the <strong>Table permissions</strong> section, select <strong>Alter</strong> and <strong>Super</strong>.</li>
<li>Under <strong>Grantable permissions</strong>, select <strong>Alter</strong> and <strong>Super</strong>.</li>
<li>Choose <strong>Grant</strong>.</li>
</ol>
<p>You can confirm your user permissions on the Data Lake permissions page.</p>
<h3>Create tags to grant access</h3>
<p>Return to the Lake Formation console to define tag-based access control for users. You can assign policy tags to Data Catalog resources (databases, tables, and columns) to control access to these resources. Only users who are granted the corresponding Lake Formation tag (and those who are granted access with the <em>named resource</em> method) can access the resources.</p>
<ol>
<li>Open the Lake Formation console, then on the navigation bar, under <strong>Permissions</strong>, select <strong>LF-tags</strong>.</li>
<li>Choose <strong>Add LF Tag</strong>. In the dialog box <strong>Add LF-tag</strong>, for <strong>Key</strong>, enter data, and for <strong>Values</strong>, enter mask. Choose <strong>Add</strong>, and then choose <strong>Add LF-Tag</strong>.</li>
<li>Follow the same steps to add a second tag. For <strong>Key</strong>, enter segment, and for <strong>Values</strong> enter campaign.</li>
</ol>
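<p>The same two LF-Tags can also be created programmatically. The following is a minimal boto3 sketch of the equivalent calls.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import boto3

lakeformation = boto3.client("lakeformation")

# Equivalent to the two LF-Tags created in the console steps above.
lakeformation.create_lf_tag(TagKey="data", TagValues=["mask"])
lakeformation.create_lf_tag(TagKey="segment", TagValues=["campaign"])</code></pre>
</div>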
<h3>Assign tags to users and databases</h3>
<p>Now grant read-only access to the masked data to the secure-lf-data-scientist user.</p>
<ol>
<li>In the Lake Formation console, on the navigation bar, under <strong>Permissions</strong>, select <strong>Data Lake permissions</strong>. </li>
<li>Choose <strong>Grant</strong>.</li>
<li>Under <strong>IAM users and roles</strong>, select <strong>secure-lf-data-scientist</strong> as the user.</li>
<li>In the <strong>LF-Tags or catalog resources</strong> section, select <strong>Resources matched by LF-Tags</strong> and choose <strong>add LF-Tag</strong>. For <strong>Key</strong>, enter data and for <strong>Values</strong>, enter mask.
<div id="attachment_33214" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33214" src="https://infracom.com.sg/wp-content/uploads/2024/01/img7.jpg" alt="Figure 7: Creating resource tags for Lake Formation" width="740" class="size-full wp-image-33214" />
<p id="caption-attachment-33214" class="wp-caption-text">Figure 7: Creating resource tags for Lake Formation</p>
</div> </li>
<li>In the <strong>Database permissions</strong> section, select <strong>Describe</strong> for both <strong>Database permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>In the <strong>Table permissions</strong> section, select <strong>Select</strong> for both <strong>Table permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>Choose <strong>Grant</strong>.
<div id="attachment_33215" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33215" src="https://infracom.com.sg/wp-content/uploads/2024/01/img8.jpg" alt="Figure 8: Database and table permissions granted" width="740" class="size-full wp-image-33215" />
<p id="caption-attachment-33215" class="wp-caption-text">Figure 8: Database and table permissions granted</p>
</div> </li>
</ol>
<p>To complete the process and give the secure-lf-data-scientist user access to the dataset_masked database, you must assign the tag you created to the database.</p>
<ol>
<li>On the navigation bar, under <strong>Data Catalog</strong>, select <strong>Databases</strong>.</li>
<li>Select <strong>dataset_masked</strong> and select <strong>Actions</strong>. From the drop-down menu, select <strong>Edit LF-Tags</strong>.</li>
<li>In the section <strong>Edit LF-Tags: dataset_masked</strong>, choose <strong>Assign new LF-Tag</strong>. For <strong>Key</strong>, enter data, and for <strong>Values</strong>, enter mask. Choose <strong>Save</strong>.</li>
</ol>
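<p>The tag assignment and the tag-based grant you just configured can also be made with boto3. The following is a minimal sketch; the IAM user ARN is a placeholder.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import boto3

lakeformation = boto3.client("lakeformation")

# Assign the data=mask LF-Tag to the dataset_masked database.
lakeformation.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "dataset_masked"}},
    LFTags=[{"TagKey": "data", "TagValues": ["mask"]}],
)

# Grant SELECT on all tables matching data=mask to secure-lf-data-scientist.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::ACCOUNT_ID:user/secure-lf-data-scientist"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "data", "TagValues": ["mask"]}],
        }
    },
    Permissions=["SELECT"],
)</code></pre>
</div>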
<h3>Grant read-only access to secure-lf-business-analyst</h3>
<p>Now grant the secure-lf-business-analyst user read-only access to certain encrypted columns using <em>column-based permissions</em>.</p>
<ol>
<li>In the Lake Formation console, under <strong>Data Catalog</strong>, select <strong>Databases.</strong></li>
<li>Select the database <strong>dataset_encrypted</strong> and then select <strong>Actions</strong>. From the drop-down menu, choose <strong>Grant</strong>.</li>
<li>Select <strong>IAM users and roles</strong>.</li>
<li>Choose <strong>secure-lf-business-analyst</strong>.</li>
<li>In the <strong>LF-Tags or catalog resources</strong> section, select <strong>Named data catalog resources</strong>.</li>
<li>In the <strong>Database permissions</strong> section, select <strong>Describe</strong> and <strong>Alter</strong> for both <strong>Database permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>Choose <strong>Grant</strong>.</li>
</ol>
<p>Now give the secure-lf-business-analyst user access to the Customer table, except for the columns that contain PII (such as name and birthdate).</p>
<ol>
<li>In the Lake Formation console, under <strong>Data Catalog</strong>, select <strong>Databases</strong>.</li>
<li>Select the database <strong>dataset_encrypted</strong> and then, choose <strong>View tables</strong>.</li>
<li>From the <strong>Actions</strong> option in the drop-down menu, select <strong>Grant</strong>.</li>
<li>Select <strong>IAM users and roles</strong>.</li>
<li>Select <strong>secure-lf-business-analyst</strong>.</li>
<li>In the <strong>LF-Tags or catalog resources</strong> part, select <strong>Named data catalog resources</strong>.</li>
<li>In the Database section, leave the <strong>dataset_encrypted</strong> selected.</li>
<li>In the tables section, select the <strong>customer</strong> table.</li>
<li>In the <strong>Table permissions</strong> section, choose <strong>Select</strong> for both <strong>Table permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>In the <strong>Data Permissions</strong> section, select <strong>Column-based access</strong>.</li>
<li>Select <strong>Include columns</strong>, and then select the <strong>id</strong>, <strong>username</strong>, <strong>mail</strong>, and <strong>gender</strong> columns, which are the columns that don’t contain encrypted data, so that the <strong>secure-lf-business-analyst</strong> user has access to them.</li>
<li>Choose <strong>Grant</strong>.
<div id="attachment_33216" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33216" src="https://infracom.com.sg/wp-content/uploads/2024/01/img9.jpg" alt="Figure 9: Granting access to secure-lf-business-analyst user in the Customer table" width="740" class="size-full wp-image-33216" />
<p id="caption-attachment-33216" class="wp-caption-text">Figure 9: Granting access to secure-lf-business-analyst user in the Customer table</p>
</div> </li>
</ol>
<p>Now give the secure-lf-business-analyst user access to the Card table, but only for columns that do not contain PII.</p>
<ol>
<li>In the Lake Formation console, under <strong>Data Catalog</strong>, choose <strong>Databases</strong>.</li>
<li>Select the database <strong>dataset_encrypted</strong> and choose <strong>View tables</strong>.</li>
<li>Select the table <strong>Card</strong>.</li>
<li>In the <strong>Schema</strong> section, choose <strong>Edit schema</strong>.</li>
<li>Select the <strong>cred_card_provider</strong> column, which is the column that has no PII data.</li>
<li>Choose <strong>Edit tags</strong>.</li>
<li>Choose <strong>Assign new LF-Tag</strong>.</li>
<li>For <strong>Assigned keys</strong>, enter <span>segment</span>, and for <strong>Values</strong>, enter <span>campaign</span>.
<div id="attachment_33217" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33217" src="https://infracom.com.sg/wp-content/uploads/2024/01/img10.jpg" alt="Figure 10: Editing tags in Lake Formation tables" width="740" class="size-full wp-image-33217" />
<p id="caption-attachment-33217" class="wp-caption-text">Figure 10: Editing tags in Lake Formation tables</p>
</div> </li>
<li>Choose <strong>Save</strong>, and then choose <strong>Save as new version</strong>.</li>
</ol>
<p>In this step you added the segment tag to the <span>cred_card_provider</span> column of the card table. For the secure-lf-business-analyst user to have access, you need to grant this tag to the user.</p>
<ol>
<li>In the Lake Formation console, under <strong>Permissions</strong>, select <strong>Data Lake permissions</strong>.</li>
<li>Choose <strong>Grant</strong>.</li>
<li>Under <strong>IAM users and roles</strong>, select <strong>secure-lf-business-analyst</strong> as the user.</li>
<li>In the <strong>LF-Tags or catalog resources</strong> section, select <strong>Resources matched by LF-Tags</strong>, and choose <strong>Add LF-Tag</strong>. For <strong>Key</strong>, enter <span>segment</span>, and for <strong>Values</strong>, enter <span>campaign</span>.
<div id="attachment_33218" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33218" src="https://infracom.com.sg/wp-content/uploads/2024/01/img11.jpg" alt="Figure 11: Configure tag-based access for user secure-lf-business-analyst" width="740" class="size-full wp-image-33218" />
<p id="caption-attachment-33218" class="wp-caption-text">Figure 11: Configure tag-based access for user secure-lf-business-analyst</p>
</div> </li>
<li>In the <strong>Database permissions</strong> section, select <strong>Describe</strong> for both <strong>Database permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>In the <strong>Table permissions</strong> section, choose <strong>Select</strong> for both <strong>Table permissions</strong> and <strong>Grantable permissions</strong>.</li>
<li>Choose <strong>Grant</strong>.</li>
</ol>
<p>The next step is to revoke <em>Super</em> access to the IAMAllowedPrincipals group.</p>
<p>The IAMAllowedPrincipals group includes all IAM users and roles that are allowed access to Data Catalog resources through IAM policies. The Super permission allows a principal to perform every operation supported by Lake Formation on the database or table on which it is granted. These settings provide access to Data Catalog resources and Amazon S3 locations controlled exclusively by IAM policies, so the individual permissions configured in Lake Formation are not considered. You will therefore remove the grants already configured for the IAMAllowedPrincipals group, leaving only the Lake Formation settings.</p>
<ol>
<li>In the <strong>Databases</strong> menu, select the database <strong>dataset</strong>, and then select <strong>Actions</strong>. From the drop-down menu, select <strong>Revoke</strong>.</li>
<li>In the <strong>Principals</strong> section, select <strong>IAM users and roles</strong>, and then select the <strong>IAMAllowedPrincipals</strong> group as the user.</li>
<li>Under <strong>LF-Tags or catalog resources</strong>, select <strong>Named data catalog resources</strong>.</li>
<li>In the <strong>Database</strong> section, leave the <strong>dataset</strong> option selected.</li>
<li>Under <strong>Tables</strong>, select the following tables: <strong>bank</strong>, <strong>card</strong>, and <strong>customer</strong>.</li>
<li>In the <strong>Table permissions</strong> section, select <strong>Super</strong>.</li>
<li>Choose <strong>Revoke</strong>.</li>
</ol>
<p>Repeat the same steps for the <strong>dataset_encrypted</strong> and <strong>dataset_masked </strong>databases.</p>
<div id="attachment_33219" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33219" src="https://infracom.com.sg/wp-content/uploads/2024/01/img12.jpg" alt="Figure 12: Revoke SUPER access to the IAMAllowedPrincipals group" width="780" class="size-full wp-image-33219" />
<p id="caption-attachment-33219" class="wp-caption-text">Figure 12: Revoke SUPER access to the <strong>IAMAllowedPrincipals</strong> group</p>
</div>
<p>You can confirm all user permissions on the <strong>Data Permissions</strong> page.</p>
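<p>Because the revocation has to be repeated for several databases and tables, a scripted approach can help. The following is a minimal boto3 sketch, assuming the table names shown earlier exist in all three databases; Super in the console corresponds to ALL in the API.</p>
<div class="hide-language">
<pre class="unlimited-height-code"><code class="lang-text">import boto3

lakeformation = boto3.client("lakeformation")

# IAMAllowedPrincipals is addressed with the special principal identifier below.
for database in ["dataset", "dataset_masked", "dataset_encrypted"]:
    for table in ["bank", "card", "customer"]:
        lakeformation.revoke_permissions(
            Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
            Resource={"Table": {"DatabaseName": database, "Name": table}},
            Permissions=["ALL"],  # "Super" in the console
        )</code></pre>
</div>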
<h2>Querying the data lake using Athena with different personas</h2>
<p>To validate the permissions of different personas, you use Athena to query the Amazon S3 data lake.</p>
<p>Ensure the query result location has been created as part of the CloudFormation stack (<span>secure-athena-query-&lt;aws_region&gt;-&lt;account_id&gt;</span>).</p>
<ol>
<li>Sign in to the Athena console with <strong>secure-lf-admin</strong> (use the password value for TestUserPassword from the CloudFormation stack) and verify that you are in the AWS Region used in the query result location.</li>
<li>On the navigation bar, choose <strong>Query editor</strong>.</li>
<li>Choose <strong>Settings</strong> to set up a query result location in Amazon S3, then choose <strong>Browse S3</strong> and select the bucket <strong>secure-athena-query-&lt;aws_region&gt;-&lt;account_id&gt;</strong>.</li>
<li>Run a SELECT query on the dataset.
<div class="hide-language">
<pre class="unlimited-height-code"><code>SELECT * FROM "dataset"."bank" limit 10;</code></pre>
</div> </li>
</ol>
<p>The <strong>secure-lf-admin</strong> user should see all tables in the <strong>dataset</strong> database and in <strong>dcp</strong>. As for the <strong>dataset_encrypted</strong> and <strong>dataset_masked</strong> databases, the user should not have access to the tables.</p>
<div id="attachment_33221" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33221" src="https://infracom.com.sg/wp-content/uploads/2024/01/img13.jpg" alt="Figure 13: Athena console with query results in clear text" width="780" class="size-full wp-image-33221" />
<p id="caption-attachment-33221" class="wp-caption-text">Figure 13: Athena console with query results in clear text</p>
</div>
<p>Finally, validate the <strong>secure-lf-data-scientist</strong> permissions.</p>
<ol>
<li>Sign in to the Athena console with <strong>secure-lf-data-scientist</strong> (use the password value for TestUserPassword from the CloudFormation stack) and verify that you are in the correct Region.</li>
<li>Run the following query:
<div class="hide-language">
<pre class="unlimited-height-code"><code>SELECT * FROM “dataset_masked”.”bank” limit 10;</code></pre>
</div> </li>
</ol>
<p>The <strong>secure-lf-data-scientist</strong> user will be able to view all the columns, but only in the <strong>dataset_masked</strong> database.</p>
<div id="attachment_33222" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33222" src="https://infracom.com.sg/wp-content/uploads/2024/01/img14.jpg" alt="Figure 14: Athena query results with masked data" width="780" class="size-full wp-image-33222" />
<p id="caption-attachment-33222" class="wp-caption-text">Figure 14: Athena query results with masked data</p>
</div>
<p>Now, validate the <strong>secure-lf-business-analyst</strong> user permissions.</p>
<ol>
<li>Sign in to the Athena console as <strong>secure-lf-business-analyst</strong> (use the password value for TestUserPassword from the CloudFormation stack) and verify that you are in the correct Region.</li>
<li>Run a SELECT query on the dataset.
<div class="hide-language">
<pre class="unlimited-height-code"><code>SELECT * FROM “dataset_encrypted”.”card” limit 10;</code></pre>
</div>
<div id="attachment_33223" class="wp-caption aligncenter">
<img aria-describedby="caption-attachment-33223" src="https://infracom.com.sg/wp-content/uploads/2024/01/img15.jpg" alt="Figure 15: Validating secure-lf-business-analyst user permissions to query data" width="740" class="size-full wp-image-33223" />
<p id="caption-attachment-33223" class="wp-caption-text">Figure 15: Validating secure-lf-business-analyst user permissions to query data</p>
</div> </li>
</ol>
<p>The secure-lf-business-analyst user should only be able to view the card and customer tables of the dataset_encrypted database. In the card table, the user will only have access to the cred_card_provider column, and in the Customer table, the user will have access only to the username, mail, and gender columns, as previously configured in Lake Formation.</p>
<h2>Cleaning up the environment</h2>
<p>After testing the solution, remove the resources you created to avoid unnecessary expenses.</p>
<ol>
<li>Open the Amazon S3 console.</li>
<li>Navigate to each of the following buckets and delete all the objects within:
<ol>
<li>dcp-assets-&lt;aws_region&gt;-&lt;account_id&gt;</li>
<li>dcp-athena-&lt;aws_region&gt;-&lt;account_id&gt;</li>
<li>dcp-glue-&lt;aws_region&gt;-&lt;account_id&gt;</li>
<li>dcp-macie-&lt;aws_region&gt;-&lt;account_id&gt;</li>
</ol> </li>
<li>Open the CloudFormation console.</li>
<li>Select the <strong>Stacks</strong> option from the navigation bar.</li>
<li>Select the stack that you created in <a href="https://aws.amazon.com/blogs/security/data-masking-and-granular-access-control-using-amazon-macie-and-aws-lake-formation/#deploy_the_cloudformation_template">Deploy the CloudFormation Template</a>.</li>
<li>Choose <strong>Delete</strong>, and then choose <strong>Delete Stack</strong> in the pop-up window.</li>
<li>If you also want to delete the bucket that was created, go to Amazon S3 and delete it from the console or by using the <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/delete-bucket.html" target="_blank" rel="noopener">AWS CLI</a>.</li>
<li>To remove the settings made in Lake Formation, go to the Lake Formation dashboard, and remove the data lake locations and the Lake Formation administrator.</li>
</ol>
<h2>Conclusion </h2>
<p>Now that the solution is implemented, you have an automated anonymization dataflow. This solution demonstrates how you can build with AWS serverless services, where you pay only for what you use and don’t need to worry about infrastructure provisioning. In addition, this solution can be customized to meet other data protection requirements such as the General Data Protection Law (LGPD) in Brazil, the General Data Protection Regulation (GDPR) in Europe, and the Association of Banks in Singapore (ABS) Cloud Computing Implementation Guide.</p>
<p>We used Macie to identify the sensitive data stored in Amazon S3 and AWS Glue to consume the Macie findings and anonymize the sensitive data found. Finally, we used Lake Formation to implement fine-grained access control to specific information and demonstrated how you can programmatically grant access to applications that need to work with unmasked data.</p>
<p>If you have feedback about this post, submit comments in the <strong>Comments</strong> section below. If you have questions about this post, <a href="https://console.aws.amazon.com/support/home" target="_blank" rel="noopener noreferrer">contact AWS Support</a>.</p>
<p><strong>Want more AWS Security news? Follow us on <a title="Twitter" href="https://twitter.com/AWSsecurityinfo" target="_blank" rel="noopener noreferrer">Twitter</a>.</strong></p>
<!-- '"` -->