<h1>How to implement client certificate revocation list checks at scale with API Gateway</h1>
<p>As you design your Amazon API Gateway applications to rely on mutual TLS authentication (mTLS), you need to consider how your application will verify the revocation status of a client certificate. In your design, you should account for the performance and availability of your verification mechanism to make sure that your application endpoints perform reliably.</p>
<p>In this blog post, I demonstrate an architecture that will help you on your journey to implement custom revocation checks against your certificate revocation list (CRL) for API Gateway. You will also learn advanced <a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener">Amazon Simple Storage Service (Amazon S3)</a> and <a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener">AWS Lambda</a> techniques to achieve higher performance and scalability.</p>
<h2>Choosing the right certificate verification method</h2>
<p>One of your first considerations is whether to use a <a href="https://www.rfc-editor.org/rfc/pdfrfc/rfc5280.txt.pdf" target="_blank" rel="noopener">CRL</a> or the <a href="https://www.rfc-editor.org/rfc/pdfrfc/rfc6960.txt.pdf" target="_blank" rel="noopener">Online Certificate Status Protocol (OCSP)</a>, if your certificate authority (CA) offers this option. For an in-depth analysis of these two options, see my earlier blog post, <a href="https://aws.amazon.com/blogs/security/choosing-the-right-certificate-revocation-method-in-acm-private-ca/" target="_blank" rel="noopener">Choosing the right certificate revocation method in ACM Private CA</a>. In that post, I demonstrated that OCSP is a good choice when your application can tolerate higher latency or occasional verification failures caused by connectivity problems between the TLS service and the OCSP responder. When you rely on mutual TLS authentication in a high-rate transactional environment, increased latency or OCSP reachability failures may affect your application. We strongly recommend that you validate the revocation status of your mutual TLS certificates. Verifying your client certificate status against the CRL is the correct approach for certificate verification if you require reliability and lower, predictable latency. A potential exception to this approach is the use case of <a href="https://aws.amazon.com/private-ca/" target="_blank" rel="noopener">AWS Certificate Manager Private Certificate Authority (AWS Private CA)</a> with an OCSP responder hosted on <a href="https://aws.amazon.com/cloudfront/" target="_blank" rel="noopener">Amazon CloudFront</a>.</p>
<p>With an AWS Private CA OCSP responder hosted on CloudFront, you can reduce the risks of network and latency challenges by relying on communication between AWS native services. While this post focuses on the solution that targets CRLs originating from any CA, if you use AWS Private CA with an OCSP responder, you should consider generating an OCSP request in your <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html" target="_blank" rel="noopener">Lambda authorizer</a>.</p>
<h2>Mutual authentication with API Gateway</h2>
<p><a href="https://aws.amazon.com/blogs/compute/introducing-mutual-tls-authentication-for-amazon-api-gateway/" target="_blank" rel="noopener">API Gateway mutual TLS authentication</a> (mTLS) requires you to define a root of trust that will contain your certificate authority public key. During the mutual TLS authentication process, API Gateway performs the undifferentiated heavy lifting by offloading the certificate authentication and negotiation process. During the authentication process, API Gateway validates that your certificate is trusted, has valid dates, and uses a supported algorithm. Additionally, you can refer to the <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/rest-api-mutual-tls.html" target="_blank" rel="noopener">API Gateway documentation</a> and related <a href="https://aws.amazon.com/blogs/compute/introducing-mutual-tls-authentication-for-amazon-api-gateway/" target="_blank" rel="noopener">blog post</a> for details about the mutual TLS authentication process on API Gateway.</p>
<h2>Implementing mTLS certificate verification for API Gateway</h2>
<p>In the remainder of this blog post, I’ll describe the architecture for a scalable implementation of a client certificate verification mechanism against a CRL on your API Gateway.</p>
<p>The certificate CRL verification process presented here relies on a custom Lambda authorizer that validates the certificate revocation status against the CRL. The Lambda authorizer caches CRL data to optimize the query time for subsequent requests and allows you to define custom business logic that could go beyond CRL verification. For example, you could include other, just-in-time authorization decisions as a part of your evaluation logic.</p>
<h3>Implementation mechanisms</h3>
<p>This section describes the implementation mechanisms that help you create a high-performing extension to the API Gateway mutual TLS authentication process.</p>
<h4>Data repository for your certificate revocation list</h4>
<p>API Gateway mutual TLS configuration uses Amazon S3 as a repository for your root of trust. The design for this sample implementation extends the use of S3 buckets to store your CRL and the public key for the certificate authority that signed the CRL.</p>
<p>We strongly recommend that you maintain an updated CRL and verify its signature before data processing. This process is automatic if you use AWS Private CA, because AWS Private CA will update your CRL automatically on revocation. AWS Private CA also allows you to retrieve the CA’s public key by using an API call.</p>
<h4>Certificate validation</h4>
<p>My sample implementation architecture uses the API Gateway <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html" target="_blank" rel="noopener">Lambda authorizer</a> to validate the serial number of the client certificate used in the mutual TLS authentication session against the list of serial numbers present in the CRL you publish to the S3 bucket. In the process, the API Gateway custom authorizer will read the client certificate serial number, read and validate the CRL’s digital signature, search for the client’s certificate serial number within the CRL, and return the authorization policy based on the findings.</p>
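<p>As an illustration of the last step, the following is a minimal sketch of how an authorizer might turn a revocation finding into an authorization policy. It assumes a REST API <code>REQUEST</code> authorizer whose event carries the client certificate details under <code>requestContext.identity.clientCert</code>; the function names and the <code>is_revoked</code> callback are hypothetical and stand in for the CRL lookup described later in this post.</p>

```python
from typing import Any, Callable, Dict


def build_policy(principal_id: str, effect: str, method_arn: str) -> Dict[str, Any]:
    """Return an authorizer response in the IAM policy format API Gateway expects."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": method_arn,
                }
            ],
        },
    }


def authorize(event: Dict[str, Any], is_revoked: Callable[[str], bool]) -> Dict[str, Any]:
    """Deny when the serial number is missing or revoked; otherwise allow."""
    method_arn = event.get("methodArn", "*")
    # Assumption: the event is a REST API REQUEST authorizer event that carries
    # the mTLS client certificate under requestContext.identity.clientCert.
    client_cert = (
        event.get("requestContext", {}).get("identity", {}).get("clientCert") or {}
    )
    serial = client_cert.get("serialNumber")
    if not serial:
        # Fail closed: without a serial number there is nothing to verify.
        return build_policy("unknown", "Deny", method_arn)
    effect = "Deny" if is_revoked(serial) else "Allow"
    return build_policy(client_cert.get("subjectDN", serial), effect, method_arn)
```

<p>The sketch fails closed: a request without a readable serial number receives a “Deny” policy, which matches the intent of denying certificates that are revoked or in error.</p>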
<h4>Optimizing for performance</h4>
<p>The mechanisms that enable a predictable, low-latency performance are CRL preprocessing and caching. Your CRL is an ASN.1 data structure that requires a relatively high computing time for processing. Preprocessing your CRL into a simple-to-parse data structure reduces the computational cost you would otherwise incur for every validation; caching the CRL will help you reduce the validation latency and improve predictability further.</p>
<h2>Performance optimizations</h2>
<p>The process of parsing and validating CRLs is computationally expensive. In the case of large CRL files, parsing the CRL in the Lambda authorizer on every request can result in high latency and timeouts. To improve latency and reduce compute costs, this solution optimizes for performance by preprocessing the CRL and implementing function-level caching.</p>
<h3>Preprocessing and generation of a cached CRL file</h3>
<p>The first optimization happens when S3 receives a new CRL object. As shown in Figure 1, the S3 PutObject event invokes a preprocessing Lambda function that validates the signature of your uploaded CRL and decodes its ASN.1 format. The output of the preprocessing Lambda function is the list of the revoked certificate serial numbers from the CRL, in a data structure that is simpler to read in your programming language of choice, and that won’t require extensive parsing by your Lambda authorizer. The asynchronous approach mitigates the impact of CRL processing on your API Gateway workload.</p>
<div id="attachment_32452" class="wp-caption alignnone">
<img aria-describedby="caption-attachment-32452" src="https://infracom.com.sg/wp-content/uploads/2023/12/Figure1-1.jpg" alt="Figure 1: Sample implementation flow of the pre-processing component" width="1297" height="582" class="size-full wp-image-32452">
<p id="caption-attachment-32452" class="wp-caption-text">Figure 1: Sample implementation flow of the pre-processing component</p>
</div>
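<p>The following sketch shows two pieces of such a preprocessing function: locating the uploaded CRL from the S3 event notification, and serializing the revoked serial numbers into a simple JSON document. It is an illustration under stated assumptions, not the sample implementation itself. Signature validation and ASN.1 decoding are assumed to have already happened (for example, with an X.509 library of your choice) and to have produced integer serial numbers; the function names and the <code>revoked</code> field name are hypothetical.</p>

```python
import json
from typing import Any, Dict, Iterable, List, Tuple
from urllib.parse import unquote_plus


def crl_objects_from_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def build_cache_document(revoked_serials: Iterable[int]) -> str:
    """Serialize revoked serial numbers as a sorted JSON list of lowercase hex strings.

    Assumption: the serial numbers come from a CRL whose signature was
    already verified against the issuing CA certificate.
    """
    serials = sorted({format(serial, "x") for serial in revoked_serials})
    # Compact separators keep the cached object small for large CRLs.
    return json.dumps({"revoked": serials}, separators=(",", ":"))
```

<p>Storing the serial numbers as plain strings means that the authorizer only needs a JSON parser to read them, which is much cheaper than decoding the original ASN.1 structure on each cold start.</p>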
<h3>Client certificate lookup in a CRL</h3>
<p>The second optimization happens in your Lambda authorizer, which retrieves the preprocessed CRL data generated in the first step and searches through the data structure for your client certificate serial number. If the Lambda authorizer finds your client’s certificate serial number in the CRL, the authorization request fails, and the Lambda authorizer generates a “Deny” policy. Searching through a read-optimized data structure prepared by your preprocessing step reduces the lookup time and the compute requirements.</p>
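<p>One possible read-optimized structure is a hash set, which gives constant-time membership checks on average regardless of CRL size. The sketch below assumes the preprocessed data is a JSON document with a hypothetical <code>revoked</code> list of hexadecimal serial number strings. The normalization step is also an assumption; you should confirm the exact serial number format that your authorizer receives and adjust it accordingly.</p>

```python
import json
from typing import FrozenSet


def normalize_serial(serial: str) -> str:
    """Normalize a hex serial: drop colon separators and leading zeros, lowercase."""
    cleaned = serial.replace(":", "").strip().lower().lstrip("0")
    return cleaned or "0"


def load_revoked(cache_json: str) -> FrozenSet[str]:
    """Build an immutable set of normalized serial numbers from the cached JSON."""
    document = json.loads(cache_json)
    return frozenset(normalize_serial(s) for s in document["revoked"])


def is_revoked(serial: str, revoked: FrozenSet[str]) -> bool:
    """Return True if the client certificate serial number appears in the CRL data."""
    return normalize_serial(serial) in revoked
```

<p>Normalizing both the cached values and the incoming serial number in the same way avoids false negatives caused by differences in case, separators, or zero padding.</p>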
<h3>Function-level caching</h3>
<p>Because of the preprocessing, the Lambda authorizer code no longer needs to perform the expensive operation of decoding the ASN.1 data structures of the original CRL; however, network transfer latency will remain and may impact your application.</p>
<p>The third optimization, function-level caching, reduces this network latency. The Lambda service retains the runtime environment of a recently run function for a non-deterministic period of time. If the function is invoked again during this period, it doesn’t have to initialize and can start running immediately. This is called a <em>warm start</em>. Function-level caching takes advantage of warm starts to hold the CRL data structure in memory between function invocations, so the Lambda function doesn’t have to download the preprocessed CRL data structure from S3 on every request.</p>
<p>How long a Lambda runtime environment stays warm depends on multiple factors, such as usage patterns and the number of parallel requests processed by your function. If your API use is infrequent or its usage pattern is spiky, <a href="https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html" target="_blank" rel="noopener">provisioned concurrency</a> is another technique that can further reduce your Lambda startup times and help keep your cache warm. Although provisioned concurrency does have additional costs, I recommend you evaluate its benefits for your specific environment. You can also check out the blog post dedicated to this topic, <em>Scheduling AWS Lambda Provisioned Concurrency for recurring peak usage</em>.</p>
<p>To validate that the Lambda authorizer has the latest copy of the CRL data structure, the S3 ETag value is used to determine if the object has changed. The preprocessed CRL object’s ETag value is stored as a Lambda global variable, so its value is retained between invocations in the same runtime environment. When API Gateway invokes the Lambda authorizer, the function checks for existing global preprocessed CRL data structure and ETag variables. The process will only retrieve a read-optimized CRL when the ETag is absent, or its value differs from the ETag of the preprocessed CRL object in S3.</p>
<p>Figure 2 demonstrates this process flow.</p>
<div id="attachment_32451" class="wp-caption alignnone">
<img aria-describedby="caption-attachment-32451" loading="lazy" src="https://infracom.com.sg/wp-content/uploads/2023/12/Figure2-1.jpg" alt="Figure 2: Sample implementation flow for the Lambda authorizer component" width="1777" height="905" class="size-full wp-image-32451">
<p id="caption-attachment-32451" class="wp-caption-text">Figure 2: Sample implementation flow for the Lambda authorizer component</p>
</div>
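<p>The ETag comparison described above can be sketched as follows. This is a minimal illustration, not the sample implementation: it assumes an S3 client object that exposes <code>head_object</code> and <code>get_object</code> in the style of the AWS SDK for Python (Boto3), and a cached JSON document with a hypothetical <code>revoked</code> list. The function and variable names are also hypothetical.</p>

```python
import json
from typing import Any, FrozenSet, Optional

# Module-level (global) state survives between invocations that reuse the
# same warm runtime environment.
_cached_etag: Optional[str] = None
_cached_serials: FrozenSet[str] = frozenset()


def get_revoked_serials(s3_client: Any, bucket: str, key: str) -> FrozenSet[str]:
    """Return the revoked serial numbers, downloading only when the ETag changed."""
    global _cached_etag, _cached_serials
    # A HEAD request returns only metadata, including the object's current ETag.
    current_etag = s3_client.head_object(Bucket=bucket, Key=key)["ETag"]
    if _cached_etag is None or current_etag != _cached_etag:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        document = json.loads(response["Body"].read())
        _cached_serials = frozenset(document["revoked"])
        # Record the ETag of the copy that was actually downloaded.
        _cached_etag = response["ETag"]
    return _cached_serials
```

<p>In a Lambda function, you would typically create the client once at module scope (for example, with <code>boto3.client("s3")</code>) and pass it in, so that the client is also reused across warm invocations. Passing the client as a parameter keeps the caching logic easy to test with a stub.</p>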
<p>In summary, you will have a Lambda container with a persistent in-memory lookup data structure for your CRL by doing the following:</p>
<ul>
<li>Asynchronously start your preprocessing workflow by using the S3 PutObject event so you can generate and store your preprocessed CRL data structure in a separate S3 object.</li>
<li>Read the preprocessed CRL from S3 and its ETag value and store both values in <a href="https://docs.aws.amazon.com/lambda/latest/operatorguide/global-scope.html" target="_blank" rel="noopener">global variables</a>.</li>
<li>Compare the value of the ETag stored in your global variables to the current ETag value of the preprocessed CRL S3 object, to reduce unnecessary downloads if the current ETag value of your S3 object is the same as the previous value.</li>
<li>We recommend that you avoid using built-in API Gateway Lambda authorizer result caching, because the status of your certificate might change, and your authorization decision would rest on out-of-date verification results.</li>
<li>Consider setting a <a href="https://docs.aws.amazon.com/lambda/latest/operatorguide/reserved-concurrency.html" target="_blank" rel="noopener">reserved concurrency</a> for your CRL verification function so that API Gateway can invoke your function even if the overall capacity for your account in your AWS Region is exhausted.</li>
</ul>
<p>The sample implementation flow diagram in Figure 3 demonstrates the overall architecture of the solution.</p>
<div id="attachment_32450" class="wp-caption alignnone">
<img aria-describedby="caption-attachment-32450" loading="lazy" src="https://infracom.com.sg/wp-content/uploads/2023/12/Figure3.jpg" alt="Figure 3: Sample implementation flow for the overall CRL verification architecture" width="1350" height="711" class="size-full wp-image-32450">
<p id="caption-attachment-32450" class="wp-caption-text">Figure 3: Sample implementation flow for the overall CRL verification architecture</p>
</div>
<p>The workflow for the solution overall is as follows:</p>
<ol>
<li>An administrator publishes a CRL and its signing CA’s certificate to their non-public S3 bucket, which is accessible by the Lambda authorizer and preprocessor roles.</li>
<li>An S3 event invokes the Lambda preprocessor to run upon CRL upload. The function retrieves the CRL from S3, validates its signature against the issuing certificate, and parses the CRL.</li>
<li>The preprocessor Lambda stores the results in an S3 bucket with a name in the form <em></em>.cache.json.</li>
<li>A TLS client requests an mTLS connection and supplies its certificate.</li>
<li>API Gateway completes mTLS negotiation and invokes the Lambda authorizer.</li>
<li>The Lambda authorizer function parses the client’s mTLS certificate, retrieves the cached CRL object, and searches the object for the serial number of the client’s certificate.</li>
<li>The authorizer function returns a deny policy if the certificate is revoked or in error.</li>
<li>If the request is authorized, API Gateway proceeds with the integrated function; otherwise, it denies the client’s request.</li>
</ol>
<h2>Conclusion</h2>
<p>In this post, I presented a design for validating your API Gateway mutual TLS client certificates against a CRL, with support for extra-large certificate revocation files. This approach will help you align with the best security practices for validating client certificates and use advanced S3 access and Lambda caching techniques to minimize time and latency for validation.</p>
<p>If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the <a href="https://repost.aws/topics/TAEEfW2o7QS4SOLeZqACq9jA/security-identity-compliance" rel="noopener" target="_blank">AWS Security, Identity, and Compliance re:Post</a> or <a href="https://console.aws.amazon.com/support/home" rel="noopener" target="_blank">contact AWS Support</a>.</p>