This follows on from my previous article on automating secret rotation in AWS. If you haven’t already, you can read that article here.
In software development, a Proof of Concept (PoC) is a great way to validate your idea, by building a small project based around a simple use-case. However, once you have completed the PoC, you need to decide what to do next.
Does the PoC meet the initial use-case? Is it worth spending any more time on the idea? Is there a better solution?
You have to consider all of these questions to ensure the feature you are going to spend time and effort on will deliver the value you expect to your users. In our case, we reviewed the PoC and found that it was a feature we wanted to develop further, but this, unfortunately, wasn’t as simple as just deploying the code from the PoC into Production and walking away happy.
In my previous article, I explained how we built the Proof of Concept to see if it would be possible to automate the rotation of our access tokens and secret keys. When we set up the use-case for this PoC, we were trying to rotate one secret value used in one CloudFormation stack. However, in a production environment, the use-case would be much more complicated than this. We could feasibly have a situation where we need to rotate six secret values, used by over a hundred CloudFormation stacks. In this scenario, our PoC would fail straight away, and I will quickly summarise the process to explain why:
- We update a secret value in Secrets Manager
- A CloudWatch Event then triggers a Step Function
- The Step Function first finds all the CloudFormation stacks that need to be updated and then updates them one by one, removing and then recreating the resources in each stack
Updating two values in quick succession with the above process fails because two Step Functions run simultaneously and try to update the same CloudFormation stacks at the same time. The stacks ended up in a failed state because the second Step Function attempted to delete resources that the first Step Function had already removed.
The easiest way to solve this problem was to wait in between updating each secret value to ensure the Step Function had updated all of the CloudFormation stacks successfully. However, this defeated the purpose of what we were ultimately trying to achieve by having an automated solution. What we needed was a way to ensure only one Step Function would be running at once. To accomplish this, we decided to use an SQS queue to control when a Step Function was triggered. The new process would be as follows:
- We update a secret value in Secrets Manager
- A CloudWatch Event then adds a ticket to the queue
- A Lambda function checks the SQS queue for tickets
- The Lambda function picks up the ticket and checks to see if a Step Function is currently running. One of two things will then happen: a) If no Step Function is running, the Lambda function will trigger one. b) If a Step Function is running, the ticket will be left in the SQS queue and picked up on the next check.
Adding a queue to the process gave us complete confidence that, no matter how many times a secret value was updated, only one Step Function would run at once.
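The queue check described above could be sketched roughly as follows. This is a minimal illustration rather than our actual implementation: the function name `process_next_ticket`, the shape of the execution input, and the `QUEUE_URL` and `STATE_MACHINE_ARN` environment variables are all assumptions made for the example. The sketch checks for a running execution before reading from the queue, so a waiting ticket is never touched while a Step Function is in progress.

```python
import json


def process_next_ticket(sqs_client, sfn_client, queue_url, state_machine_arn):
    """Start the Step Function for one queued ticket, unless one is already running.

    Returns the new execution ARN, or None if nothing was started.
    """
    # Is an execution of the state machine already in progress?
    running = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="RUNNING",
        maxResults=1,
    )
    if running["executions"]:
        # Leave the ticket in the queue; the next scheduled check will retry.
        return None

    # Take at most one ticket so only one execution is started per check.
    response = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = response.get("Messages", [])
    if not messages:
        return None

    ticket = messages[0]
    execution = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"ticket": ticket["Body"]}),  # hypothetical input shape
    )
    # Remove the ticket only after the execution has started successfully.
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=ticket["ReceiptHandle"])
    return execution["executionArn"]


def lambda_handler(event, context):
    # Imported lazily so process_next_ticket can be exercised with fake clients.
    import os
    import boto3

    return process_next_ticket(
        boto3.client("sqs"),
        boto3.client("stepfunctions"),
        os.environ["QUEUE_URL"],          # hypothetical environment variable
        os.environ["STATE_MACHINE_ARN"],  # hypothetical environment variable
    )
```

Passing the clients in as arguments keeps the decision logic separate from AWS, which makes it easy to test. Note that this sketch assumes only one copy of the Lambda function runs at a time; two concurrent invocations could both see no running execution and each start one.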
Now we had a feature which, following thorough testing, was ready to be deployed to our Production environment. However, this is not the end of the development for this feature. In software development, you should always try to embrace the idea of Inspect and Adapt. So let’s break down what I mean by this.
Inspecting is all about gathering feedback from your users: finding out what works well, what doesn’t, and what is missing. We want to ensure that everything we do delivers value to our users, and to do this we must get their feedback.
Adapting is all about taking that feedback and using it to develop improvements to the product.
In the case of our Secret Key Rotation tool, through user feedback, we have identified three areas to improve:
- Currently, if a stack fails to update, the entire process will exit and reset itself. For example, if stack 50 of 65 fails to update, then the next time the Step Function runs, all 65 stacks will need to be updated again. By improving the logic of the Step Function, we could skip stack 50 and continue to update the remaining CloudFormation stacks. Stack 50 could then be updated manually by the user, which would save time.
- We could generate a report detailing whether any CloudFormation stacks failed to update. This would save the user from having to go through logs to find which stack failed.
- We should include the ability to send the report directly to a user, via platforms like Slack. With this update, the user could change a value in Secrets Manager and then carry on with their work. They would no longer need to keep checking back to ensure the process has executed successfully.
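To make the first two improvements concrete, here is a rough sketch of "skip the failing stack, carry on, and report at the end". It is plain Python rather than Step Function logic, and it is not code from our tool: `update_stacks`, `format_report`, and the `update_stack` callable are hypothetical names used only for illustration.

```python
def update_stacks(stack_names, update_stack):
    """Try every stack, collecting failures instead of stopping at the first one."""
    updated, failed = [], {}
    for name in stack_names:
        try:
            update_stack(name)  # hypothetical callable that updates one stack
            updated.append(name)
        except Exception as error:  # a real version would catch narrower errors
            failed[name] = str(error)
    return {"updated": updated, "failed": failed}


def format_report(result):
    """Build a short plain-text summary that could be posted to a chat channel."""
    total = len(result["updated"]) + len(result["failed"])
    lines = [f"{len(result['updated'])} of {total} stacks updated."]
    for name, reason in result["failed"].items():
        lines.append(f"FAILED {name}: {reason}")
    return "\n".join(lines)
```

With this shape, a failure in stack 50 of 65 would leave the other 64 stacks updated, and the report would name the one stack that needs manual attention.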
Now, we could have held off deploying this product to Production until we had implemented these three updates. However, by releasing when we did, we were able to gather user feedback to see which improvement is most important. Once the next update is released, we will again go through the process of Inspect and Adapt to see how we can next deliver the most value to our users.
Software development is all about releasing little and often so that we can quickly realise value and generate user feedback. We use this feedback to help us make the right changes to our product and maximise the experience of our users.
We are looking into the possibility of open-sourcing this project in the future, so if you would be interested in this or any of our other open-source projects, please have a look at our GitHub page.