Continuous Delivery: From Bot to Drone (part 2)
In part one, I covered “Regression tests” and “Building manageable artifacts”. Here are the remaining steps.
3. Lifecycle management
Using AWS and EC2 Container Service (ECS) makes life easier because it manages your container fleet out of the box.
With a simple CloudFormation stack, it’s possible to create and update the required infrastructure that hosts the bot with every code change. It’s fully automated.
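A minimal sketch of what "create and update with every code change" can look like with Boto3 — this is illustrative, not the actual pipeline script, and the stack name and capabilities are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

cfn = boto3.client("cloudformation")

def deploy(stack_name, template_body, parameters):
    """Create the stack on the first run, update it on every run after that."""
    try:
        cfn.create_stack(StackName=stack_name, TemplateBody=template_body,
                         Parameters=parameters, Capabilities=["CAPABILITY_IAM"])
    except ClientError as err:
        if err.response["Error"]["Code"] != "AlreadyExistsException":
            raise
        # update_stack raises a ValidationError if nothing changed;
        # handling that case is omitted for brevity.
        cfn.update_stack(StackName=stack_name, TemplateBody=template_body,
                         Parameters=parameters, Capabilities=["CAPABILITY_IAM"])
```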
ECS requires EC2 instances to create the container fleet. Instead of using regular EC2 instances, I decided to use Spot Instances because:
- They are easy to manage. The Spot management console (beta) recently launched by AWS makes it even easier to manage instances and scale your cluster.
- The second reason, obviously, is price. Spot Instances are (sometimes a lot) cheaper than On-Demand instances. Try it out and be pleasantly surprised!
However, using Spot Instances implies that your service must be replaceable: at any given time an instance might be reclaimed by AWS. This is why it's important to put them into an AutoScalingGroup, as in the sketch below. For this particular bot we are not really worried about the service's resiliency, but in more critical use cases your services should be able to handle this situation gracefully.
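To give an idea of the shape, here is a minimal CloudFormation fragment (written as a Python dict for brevity) pairing a Spot-priced launch configuration with an AutoScalingGroup. The AMI, subnet, and bid price are placeholders, not values from the actual template:

```python
# Spot-backed ECS instances kept alive by an AutoScalingGroup: if AWS
# reclaims an instance, the group simply launches a replacement.
spot_cluster_resources = {
    "EcsLaunchConfig": {
        "Type": "AWS::AutoScaling::LaunchConfiguration",
        "Properties": {
            "ImageId": "ami-xxxxxxxx",            # an ECS-optimised AMI
            "InstanceType": "t2.medium",
            "SpotPrice": "0.02",                  # the bid that makes it a Spot Instance
            "IamInstanceProfile": {"Ref": "EcsInstanceProfile"},
        },
    },
    "EcsAutoScalingGroup": {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "LaunchConfigurationName": {"Ref": "EcsLaunchConfig"},
            "VPCZoneIdentifier": ["subnet-xxxxxxxx"],
            "MinSize": "1",
            "MaxSize": "2",
            "DesiredCapacity": "1",
        },
    },
}
```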
4. Configuration management
This one is very tricky. Anyone in the software industry who's dealt with configuration management will admit that it's not as easy as it sounds. For our particular use case, only a couple of things must be handled carefully: the Slack token and the AWS credentials. The remaining non-confidential, static settings I left in a simple configuration file within the code base.
AWS credentials are the trickiest – I don't want them committed to the source code repository, nor does passing them as Docker environment variables sound secure. As it's AWS, and many people have this problem, there is a simple solution:
- AWS exposes instance metadata to your EC2 instances; with an IAM role attached to the instance, temporary credentials can be read from it at run time.
- Libraries like Boto3 read this metadata to authenticate against AWS, so I don't even need to write any code to do that. Sounds legit.
Using instance metadata and Boto3 means I have precise control over my app's permissions and I don't need to provide AWS credentials explicitly to the app!
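In practice, that means the app's AWS calls need zero credential plumbing. For example (the region is a placeholder):

```python
import boto3

# No access keys anywhere in the code or the image: with an IAM role
# attached to the instance, botocore fetches temporary credentials
# from the instance metadata service automatically.
ecs = boto3.client("ecs", region_name="eu-west-1")
print(ecs.list_clusters()["clusterArns"])
```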
What about our Slack token? This I left as a Docker environment variable since it's not as critical. Additionally, we store the Slack token in the CircleCI build environment, not in GitHub.
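Inside the container, the daemon then only ever reads it from the environment — something like this, where `SLACK_TOKEN` is a name of my choosing, not necessarily the one used in the project:

```python
import os

# Set in the CircleCI build environment and passed to the container
# with `docker run -e SLACK_TOKEN=...`; never committed to the repo.
SLACK_TOKEN = os.environ["SLACK_TOKEN"]
```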
Finally, how does a stateless, re-spawnable Drone securely access the Elasticsearch endpoints in AWS? AWS doesn't let you access the Elasticsearch API anonymously, as logs may contain sensitive data like IP addresses and other customer data.
Fortunately, there is a solution for that too: URL signing. Surfing through the AWS documentation didn't initially turn up a single SDK call for it, so I started to write my own signing code, which soon got messy. Someone must have done this already, so I turned back to Google and finally found aws-requests-auth. Hooking Boto3 up with aws-requests-auth solved our URL signing problem. It was a struggle, but a nice learning experience nevertheless 🙂
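The resulting hookup is pleasantly small. A sketch — the host, region, and query are placeholders — where BotoAWSRequestsAuth picks up credentials the same way Boto3 does (here, the instance role from above) and SigV4-signs every request it decorates:

```python
import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

es_host = "search-mydomain.eu-west-1.es.amazonaws.com"
auth = BotoAWSRequestsAuth(aws_host=es_host,
                           aws_region="eu-west-1",
                           aws_service="es")

# Each request is signed with the instance-role credentials, so the
# Elasticsearch domain never has to be opened up anonymously.
resp = requests.get("https://{}/logs/_search?q=error".format(es_host), auth=auth)
print(resp.json())
```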
You might think that the CloudFormation stack's JSON parameter file could simply be committed with its parameters filled in. But you'd be wrong. Remember our Slack token? As it has to be provided to the ECS service, and ultimately to the Python daemon, we inject it into the CloudFormation parameters at build time with a simple sed command. This was probably the ugliest step, but we ran out of time.
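For the curious, the idea behind that build step, shown here in Python rather than sed — the file name and parameter key are hypothetical:

```python
import json
import os

# The committed parameter file holds a dummy value; the real token
# comes from the CI environment at build time.
with open("cfn-parameters.json") as f:
    params = json.load(f)

for p in params:
    if p["ParameterKey"] == "SlackToken":
        p["ParameterValue"] = os.environ["SLACK_TOKEN"]

with open("cfn-parameters.json", "w") as f:
    json.dump(params, f, indent=2)
```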
There are many more things on the TODO list for configuration management: using key vaults or similar services to provide credentials at run time is a broader topic to explore.
5. Deployment
Perhaps the easiest part when you have all the others properly in place: just wire them together and flick the switch, and everything is automated.
Because I leveraged ECS to run the service (as a Task), deployment is fairly simple. After pushing the new Docker image to ECR, I alter the appropriate CloudFormation parameters and update the stack. CloudFormation is smart enough to update only the task and, after the new Drone is successfully created, it deletes the old one. Thus even the downtime is nearly invisible to the end user.
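The whole deployment step then boils down to a single API call. A sketch, with hypothetical stack and parameter names:

```python
import boto3

cfn = boto3.client("cloudformation")

# Only the image tag changes per deploy; everything else is carried over.
cfn.update_stack(
    StackName="drone-bot",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "DockerImage",
         "ParameterValue": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/drone:a1b2c3d"},
        {"ParameterKey": "SlackToken", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM"],
)
```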
Of course, this is a polling service, not a web service, and there are no load balancers in place. The resulting blue/green deployment briefly has two tasks actively polling the Slack queue, which might not be desirable in some production scenarios. I assume that in those situations a locking mechanism for the queue should be in place to avoid consuming duplicate messages.
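One way such a lock could work — this is a sketch of the idea, not something implemented in the bot, and the DynamoDB table name is hypothetical — is to claim each Slack message by its unique timestamp with a conditional write before handling it:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def claim_message(ts):
    """Return True only for the first poller that claims message `ts`."""
    try:
        dynamodb.put_item(
            TableName="drone-message-locks",
            Item={"ts": {"S": ts}},
            ConditionExpression="attribute_not_exists(ts)",
        )
        return True                  # we won the race: process the message
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False             # the other Drone already has it
        raise
```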
Bonus:
I decided to hook our new Drone up to the BotLibre ALICE bot for more interactive conversations. After all, giving feedback to a bot is much easier than giving it to a real human being, so let's make it fun at the same time!