Infrastructure as code - only as good as you make it!
In my current project I work as an infrastructure architect rather than as the application architect I normally am, because of a shortage of resources. We had to set up a large infrastructure on IBM Cloud, using three data centers, bare metal servers, IKS clusters, and various cloud services. Of course, infrastructure as code was a given from the beginning, with a gitops repository to store the infrastructure code.
However, what I learned is that "infrastructure as code" alone is not a solution. It's only the beginning!
tl;dr
- define a clear toolset right from the beginning, with a clear purpose for each tool
- plan in the infrastructure for the toolset itself
- make sure your team knows the toolset well (enough) and uses it
- find the right level of abstraction in your (infrastructure) code - separate out specifics into variables, and express differences with logic constructs (loops, ifs)
- manage dependencies between tools - if one tool creates infrastructure, pass the necessary information on to downstream tools automatically, not manually.
Multiple tools
In the project there were multiple tools defined for the deployment, for example:
- Terraform: create e.g. bare metal servers, IKS clusters, and cloud services in the IBM cloud
- Ansible: install software on bare metal servers and other devices
- ArgoCD: deploy software onto IKS clusters
- kubectl: manage the IKS clusters
In principle those are good tools. However, they were used by different teams, and so some of the configuration item types were deployed by multiple tools, as each team saw fit. Some configuration items were deployed with terraform by one team, and with ArgoCD by another team.
Developing an integrated deployment process and fixing all its quirks takes time, so this work should begin right at the start!
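One way to make such overlaps visible is to keep a simple inventory of which tool deploys which type of configuration item, and to check it automatically. The following is only a minimal sketch: the item type names and the inventory format are assumptions made up for illustration, not what we used in the project.

```python
from collections import defaultdict


def find_ownership_conflicts(deployments):
    """Return {item_type: [tools]} for every item type deployed by more than one tool."""
    tools_by_type = defaultdict(set)
    for tool, item_type in deployments:
        tools_by_type[item_type].add(tool)
    return {
        item_type: sorted(tools)
        for item_type, tools in tools_by_type.items()
        if len(tools) > 1
    }


# Hypothetical inventory of (tool, configuration item type) pairs.
DEPLOYMENTS = [
    ("terraform", "bare-metal-server"),
    ("terraform", "kubernetes-secret"),
    ("argocd", "kubernetes-secret"),
    ("ansible", "os-package"),
]

if __name__ == "__main__":
    for item_type, tools in find_ownership_conflicts(DEPLOYMENTS).items():
        print(f"{item_type} is deployed by more than one tool: {', '.join(tools)}")
```

A check like this could run on every change to the gitops repository, so that a second tool taking over an item type is a conscious decision rather than an accident.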
Infrastructure planning
To run the different tools in a production infrastructure you need appropriate infrastructure for the tools themselves.
Some tools were initially run from personal workstations via the cloud VPN, but later we tightened security so that access had to go through a jump server (AKA bastion host). This caused problems, as Terraform is not well suited to being run that way. Installing the tools on the jump server itself was not an option either: we did not want to copy user credentials onto the jump host, which would have made it an even more valuable attack target.
Other tools like ArgoCD require their own server inside the infrastructure. This cost needs to be included in the project budget.
Skills
We had experts for various types of technologies in the project. Those were experts in their own area, like databases, or network firewalls.
However, not all of them were accustomed to using e.g. an Ansible playbook or a gitops repository. As a result, part of the configuration was initially not managed in the gitops repository, and we had to migrate it into code after the configuration had already been applied. Considering the improvements that can be gained from IaC, in future projects I would avoid this by using IaC and gitops right from the start, and by training the team where needed.
Level of abstraction
Infrastructure as code has two main advantages:
- Automation: if your infrastructure configuration is described as code, you can always re-create the infrastructure by re-running the code. This is especially useful in case of hardware failures (that we had), where you can re-create a configuration easily on new hardware.
- Repeatability: you can re-create an infrastructure multiple times, e.g. for test or development environments. However, the IaC needs to be prepared for this.
If your infrastructure code contains specifics like VLAN numbers, or host names that contain the stage (production vs. test) or the datacenter, it is difficult if not impossible to re-use that code for other environments.
If this happens, your infrastructure as code does not deliver the promised benefits. You are just writing your installation scripts in a different format, with potentially more effort for the infrastructure that the IaC tooling itself needs.
To achieve real repeatability, the level of abstraction needs to be such that you can
- abstract away specifics like datacenters, stage, etc. into variables
- use logic (like loops or if clauses) to define differences between stages (e.g. scaling with loops, different types of services), or environments.
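The two points above can be illustrated with a small, tool-neutral sketch in Python. It is not Terraform or Ansible syntax, and every name in it (the `Environment` fields, the host name pattern, the VLAN numbers, the `monitoring` setting) is a made-up assumption. The point is only that the specifics live in variables, while one shared definition uses a loop and a condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """All environment specifics live here, not in the definition itself."""
    stage: str         # e.g. "prod" or "test"
    datacenter: str    # hypothetical datacenter label
    vlan_id: int       # hypothetical VLAN number
    worker_count: int  # scaling differs per stage


def render_hosts(env: Environment) -> list[dict]:
    """Build host definitions for one environment from a single shared definition."""
    hosts = []
    # Loop: scaling is driven by a variable instead of copy-pasted blocks.
    for index in range(1, env.worker_count + 1):
        hosts.append({
            "hostname": f"{env.stage}-{env.datacenter}-worker-{index:02d}",
            "vlan": env.vlan_id,
            # Condition: a stage-specific difference expressed as logic.
            "monitoring": "full" if env.stage == "prod" else "basic",
        })
    return hosts


# Illustrative values only.
PROD = Environment(stage="prod", datacenter="dc1", vlan_id=100, worker_count=3)
TEST = Environment(stage="test", datacenter="dc2", vlan_id=200, worker_count=1)

if __name__ == "__main__":
    for env in (PROD, TEST):
        for host in render_hosts(env):
            print(host)
```

Adding a further environment then means adding one more `Environment` value, not copying and editing the definitions.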
Manage Dependencies
Finally, and probably most important: your infrastructure as code has dependencies between tools. For example, Terraform creates some services, and with them DNS names or IP addresses, and ArgoCD then needs to deploy CoreDNS entries for those IP addresses. You definitely do not want to copy those values over manually!
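As a minimal sketch of such a hand-over, the following Python function reads the JSON that `terraform output -json` prints and turns it into hosts-file style lines (IP address followed by name) that a downstream DNS configuration could consume. The convention that address outputs end in `_ip`, the output names, and the domain are assumptions for illustration only; our actual setup differed.

```python
import json


def dns_records_from_terraform(outputs_json: str, domain: str) -> list[str]:
    """Turn `terraform output -json` text into hosts-file style lines."""
    outputs = json.loads(outputs_json)
    records = []
    for name, output in sorted(outputs.items()):
        # Assumed convention: outputs that hold an address are named "<service>_ip".
        if not name.endswith("_ip"):
            continue
        # Never copy values that Terraform marks as sensitive.
        if output.get("sensitive"):
            continue
        service = name[: -len("_ip")].replace("_", "-")
        records.append(f"{output['value']} {service}.{domain}")
    return records


# Sample input in the shape produced by `terraform output -json`
# (output names and addresses are made up).
SAMPLE = json.dumps({
    "database_ip": {"sensitive": False, "type": "string", "value": "10.0.0.5"},
    "message_queue_ip": {"sensitive": False, "type": "string", "value": "10.0.0.6"},
    "cluster_name": {"sensitive": False, "type": "string", "value": "demo"},
})

if __name__ == "__main__":
    for line in dns_records_from_terraform(SAMPLE, "internal.example"):
        print(line)
```

In a pipeline, the input would come from the Terraform run itself, and the generated lines could be committed to the gitops repository so that the downstream tool picks them up without anyone copying values by hand.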
Make sure you know the dependencies between your IaC tools and definitions from the beginning, and find a way to handle them automatically!