Rancher with Terraform on CloudStack
I want to automate everything I can. Terraform is one of the automation tools I've checked out in the past but not thoroughly explored yet. After playing with AWS and Terraform for a while, I became worried I'd let some resources run wild, and they'd start billing my credit card like crazy. I got access to a CloudStack environment, which is fantastic, and decided to build a Rancher cluster against it with Terraform. I'm going to document my journey here in one or multiple posts.
Terraform is interesting. It allows you to create an infrastructure from scratch and also remove every trace of it in seconds. Being able to create and destroy like this gives you the flexibility to spin up a cluster when needed and break it down when finished.
This blog post will use Terraform to set up a Rancher server running on RKE, which we deploy on CloudStack.
And we're going to avoid having to do even a single task manually.
Creating the first VM
First things first, I needed a VM on CloudStack. After setting up API keys in my account and writing down the Terraform CloudStack provider's bare minimum, I added the first CloudStack instance resource.
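The exact values depend on your CloudStack environment, but a minimal sketch of that bare minimum could look like this (the service offering, template, and zone names are placeholders you'll need to adapt):

```hcl
terraform {
  required_providers {
    cloudstack = {
      source = "cloudstack/cloudstack"
    }
  }
}

provider "cloudstack" {
  api_url    = var.cloudstack_api_url
  api_key    = var.cloudstack_api_key
  secret_key = var.cloudstack_secret_key
}

# The first VM; offering, template, and zone are example values.
resource "cloudstack_instance" "rancher" {
  name             = "rancher-host"
  service_offering = "Medium Instance"
  template         = "Ubuntu 20.04"
  zone             = "zone-1"
  expunge          = true
}
```

The API URL and keys come from variables here so the credentials stay out of the configuration files.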
To initialize Terraform and let it download the binaries needed for the requested providers, we run terraform init. After that, all that's left to do to see something running is terraform apply.
Cool! The first machine is running, as can be seen from the UI. You can find the IP in the UI or by running terraform show. You'll likely get no response when you ping this machine. That's because the firewall still denies all traffic.
Setting up the security groups
To be able to access the machine, you'll have to add rules to the default security group. You can read more about them here. Adding rules can be done manually, but so can everything else, so we're using Terraform.
I've added the following security group and two security group rules in a new file called security_groups.tf. Terraform reads all *.tf files in the directory, so we don't have to worry about including them in main.tf. With these rules, the world can ping the machine, but only we can access SSH.
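A sketch of what security_groups.tf could contain under these assumptions (the resource names are mine, and the SSH CIDR is a placeholder for your own IP address):

```hcl
resource "cloudstack_security_group" "default" {
  name        = "rancher-sg"
  description = "Security group for the Rancher host"
}

# Allow ICMP (ping) from anywhere.
resource "cloudstack_security_group_rule" "ping" {
  security_group_id = cloudstack_security_group.default.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "icmp"
    icmp_type = -1
    icmp_code = -1
  }
}

# Allow SSH only from our own IP (example value).
resource "cloudstack_security_group_rule" "ssh" {
  security_group_id = cloudstack_security_group.default.id

  rule {
    cidr_list = ["203.0.113.10/32"]
    protocol  = "tcp"
    ports     = ["22"]
  }
}
```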
When Terraform creates a resource, it exports some attributes about it, like the ID of the security group. We can use the ID exported by the security group resource to refer to it from the security group rule. This way, CloudStack knows to which security group a ruleset belongs.
Don't worry about the order of creation. Terraform knows when references depend on each other and creates the needed resources first.
To make the machine use this security group, we must add it to its instance definition.
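Assuming the security group resource name from before, that addition could look like:

```hcl
resource "cloudstack_instance" "rancher" {
  # ... existing arguments ...

  # Attach the machine to our security group.
  security_group_ids = [cloudstack_security_group.default.id]
}
```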
Note that changing the security group of an instance results in replacing the machine. Once a VM is assigned to a security group, it remains in that group for its entire lifetime; you cannot move a running VM from one security group to another, which I find annoying.

Applying the new configuration sets up a new machine with the changed security group. We can now ping the machine and reach the SSH port, but we cannot log in yet.
Adding keys to access the machine
To gain SSH access to the server we just created, we have to give CloudStack a keypair to include when bootstrapping the machine.
I've created an RSA key pair using ssh-keygen -t rsa and added the following to the main.tf. You can also use ~/.ssh/id_rsa.pub, of course.
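A sketch of that addition, assuming the key files live next to the configuration:

```hcl
# Register the public key with CloudStack.
resource "cloudstack_ssh_keypair" "test" {
  name       = "test_rsa"
  public_key = file("test_rsa.pub")
}

resource "cloudstack_instance" "rancher" {
  # ... existing arguments ...

  # Have CloudStack inject this key when bootstrapping the machine.
  keypair = cloudstack_ssh_keypair.test.name
}
```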
Adding the key after the machine is created should be possible, but something goes wrong every time I update it. I don't believe that feature is working correctly right now, so I decided to destroy and re-apply everything.
Now I'm able to SSH into the machine using my test_rsa key. Let's set up the requirements for an RKE cluster.
Installing the required packages
I want to provision the server automatically with the needed Docker packages. We could use Ansible for this, or have a separate process that builds perfect images with Packer, but let's stick to Terraform.
I've added the following to my instance resource.
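A sketch of what such a provisioner could look like, assuming an Ubuntu template and Rancher's published Docker install script:

```hcl
resource "cloudstack_instance" "rancher" {
  # ... existing arguments ...

  # How Terraform reaches the machine for provisioning.
  connection {
    type        = "ssh"
    host        = self.ip_address
    user        = "ubuntu" # depends on your template
    private_key = file("test_rsa")
  }

  # Rancher publishes version-pinned Docker install scripts.
  provisioner "remote-exec" {
    inline = [
      "curl -fsSL https://releases.rancher.com/install-docker/20.10.sh | sh",
    ]
  }
}
```

Provisioners only run when the resource is created, which is why the next step is a destroy and re-apply.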
Terraform will not execute this directly. But don't worry, we don't have to fall back to manually logging in and running the commands. Let's just terraform destroy and terraform apply again :)
You'll see that Terraform tries to connect over SSH before the machine has finished starting up, but once it has, the preparation script from Rancher starts running immediately and installs Docker.
Setting up RKE
Terraform can set up an RKE cluster on the machine you just created using the RKE provider. This setup will be a single-node RKE cluster. I've made another file named rke.tf which contains the following:
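A sketch of such a single-node rke_cluster resource, assuming the instance resource and key names used earlier:

```hcl
resource "rke_cluster" "cluster" {
  # One node that fulfils all three roles.
  nodes {
    address = cloudstack_instance.rancher.ip_address
    user    = "ubuntu" # assumed template user
    role    = ["controlplane", "etcd", "worker"]
    ssh_key = file("test_rsa")
  }
}
```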
I've also added the following to the main.tf, after which you'll need to rerun terraform init to fetch the required provider.
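The addition is the RKE provider in the required_providers block:

```hcl
terraform {
  required_providers {
    rke = {
      source = "rancher/rke"
    }
  }
}
```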
When you run terraform apply now, you'll notice it says it wants to install an RKE cluster using Rancher's hyperkube version v1.21.7-rancher1-1. To use a newer version, you'll have to update a dependency in the RKE provider, but I'll explain how to do that in a separate blog post.
When you run terraform apply, you'll notice an error:

```
Failed running cluster err:[network] Can't access KubeAPI port [6443] on Control Plane host: 18.104.22.168
```
The RKE provider can't connect to the machine's port 6443. Let's fix that by adding a rule to the security group.
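In security_groups.tf, that extra rule could look like this (open to the world here for simplicity; you could restrict it to your own IP):

```hcl
# Allow access to the Kubernetes API server.
resource "cloudstack_security_group_rule" "kubeapi" {
  security_group_id = cloudstack_security_group.default.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "tcp"
    ports     = ["6443"]
  }
}
```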
Now RKE should install just fine. If not, destroy and re-apply. If you keep having random issues, check the available disk space.
Getting the kubeconfig.yaml
Of course, we want to access the RKE cluster from our terminal. We can see the kubeconfig YAML with terraform show -json, but that's highly inefficient.
We can automate it away using the local_sensitive_file resource from the hashicorp/local provider, and update the main.tf with the new provider used.
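A sketch of both pieces, assuming the rke_cluster resource name from rke.tf:

```hcl
# In main.tf: register the new provider.
terraform {
  required_providers {
    local = {
      source = "hashicorp/local"
    }
  }
}

# Write the kubeconfig that RKE generated to disk,
# marked sensitive so its contents stay out of the plan output.
resource "local_sensitive_file" "kube_config" {
  filename = "kubeconfig.yaml"
  content  = rke_cluster.cluster.kube_config_yaml
}
```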
Don't forget to run terraform init!
Running terraform apply writes the kubeconfig.yaml to the local filesystem. You can now talk to the RKE cluster:
```
$ export KUBECONFIG=kubeconfig.yaml
$ kubectl get nodes
NAME             STATUS   ROLES                      AGE   VERSION
22.214.171.124   Ready    controlplane,etcd,worker   23m   v1.21.7
```
Installing Rancher
Finally, after all the writing and five iterations of the RKE machine, we're ready to install Rancher. To do this, we'll be using the hashicorp/helm and rancher/rancher2 providers.
Add the providers to main.tf. Also, define the location of the kubeconfig.yaml for the Helm provider.
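A sketch of those additions to main.tf:

```hcl
terraform {
  required_providers {
    helm = {
      source = "hashicorp/helm"
    }
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

# Point Helm at the kubeconfig the RKE step wrote out.
provider "helm" {
  kubernetes {
    config_path = "kubeconfig.yaml"
  }
}
```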
Add ports 80 and 443 to security_groups.tf, or else you won't be able to access the cluster and Terraform can't bootstrap it.
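Following the pattern of the earlier rules, that could look like:

```hcl
# Allow HTTP and HTTPS traffic to the Rancher UI and API.
resource "cloudstack_security_group_rule" "web" {
  security_group_id = cloudstack_security_group.default.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "tcp"
    ports     = ["80", "443"]
  }
}
```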
cert-manager will be a dependency of Rancher, so create a new file called
You can use set blocks to override values like you would in a values.yaml file.
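A sketch of such a cert-manager Helm release (chart repository and the installCRDs value as commonly used; the namespace is an assumption):

```hcl
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true

  # Equivalent to setting installCRDs: true in a values.yaml.
  set {
    name  = "installCRDs"
    value = "true"
  }
}
```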
Next, create a file called rancher.tf. It's one of the bigger Terraform files; here we'll use the Rancher provider and define:
- The Helm installation of Rancher
- Where the Rancher cluster will be
- A bootstrap provider for Rancher
- An admin provider for Rancher
If you've set up the security groups wide open, you should choose a unique, strong password for the initial Rancher Helm deployment.
We override the Rancher version to get the latest patches, as this is not the default.
Using the alias attribute, we can create multiple configurations of the same provider. This way, we separate the admin provider from the bootstrap provider.
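Putting the pieces together, a sketch of rancher.tf under these assumptions (the hostname and password variables are mine, and the pinned chart version is only an example):

```hcl
# The Helm installation of Rancher.
resource "helm_release" "rancher" {
  name             = "rancher"
  repository       = "https://releases.rancher.com/server-charts/latest"
  chart            = "rancher"
  namespace        = "cattle-system"
  create_namespace = true
  version          = "2.6.3" # override to get the latest patches

  # Where the Rancher cluster will be reachable.
  set {
    name  = "hostname"
    value = var.rancher_hostname
  }

  # Unique and strong, since the security groups are wide open.
  set {
    name  = "bootstrapPassword"
    value = var.rancher_bootstrap_password
  }

  depends_on = [helm_release.cert_manager]
}

# Bootstrap provider: turns the bootstrap password into an admin login.
provider "rancher2" {
  alias     = "bootstrap"
  api_url   = "https://${var.rancher_hostname}"
  bootstrap = true
  insecure  = true
}

resource "rancher2_bootstrap" "admin" {
  provider         = rancher2.bootstrap
  initial_password = var.rancher_bootstrap_password
  password         = var.rancher_admin_password
  depends_on       = [helm_release.rancher]
}

# Admin provider: authenticates with the token created during bootstrap.
provider "rancher2" {
  alias     = "admin"
  api_url   = rancher2_bootstrap.admin.url
  token_key = rancher2_bootstrap.admin.token
  insecure  = true
}
```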
Once we run terraform apply, we'll see the Rancher server being created.
We can access the generated password by running:
But we can also ask Terraform to write it down:
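Assuming the rancher2_bootstrap resource sketched earlier, either an output or a local_sensitive_file works:

```hcl
# Readable afterwards with: terraform output -raw rancher_admin_password
output "rancher_admin_password" {
  value     = rancher2_bootstrap.admin.current_password
  sensitive = true
}

# Or have Terraform write it to a local file.
resource "local_sensitive_file" "rancher_password" {
  filename = "rancher_password.txt"
  content  = rancher2_bootstrap.admin.current_password
}
```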
Unforeseen dependency problems
To test this script, we can now run terraform destroy and terraform apply. It will immediately tell you that kubeconfig.yaml does not exist. The file is missing because Terraform hasn't initialized the cluster yet. Expanding your Terraform module step by step can cause unwanted dependency ordering: Helm needs the kubeconfig file, but it's only created after the Helm provider is initialized. There is a lot more on this subject in this GitHub issue.
To fix this problem, I've moved a lot of things around. I made three directories: Cloud, RKE, and Rancher. I've moved everything CloudStack related to Cloud, and so forth.
Breaking apart the monolith
Having everything in one Terraform configuration causes dependency troubles. Besides that, you can't split privileges per layer of your infrastructure that way: some people could manage CloudStack, others RKE and Rancher. Breaking the config into small pieces that each do only what they're supposed to creates more flexibility. It looks a lot cleaner, too.
I've changed the main.tf and moved the RKE and Rancher provider configuration to the main.tf of their respective directories. Another change is having Cloud write an output after each run, to export the IP address to the others.
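In Cloud's configuration, that output could look like:

```hcl
# Exported so the RKE and Rancher configurations can read it.
output "ip_address" {
  value = cloudstack_instance.rancher.ip_address
}
```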
I've also changed all pointers from the test_rsa path to ../test_rsa. You should do the same with the pub file.
RKE now has to know what Cloud's data output was. To do this, we've got to add a small config to the main.tf of RKE to make it aware of Cloud's data.
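A sketch of reading Cloud's output from RKE, assuming Cloud uses the default local state file:

```hcl
# Read the state of the Cloud configuration.
data "terraform_remote_state" "cloud" {
  backend = "local"

  config = {
    path = "../Cloud/terraform.tfstate"
  }
}
```

The IP is then available as data.terraform_remote_state.cloud.outputs.ip_address.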
The only change needed in the Rancher directory is pointing to the correct location of kubeconfig.yaml, which is now in the RKE directory.
Testing it again
To test the complete setup, we enter the Cloud directory first, apply, and move on to the next directory, RKE. Once RKE is set up, we move to the Rancher directory and apply once more.
With the installation of Rancher, we've come to the end of this blog post. The next post will be about efficiently provisioning the configuration for multiple servers and growing the Rancher setup. We'll also add an extra cluster to the Rancher instance.