Rancher with Terraform on CloudStack
I want to automate everything I can. Terraform is one of the automation tools I've checked out in the past but not thoroughly explored yet. After playing with AWS and Terraform for a while, I became worried I'd let some resources run wild, and they'd start billing my credit card like crazy. I got access to a CloudStack environment, which is fantastic, and decided to build a Rancher cluster against it with Terraform. I'm going to document my journey here in one or multiple posts.
Terraform is interesting. It allows you to create an infrastructure from scratch and then remove every trace of its existence seconds later. Being able to create and destroy like that gives you the flexibility to spin up a cluster when you need it and break it down when you're finished.
This blog post will use Terraform to set up a Rancher server running on RKE, which we deploy on CloudStack.
And we're going to avoid having to do even a single task manually.
Creating the first VM
First things first, I needed a VM on CloudStack. After setting up API keys in my account and writing down the bare minimum configuration for the Terraform CloudStack provider, I added the first cloudstack_instance resource.
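A minimal sketch of what that main.tf could look like; the provider version, API endpoint, zone, template, and service offering names below are placeholders you'd swap for whatever your CloudStack environment offers, and the API keys come from variables:

```hcl
terraform {
  required_providers {
    cloudstack = {
      source  = "cloudstack/cloudstack"
      version = "~> 0.4" # assumption: any recent provider release
    }
  }
}

provider "cloudstack" {
  api_url    = "https://cloud.example.com/client/api" # placeholder endpoint
  api_key    = var.cloudstack_api_key
  secret_key = var.cloudstack_secret_key
}

resource "cloudstack_instance" "test" {
  name             = "rancher-test"
  service_offering = "Medium Instance" # placeholder offering name
  template         = "Ubuntu 20.04"    # placeholder template name
  zone             = "zone-1"          # placeholder zone name
  expunge          = true
}
```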
To initialize Terraform and let it download the binaries for the requested providers, we run terraform init. After that, all that's left to do to see something running is terraform apply.
Cool! The first machine is running, as can be seen from the UI. You can find the IP in the UI or by running terraform show. You'll likely get no response when you ping this machine, though. That's because the firewall still denies all traffic.
Setting up the security groups
To be able to access the machine, you'll have to add rules to the default security group. You can read more about them here. Adding rules can be done manually, but so could everything else, and we're here to automate, so we're using Terraform.
I've added the following security group and two security group rules in a new file called security_groups.tf. Terraform reads all *.tf files in the directory, so we don't have to worry about including them in main.tf. With these rules, the whole world can ping the machine, but only we can reach SSH.
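A sketch of what security_groups.tf could contain; the group name, the rule resource names (the second one matching the Home-Ruleset mentioned later), and the 1.2.3.4/32 home address are assumptions:

```hcl
resource "cloudstack_security_group" "rancher" {
  name        = "rancher-sg"
  description = "Security group for the Rancher test machine"
}

# Let the whole world ping the machine
resource "cloudstack_security_group_rule" "world" {
  security_group_id = cloudstack_security_group.rancher.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "icmp"
    icmp_type = -1
    icmp_code = -1
  }
}

# Only allow SSH from our own IP address
resource "cloudstack_security_group_rule" "Home-Ruleset" {
  security_group_id = cloudstack_security_group.rancher.id

  rule {
    cidr_list = ["1.2.3.4/32"] # placeholder: your own public IP
    protocol  = "tcp"
    ports     = ["22"]
  }
}
```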
When Terraform creates a resource, it exports some attributes about it, like the ID of the security group. We can use the ID exported by the security group resource to refer to it from the security group rule. This way, CloudStack knows to which security group a ruleset belongs.
Don't worry about the order of creation. Terraform knows when references depend on each other and creates the needed resources first.
To make the machine use this security group, we must add it to its instance definition.
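Roughly like this, assuming the instance and security group resource names from the sketches above:

```hcl
resource "cloudstack_instance" "test" {
  # ...existing arguments from before...

  security_group_ids = [cloudstack_security_group.rancher.id]
}
```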
Note that changing the security group of an instance results in replacing the machine. Once a VM is assigned to a security group, it stays in that group for its entire lifetime; you cannot move a running VM from one security group to another, which I find annoying.
Applying the new configuration sets up a new machine with the changed security group. We can now ping it and reach the SSH port, but we cannot log in yet.
Adding keys to access the machine
To gain SSH access to the server we just created, we have to give CloudStack a keypair to include when bootstrapping the machine. I've created an RSA key pair using ssh-keygen -t rsa and added the following to the main.tf. You can also use ~/.ssh/id_rsa.pub, of course.
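Something along these lines, assuming the key files are called test_rsa and test_rsa.pub and live next to the Terraform files:

```hcl
resource "cloudstack_ssh_keypair" "test" {
  name       = "test-keypair"
  public_key = file("test_rsa.pub")
}

resource "cloudstack_instance" "test" {
  # ...existing arguments from before...

  keypair = cloudstack_ssh_keypair.test.name
}
```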
Adding the key after the machine is created should be possible, but something goes wrong every time I update it. I don't believe that feature is working correctly right now, so I decided to destroy and re-apply everything.
Now I'm able to SSH into the machine using my test_rsa key. Let's set up the requirements for an RKE cluster.
Installing the required packages
I want to provision the server automatically with the needed docker packages. We could use Ansible for this or have a separate process to create perfect images with Packer, but let's stick to Terraform.
I've added the following to my main.tf:
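A sketch of how that could look with a remote-exec provisioner inside the instance resource; the ubuntu user and the exact version of Rancher's install-docker script are assumptions:

```hcl
resource "cloudstack_instance" "test" {
  # ...existing arguments from before...

  connection {
    type        = "ssh"
    host        = self.ip_address
    user        = "ubuntu"         # assumption: default user of the template
    private_key = file("test_rsa")
  }

  provisioner "remote-exec" {
    inline = [
      # Rancher's preparation script installs a supported Docker version
      "curl -fsSL https://releases.rancher.com/install-docker/20.10.sh | sh",
      "sudo usermod -aG docker ubuntu",
    ]
  }
}
```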
Terraform will not execute this on the machine that already exists. But don't worry, we don't have to fall back to manually logging in and running the commands. Let's just terraform destroy and terraform apply again :)
You'll see that Terraform tries to connect to SSH before the machine is finished starting up, but once it is, the preparation script from Rancher starts running immediately and installs Docker.
Setting up RKE
Terraform can set up an RKE cluster on the machine you just created using the RKE provider. This setup will be a single-node RKE cluster. I've made another file named rke.tf which contains the following:
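Roughly like this; the node user and key path are assumptions carried over from the earlier sketches:

```hcl
resource "rke_cluster" "cluster" {
  nodes {
    address = cloudstack_instance.test.ip_address
    user    = "ubuntu"                           # assumption: template's default user
    role    = ["controlplane", "etcd", "worker"]
    ssh_key = file("test_rsa")
  }
}
```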
I've also added the following to the main.tf:
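That addition boils down to declaring and configuring the rancher/rke provider, something like:

```hcl
terraform {
  required_providers {
    # ...existing providers...
    rke = {
      source  = "rancher/rke"
      version = "~> 1.3" # assumption: any recent provider release
    }
  }
}

provider "rke" {}
```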
After which, you'll need to rerun terraform init to fetch the required provider.
When you run terraform apply now, you'll notice it says it wants to install an RKE cluster using Rancher's hyperkube version v1.21.7-rancher1-1. To use a newer version, you'll have to update a dependency in the RKE provider, but I'll explain how to do that in a separate blog post.
Running terraform apply now results in an error:
Failed running cluster err:[network] Can't access KubeAPI port [6443] on Control Plane host: 4.5.6.7
The RKE provider can't connect to the machine's port 6443. Let's fix that by changing the Home-Ruleset in security_groups.tf:
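Following the earlier sketch, that means adding the Kubernetes API port to the ports list of that rule:

```hcl
resource "cloudstack_security_group_rule" "Home-Ruleset" {
  security_group_id = cloudstack_security_group.rancher.id

  rule {
    cidr_list = ["1.2.3.4/32"] # placeholder: your own public IP
    protocol  = "tcp"
    ports     = ["22", "6443"] # SSH plus the KubeAPI port
  }
}
```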
Now RKE should install just fine. If not, destroy and re-apply. If you keep having random issues, check the available disk space and rke_debug.log.
Getting the kubeconfig.yaml
Of course, we want to access the RKE cluster from our terminal. We can dig the kubeconfig YAML out of terraform show -json, but that's cumbersome.
We can automate this with the local_sensitive_file resource from the hashicorp/local provider. Add the following to rke.tf:
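A minimal sketch, using the rke_cluster resource name from above:

```hcl
resource "local_sensitive_file" "kube_config" {
  content  = rke_cluster.cluster.kube_config_yaml
  filename = "${path.module}/kubeconfig.yaml"
}
```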
And update the main.tf with the new provider:
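That is, add hashicorp/local to the required providers, something like this (local_sensitive_file needs version 2.2.0 or newer):

```hcl
terraform {
  required_providers {
    # ...existing providers...
    local = {
      source  = "hashicorp/local"
      version = ">= 2.2.0"
    }
  }
}
```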
Don't forget to run terraform init!
Running terraform apply writes the kubeconfig.yaml to the local filesystem. You can now talk to the RKE cluster.
marco@DESKTOP-WS:~/tests$ export KUBECONFIG=kubeconfig.yaml
marco@DESKTOP-WS:~/tests$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
5.6.7.8 Ready controlplane,etcd,worker 23m v1.21.7
Installing Rancher
Finally, after all this writing and five iterations of the RKE machine, we're ready to install Rancher. To do this, we'll be using the hashicorp/helm and rancher/rancher2 providers.
Add the providers to main.tf. Also, define the location of the kubeconfig.yaml.
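Something along these lines; the kubeconfig path points at the file we just generated:

```hcl
terraform {
  required_providers {
    # ...existing providers...
    helm = {
      source = "hashicorp/helm"
    }
    rancher2 = {
      source = "rancher/rancher2"
    }
  }
}

provider "helm" {
  kubernetes {
    config_path = "kubeconfig.yaml"
  }
}
```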
Add ports 80 and 443 to security_groups.tf, else you won't be able to access the cluster and Terraform can't bootstrap it.
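For example, as an extra rule next to the existing ones:

```hcl
# Open HTTP and HTTPS to the world so Rancher is reachable
resource "cloudstack_security_group_rule" "web" {
  security_group_id = cloudstack_security_group.rancher.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "tcp"
    ports     = ["80", "443"]
  }
}
```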
Cert-manager will be a dependency of Rancher, so create a new file called certmanager.tf:
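A sketch of a cert-manager Helm release; the chart version is an assumption, so pick one that's supported by your Rancher release:

```hcl
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  version          = "v1.7.1"       # assumption: check Rancher's support matrix
  namespace        = "cert-manager"
  create_namespace = true

  set {
    name  = "installCRDs"
    value = "true"
  }
}
```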
You can use set to override values like you would in a values.yaml.
Next, create a file called rancher.tf:
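A sketch of what rancher.tf could look like; the hostname, the chart version, and the two password variables are assumptions you'll want to adapt:

```hcl
resource "helm_release" "rancher" {
  name             = "rancher"
  repository       = "https://releases.rancher.com/server-charts/latest"
  chart            = "rancher"
  version          = "2.6.3"                # override: latest patch, not the chart default
  namespace        = "cattle-system"
  create_namespace = true

  set {
    name  = "hostname"
    value = "rancher.4.5.6.7.sslip.io"      # assumption: a DNS name pointing at the VM
  }

  set {
    name  = "bootstrapPassword"
    value = var.bootstrap_password          # pick a unique, strong password
  }

  set {
    name  = "replicas"
    value = "1"
  }

  depends_on = [helm_release.cert_manager]
}

# Bootstrap provider: used once to turn the bootstrap password into an API token
provider "rancher2" {
  alias     = "bootstrap"
  api_url   = "https://rancher.4.5.6.7.sslip.io"
  bootstrap = true
  insecure  = true
}

resource "rancher2_bootstrap" "admin" {
  provider         = rancher2.bootstrap
  initial_password = var.bootstrap_password # the password set in the Helm deployment
  password         = var.admin_password     # the new admin password to set
  telemetry        = false

  depends_on = [helm_release.rancher]
}

# Admin provider: used for everything after bootstrapping
provider "rancher2" {
  alias     = "admin"
  api_url   = rancher2_bootstrap.admin.url
  token_key = rancher2_bootstrap.admin.token
  insecure  = true
}
```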
The rancher.tf is one of the bigger Terraform files, and it's where we use the Rancher provider. Here we define:
- The Helm installation of Rancher
- Where the Rancher cluster will be
- A bootstrap provider for Rancher
- An admin provider for Rancher
If you've opened the security groups up wide, make sure to choose a unique, strong password for the initial Rancher Helm deployment.
We override the Rancher version to get the latest patches, as this is not the default.
Using the alias attribute, we can define multiple instances of the same provider. This way, we separate the admin provider from the bootstrap provider.
Once we run terraform apply, we'll see the Rancher server being created.
We can fetch the generated admin password from the cluster by hand, but we can also ask Terraform to write it down:
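For instance, as a sensitive output of the rancher2_bootstrap resource sketched above, readable afterwards with terraform output:

```hcl
output "rancher_admin_password" {
  value     = rancher2_bootstrap.admin.password
  sensitive = true
}
```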
Unforeseen dependency problems
To test this script, we can now run terraform destroy and terraform apply. Terraform will immediately tell you that kubeconfig.yaml does not exist. The file is missing because Terraform hasn't created the cluster yet: the Helm provider needs the kubeconfig file, but that file is only written after the Helm provider has already been initialized. Expanding a Terraform module step by step can introduce these unwanted dependency orders. There is a lot more on this subject in this GitHub issue.
To fix this problem, I've moved a lot of things around. I made three directories:
- Cloud
- RKE
- Rancher
I've moved everything CloudStack-related to Cloud, and so forth.
Breaking apart the monolith
Having everything in one Terraform configuration causes dependency troubles. Besides that, you can't restrict privileges to a specific layer of your infrastructure that way: some people could manage CloudStack, others RKE and Rancher. Breaking the config into small pieces that each only do what they're supposed to do creates more flexibility. It looks a lot cleaner too.
Cloud
I've changed the main.tf and moved the RKE and Rancher provider configuration to the main.tf of those respective directories. Another change is having Cloud write an output after each run to export the IP address to the others.
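That output could look like this in the Cloud configuration:

```hcl
output "ip_address" {
  value = cloudstack_instance.test.ip_address
}
```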
I've also changed all pointers to test_rsa to point to ../test_rsa.
RKE
RKE now has to know what Cloud's data output was. To do this, we've got to add this small config to the main.tf of RKE and make it aware of Cloud's data.
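Assuming both directories use the default local state files, a terraform_remote_state data source does the trick:

```hcl
data "terraform_remote_state" "cloud" {
  backend = "local"

  config = {
    path = "../cloud/terraform.tfstate"
  }
}
```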
Change the cloudstack_instance.test.ip_address to data.terraform_remote_state.cloud.outputs.ip_address in rke.tf.
Change the test_rsa path to ../test_rsa. You should do the same with the pub file.
Rancher
The only change needed here is to point to the correct location of kubeconfig.yaml, which is ../rke/kubeconfig.yaml.
Testing it again
To test the complete setup, enter the Cloud directory first and apply, then move on to the next directory, RKE. Once RKE is set up, move to the Rancher directory and apply again.
Conclusion
With the installation of Rancher, we've come to the end of this blog post. The next post will be about provisioning the configuration for multiple servers efficiently and growing the Rancher setup. We'll also add an extra cluster to the Rancher instance.