Rancher with Terraform on CloudStack

I want to automate everything I can. Terraform is one of the automation tools I've checked out in the past but haven't thoroughly explored yet. After playing with AWS and Terraform for a while, I became worried I'd let some resources run wild and they'd start billing my credit card like crazy. I got access to a CloudStack environment, which is fantastic, and decided to build a Rancher cluster against it with Terraform. I'm going to document my journey here in one or more posts.

Terraform is interesting. It allows you to create an infrastructure from scratch and then remove every trace of it in seconds. That create-and-destroy cycle gives you the flexibility to spin up a cluster when you need it and break it down when you're finished.

This blog post will use Terraform to set up a Rancher server running on RKE, which we deploy on CloudStack.

And we're going to avoid having to do even a single task manually.

Creating the first VM

First things first, I needed a VM on CloudStack. After setting up API keys in my account and writing the bare minimum configuration for the Terraform CloudStack provider, I added the first CloudStack instance resource.

terraform {
  required_providers {
    cloudstack = {
      source = "cloudstack/cloudstack"
      version = "0.4.0"
    }
  }
}

provider "cloudstack" {
  api_url    = "https://cloud.url/zone/api"
  api_key    = "api_key"
  secret_key = "secret_key"
}

resource "cloudstack_instance" "local_nodes" {
  name             = "local-node"
  service_offering = "VM 4G/4C"
  network_id       = "g56cf51f-93ab-2351-a222-9c9525dc8533"
  template         = "Ubuntu 20.04"
  zone             = "zone.ams.net"
  root_disk_size   = 20 # You'll need at least 10GB of space
  expunge          = true # This removes the VM completely after destroy
}
The Terraform configuration

To initialize Terraform and let it download the binaries needed for the requested providers, we run terraform init. After that, all that's left to do to see something running is terraform apply.
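
For reference, the whole round trip is just these two commands, run from the directory that holds main.tf; the plan below is what apply printed before asking for confirmation:

terraform init
terraform apply
Initializing and applying the configuration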

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # cloudstack_instance.test will be created
  + resource "cloudstack_instance" "test" {
      + display_name     = (known after apply)
      + expunge          = true
      + group            = (known after apply)
      + id               = (known after apply)
      + ip_address       = (known after apply)
      + name             = "test"
      + network_id       = "g56cf51f-93ab-2351-a222-9c9525dc8533"
      + project          = (known after apply)
      + root_disk_size   = 20
      + service_offering = "VM 4G/4C"
      + start_vm         = true
      + tags             = (known after apply)
      + template         = "Ubuntu 20.04"
      + zone             = "zone.ams.net"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

cloudstack_instance.test: Creating...
cloudstack_instance.test: Still creating... [10s elapsed]
cloudstack_instance.test: Creation complete after 10s [id=9c9525dc8533-2592-450d-a774-g56cf51f]
Terraform apply result

Cool! The first machine is running, as can be seen from the UI. You can find the IP in the UI or by running terraform show. You'll likely get no response when you ping this machine. That's because the firewall still denies all traffic.
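
If you just want the address on the command line, one quick-and-dirty option (assuming the attribute is called ip_address, as it is for cloudstack_instance) is:

terraform show | grep ip_address
Grepping the IP out of the state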

Setting up the security groups

To be able to access the machine, you'll have to add rules to the default security group. You can read more about them here. Adding rules can be done manually, but so can everything else, so we're using Terraform.

I've added the following security group and two security group rules in a new file called security_groups.tf. Terraform will read all *.tf files in the directory, so we don't have to worry about including them in main.tf. The world can ping the machine with these rules, but only we can access SSH.

resource "cloudstack_security_group" "Default-SG" {
  name        = "Default-SG"
  description = "Test SG for terraform tests"
}

resource "cloudstack_security_group_rule" "Default-SG-ICMP-Ruleset" {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "icmp"
    icmp_code = -1
    icmp_type = -1
  }
}

resource "cloudstack_security_group_rule" "Default-SG-Home-SSH-Ruleset" {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = ["1.2.3.4/32"] # Your IP address
    protocol  = "tcp"
    ports     = ["22"]
  }
}
The security groups

When Terraform creates a resource, it exports some attributes about it, like the ID of the security group. We can use the ID exported by the security group resource to refer to it from the security group rule. This way, CloudStack knows to which security group a ruleset belongs.

Don't worry about the order of creation. Terraform knows when references depend on each other and creates the needed resources first.
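
If you ever need to force an ordering where no attribute reference exists, Terraform also has an explicit depends_on meta-argument. It's not needed for anything in this post, but a minimal sketch would look like:

resource "cloudstack_instance" "test" {
  # ... existing arguments ...

  # Force this instance to wait for the ICMP rule, even without a reference
  depends_on = [cloudstack_security_group_rule.Default-SG-ICMP-Ruleset]
}
Explicit depends_on (hypothetical example)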

To make the machine use this security group, we must add it to its instance definition.

resource "cloudstack_instance" "test" {
...
  expunge            = true
  security_group_ids = [cloudstack_security_group.Default-SG.id]
  
  connection {
    type        = "ssh"
...
  }
Adding the security_group_ids

Note that changing the security group of an instance results in replacing the machine.

Once a VM is assigned to a security group, it remains in that group for its entire lifetime; you can not move a running VM from one security group to another.

Which I find annoying.

Terraform will perform the following actions:

  # cloudstack_instance.test must be replaced
-/+ resource "cloudstack_instance" "test" {
      ~ display_name       = "test" -> (known after apply)
      + group              = (known after apply)
      ~ id                 = "9c9525dc8533-2592-450d-a774-g56cf51f" -> (known after apply)
      ~ ip_address         = "5.6.7.8" -> (known after apply)
        name               = "test"
      + project            = (known after apply)
      ~ root_disk_size     = 8 -> (known after apply)
      + security_group_ids = [
          + "ef6c8192-2795-440c-8774-1be8a969afd1",
        ] # forces replacement
      ~ tags               = {} -> (known after apply)
        # (7 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.
Terraform apply

Applying the new configuration sets up a new machine with the changed security group ID. We can ping it and reach the SSH port, but we cannot log in yet.

Adding keys to access the machine

To gain SSH access to the server we just created, we have to give CloudStack a key pair to include when bootstrapping the machine.

I've created an RSA key pair using ssh-keygen -t rsa and added the following to main.tf. Of course, you can also use ~/.ssh/id_rsa.pub.

resource "cloudstack_instance" "test" {
...
  zone               = "zone.ams.net"
  keypair            = cloudstack_ssh_keypair.testkey.id # This line
  expunge            = true
  security_group_ids = [cloudstack_security_group.Default-SG.id]
...
}

resource "cloudstack_ssh_keypair" "testkey" {
  name       = "testkey"
  public_key = "${file("test_rsa.pub")}"
}
Add SSH keys to CloudStack.
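
If you'd rather reuse an existing key, a variant of the same resource could look like this; it assumes your public key lives at ~/.ssh/id_rsa.pub:

resource "cloudstack_ssh_keypair" "testkey" {
  name       = "testkey"
  # pathexpand() resolves the ~ to your home directory
  public_key = file(pathexpand("~/.ssh/id_rsa.pub"))
}
Reusing an existing public key (alternative)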

Adding the key after the machine is created should be possible, but something goes wrong every time I update it. I don't believe that feature is working correctly right now, so I decided to destroy and re-apply everything.

Now I'm able to SSH into the machine using my test_rsa key. Let's set up the requirements for an RKE cluster.

Installing the required packages

I want to provision the server automatically with the needed Docker packages. We could use Ansible for this, or have a separate process that builds perfect images with Packer, but let's stick to Terraform.

I've added the following to the cloudstack_instance resource in my main.tf:

  connection {
    type        = "ssh"
    user        = "root"
    private_key = file("test_rsa")
    host        = self.ip_address
  }
  
  provisioner "remote-exec" {
    inline  = ["curl https://releases.rancher.com/install-docker/20.10.sh | sh"]
  }
Adding remote-exec provisioner

Terraform will not run this provisioner on the machine that already exists; provisioners only run when a resource is created. But don't worry, we don't have to fall back to manually logging in and running the commands. Let's just terraform destroy and terraform apply again :)
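
If you'd rather not tear everything down, newer Terraform releases can also recreate just this one resource instead of the full destroy/apply cycle:

terraform apply -replace=cloudstack_instance.test
Replacing only the instance (alternative)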

You'll see Terraform trying to connect over SSH before the machine has finished booting, but once it has, the preparation script from Rancher runs immediately and installs Docker.

Setting up RKE

Terraform can set up an RKE cluster on the machine you just created using the RKE provider. This setup will be a single node RKE cluster. I've made another file named rke.tf which contains the following:

provider "rke" {
  debug = true
  log_file = "rke_debug.log"
}

resource "rke_cluster" "cluster_local" {
  nodes {
    address = cloudstack_instance.test.ip_address
    user    = "root"
    role    = ["controlplane", "worker", "etcd"]
    ssh_key = file("test_rsa")
  }
}
Adding RKE provider config

I've also added the following to main.tf:

terraform {
  required_providers {
    cloudstack = {
      source = "cloudstack/cloudstack"
      version = "0.4.0"
    }
    # This part
    rke = {
      source = "rancher/rke"
      version = "1.3.0"
    }
  }
}
Adding RKE provider download

After that, you'll need to rerun terraform init to fetch the new provider.

When you run terraform apply now, you'll notice it says it wants to install an RKE cluster using Rancher's hyperkube version v1.21.7-rancher1-1. To use a newer version, you'll have to update a dependency in the RKE provider, but I'll explain how to do that in a separate blog post.

You'll also notice that the apply fails with an error:

Failed running cluster err:[network] Can't access KubeAPI port [6443] on Control Plane host: 4.5.6.7

The RKE provider can't connect to the machine's port 6443. Let's fix that by adding the port to the Default-SG-Home-SSH-Ruleset in security_groups.tf:

resource "cloudstack_security_group_rule" "Default-SG-Home-Ruleset" {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = ["1.2.3.4/32"]
    protocol  = "tcp"
    ports     = ["22", "6443"]
  }
}
Adding 6443 to rules

Now RKE should install just fine. If not, destroy and re-apply. If you keep having random issues, check the available disk space and rke_debug.log.

Getting the kubeconfig.yaml

Of course, we want to access the RKE cluster from our terminal. We can dig the kubeconfig YAML out of the state with terraform show -json, but that's rather clunky.

marco@DESKTOP-WS:~/tests$ terraform show -json | jq '.values["root_module"]["resources"][] | select(.address == "rke_cluster.cluster_local") | .values.kube_config_yaml' -r
apiVersion: v1
kind: Config
clusters:
- cluster:
    api-version: v1
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0F....
Extracting using JSON

We can automate that away using the local_sensitive_file resource from the hashicorp/local Terraform provider. Add the following to rke.tf:

resource "local_sensitive_file" "kube_config_yaml" {
  content = rke_cluster.cluster_local.kube_config_yaml
  filename = "kubeconfig.yaml"
}
local_sensitive_file

And update main.tf with the new provider:

terraform {
  required_providers {
...
    local = {
      source = "hashicorp/local"
      version = "2.2.2"
    }
  }
}
main.tf Adding the local provider

Don't forget to run terraform init!

Running terraform apply writes the kubeconfig.yaml to the local filesystem. You can now talk to the RKE cluster.

marco@DESKTOP-WS:~/tests$ export KUBECONFIG=kubeconfig.yaml
marco@DESKTOP-WS:~/tests$ kubectl get nodes
NAME            STATUS   ROLES                      AGE   VERSION
5.6.7.8         Ready    controlplane,etcd,worker   23m   v1.21.7

Installing Rancher

Finally, after all this writing and five iterations of the RKE machine, we're ready to install Rancher. To do this, we'll be using the hashicorp/helm and rancher/rancher2 providers.

Add the providers to main.tf. Also, define the location of kubeconfig.yaml:

terraform {
  required_providers {
...
    rancher2 = {
      source = "rancher/rancher2"
      version = "1.22.2"
    }
    helm = {
      source = "hashicorp/helm"
      version = "2.4.1"
    }
  }
}

provider "helm" {
  kubernetes {
    config_path = "kubeconfig.yaml"
  }
}
Adding helm config to main.tf

Add ports 80 and 443 to security_groups.tf, otherwise you won't be able to reach Rancher and Terraform won't be able to bootstrap it.
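
A sketch of such a rule, mirroring the earlier rules (the resource name Default-SG-Web-Ruleset is just a suggestion; tighten cidr_list if you don't want the UI reachable from everywhere):

resource "cloudstack_security_group_rule" "Default-SG-Web-Ruleset" {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = ["0.0.0.0/0"]
    protocol  = "tcp"
    ports     = ["80", "443"]
  }
}
Opening 80 and 443 for Rancher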

cert-manager is a dependency of Rancher, so create a new file called certmanager.tf:

resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  namespace        = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  version          = "1.5.3"

  wait             = true
  create_namespace = true
  force_update     = true
  replace          = true

  set {
    name  = "installCRDs"
    value = true
  }
}
Adding helm install to certmanager.tf

You can use set blocks to override values just as you would in a values.yaml.

Next, create a file called rancher.tf :

resource "helm_release" "rancher" {
  name = "rancher"
  namespace = "cattle-system"
  chart = "rancher"
  repository = "https://releases.rancher.com/server-charts/latest"
  depends_on = [helm_release.cert_manager]

  wait             = true
  create_namespace = true
  force_update     = true
  replace          = true

  set {
    name  = "hostname"
    value = "rancher.debugdomain.com"
  }

  set {
    name  = "ingress.tls.source"
    value = "rancher"
  }
  
  set {
    name  = "bootstrapPassword"
    value = "A-Random-Password"
  }

  set {
    name  = "rancherImageTag"
    value = "v2.6.3-patch1"
  }
}

provider "rancher2" {
  alias = "bootstrap"

  api_url   = "https://rancher.debugdomain.com"
  insecure  = true
  bootstrap = true
}

# Create a new rancher2_bootstrap using bootstrap provider config
resource "rancher2_bootstrap" "admin" {
  provider = rancher2.bootstrap
  depends_on = [helm_release.rancher]
  initial_password = "A-Random-Password"
  # New password will be generated and saved in statefile
  telemetry = false
}

# Provider config for admin
provider "rancher2" {
  alias = "admin"

  api_url = rancher2_bootstrap.admin.url
  token_key = rancher2_bootstrap.admin.token
  insecure = true
}
All rancher.tf config

The rancher.tf is one of the bigger Terraform files. Here we use the Rancher provider and define:

  • The Helm installation of Rancher
  • Where the Rancher cluster will be
  • A bootstrap provider for Rancher
  • An admin provider for Rancher

If you've opened up the security groups wide, choose a unique, strong password for the initial Rancher Helm deployment.
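
If you want to avoid hard-coding a password at all, one option (not used in the rest of this post) is to let Terraform generate it with the hashicorp/random provider and reference random_password.rancher_bootstrap.result instead of A-Random-Password in both the Helm set block and the rancher2_bootstrap resource. You'd also need to add hashicorp/random to required_providers and rerun terraform init. A minimal sketch:

resource "random_password" "rancher_bootstrap" {
  length  = 24
  special = false # avoid characters that tend to upset Helm value parsing
}
Generating the bootstrap password (optional)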

We override the Rancher version to get the latest patches, as this is not the default.

Using the alias attribute, we can define multiple configurations of the same provider. This way, we keep the admin configuration separate from the bootstrap one.

Once we run terraform apply, we'll see the Rancher server being created.

We can access the generated password by running:

terraform show -json \
  | jq '.values["root_module"]["resources"][]
  | select(.address == "rancher2_bootstrap.admin") | .values.password' -r
Using JSON to extract password

But we can also ask Terraform to write it down:

resource "local_sensitive_file" "rancher-password" {
  content = rancher2_bootstrap.admin.password
  filename = "rancher_password"
}
Write password to sensitive file

Unforeseen dependency problems

To test this script, we can now run terraform destroy and terraform apply. It will immediately tell you that kubeconfig.yaml does not exist. The file is missing because Terraform hasn't created the cluster yet: growing a Terraform configuration step by step can introduce unwanted ordering dependencies, and here the Helm provider needs the kubeconfig file while that file is only written after the provider has already been initialized. There is a lot more on this subject in this GitHub issue.

To fix this problem, I've moved a lot of things around. I made three directories:

  • Cloud
  • RKE
  • Rancher

I've moved everything CloudStack-related to Cloud, and so forth.

.
├── cloud
│   ├── instances.tf
│   ├── main.tf
│   └── security_groups.tf
├── rancher
│   ├── certmanager.tf
│   ├── main.tf
│   └── rancher.tf
├── rke
│   ├── main.tf
│   ├── rke.tf
│   └── rke_debug.log
├── test_rsa
└── test_rsa.pub
Directory tree

Breaking apart the monolith

Having everything in one Terraform configuration causes dependency troubles. It also means you can't scope privileges to a particular layer of your infrastructure: with separate configurations, some people could manage CloudStack, others RKE and Rancher. Breaking the config into small pieces that each do only what they're supposed to creates more flexibility, and it looks a lot cleaner too.

Cloud

I've changed main.tf and moved the RKE and Rancher provider configuration to the main.tf of their respective directories. Another change is having Cloud write an output after each run, exporting the IP address for the others to use.

output "ip_address" {
  value = cloudstack_instance.test.ip_address
}
Output info of the Terraform run and save it in terraform.tfstate

I've also changed all pointers to test_rsa to point to ../test_rsa.

RKE

RKE now has to know what Cloud's output was. To do this, add this small config to RKE's main.tf so it can read Cloud's state:

data "terraform_remote_state" "cloud" {
  backend = "local" 
  config = {
    path    = "../cloud/terraform.tfstate"
  }
}
Read output data from the remote terraform.tfstate

Change cloudstack_instance.test.ip_address to data.terraform_remote_state.cloud.outputs.ip_address in rke.tf.

Change the test_rsa path to ../test_rsa. You should do the same with the pub file.
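
With both changes in place, the nodes block in rke/rke.tf would look roughly like this:

resource "rke_cluster" "cluster_local" {
  nodes {
    # IP address now read from Cloud's state instead of a direct resource reference
    address = data.terraform_remote_state.cloud.outputs.ip_address
    user    = "root"
    role    = ["controlplane", "worker", "etcd"]
    ssh_key = file("../test_rsa")
  }
}
rke.tf after the split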

Rancher

The only change needed here is to point to the correct location of kubeconfig.yaml, which is now ../rke/kubeconfig.yaml.
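
In the Helm provider block of rancher/main.tf, that boils down to:

provider "helm" {
  kubernetes {
    config_path = "../rke/kubeconfig.yaml"
  }
}
Pointing Helm at the RKE kubeconfig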

Testing it again

To test the complete setup, enter the cloud directory first and apply. Then move on to the rke directory, and once RKE is set up, move to rancher and apply again.
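
In shell terms, the full test run looks something like this (each directory keeps its own state):

cd cloud && terraform init && terraform apply
cd ../rke && terraform init && terraform apply
cd ../rancher && terraform init && terraform apply
Applying the three configurations in order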

Conclusion

With the installation of Rancher, we've come to the end of this blog post. The next post will be about provisioning configuration for multiple servers efficiently and growing the Rancher instance. We'll also add an extra cluster to the Rancher instance.