Infrastructure

Talos Linux on Proxmox: Infrastructure Automation and Cluster Bootstrap


Table of Contents

Environment Setup & Prerequisites
   Network Planning
   Management VM Deployment
   Tool Requirements
Creating VM Templates in Proxmox
   Ubuntu Template for Management VM
   Talos Template
Building the Terraform Infrastructure Modules
   Talos Node Module
   Ubuntu Cloud-Init Module
Deploying Your Infrastructure
   Setting Up the Terraform Project Structure
   Complete Tutorial Environment Configuration
   Deploy Your Complete Infrastructure
Manual Cluster Bootstrap
Conclusion


Welcome to the next part of our Talos Linux on Proxmox series! In the previous article, we explored the theoretical foundations. Now we’ll build the automation infrastructure that makes our Talos cluster deployment reliable and repeatable.

This part focuses on the foundation: creating robust automation that provisions infrastructure and bootstraps a fully functional Kubernetes cluster. By the end, you’ll have a working 5-node Talos cluster with high-availability control plane and modern networking.

In the following article, we’ll add the platform components (storage, ingress, TLS) that make it production-ready.

Environment Setup & Prerequisites

Before we start building our automation, let’s set up the foundation. Our approach uses a dedicated management VM that serves as the command center for all cluster operations.

Network Planning

Network planning is one of the most environment-specific aspects of this deployment and will vary significantly based on your infrastructure setup. In our case, we have control over a Proxmox server with access to network configuration, DHCP reservations, and the ability to allocate specific IP ranges for our Kubernetes cluster. Your setup might be different as you may be working with existing DHCP servers, managed switches, or cloud environments with different networking constraints.

Taking that into consideration, plan your network architecture carefully before anything else. You’ll need to allocate IP addresses for several components within your existing network range. As a reference, these are the components we had to consider for our cluster:

# Network Architecture (Example: 192.168.1.0/24)
VIP (control plane):   <your-vip-ip>          # Example: 192.168.1.10
Control Planes:        <range-for-cp-nodes>   # Example: 192.168.1.24-26  
Workers:               <range-for-workers>    # Example: 192.168.1.72-73
Management VM:         <mgmt-vm-ip>           # Example: 192.168.1.200
MetalLB Pool:          <loadbalancer-range>   # Example: 192.168.1.120-130
Gateway:               <your-gateway-ip>      # Example: 192.168.1.254

Key Planning Considerations:

This allocation strategy prevents IP conflicts and provides clear service boundaries for the different cluster components. Make sure the static ranges above (nodes, VIP, MetalLB pool) sit outside your DHCP server’s dynamic lease range, or leases will eventually collide with them.

Management VM Deployment

The management VM is crucial to our automation strategy. It runs Ubuntu and hosts all the tools needed for cluster operations. While we recommend a dedicated management VM for operational consistency and network proximity to the cluster, it is not strictly necessary: you could instead install the tools (kubectl, talosctl, Helm) on your local machine or on any server with network access to the cluster. The dedicated VM simply gives you consistent tooling, lower network latency, and an isolated operational environment. For the rest of this article, we assume a management VM was created and that all steps are performed through it. Here’s how we automate its creation using cloud-init:

Important Configuration Note: replace the <your-username> and <your-ssh-public-key> placeholders below (including the one in the usermod command) before uploading the file; cloud-init will not substitute them for you.

# configs/cloud-init/management-vm-cloud-init.yml
#cloud-config
users:
  - name: <your-username>                    # Replace with your desired username
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - <your-ssh-public-key>                # Replace with your SSH public key

packages:
  - curl
  - wget
  - git
  - python3-pip
  - software-properties-common
  - apt-transport-https
  - ca-certificates
  - gnupg
  - lsb-release

runcmd:
  # Install Docker
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
  - echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
  - apt-get update
  - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
  - usermod -aG docker <your-username>
  
  # Install kubectl
  - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  - install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
  
  # Install talosctl
  - curl -sL https://talos.dev/install | sh
  
  # Install Helm
  - curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | tee /usr/share/keyrings/helm.gpg > /dev/null
  - echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | tee /etc/apt/sources.list.d/helm-stable-debian.list
  - apt-get update
  - apt-get install -y helm

package_update: true
package_upgrade: true

This cloud-init configuration creates a fully equipped management environment with all necessary tools pre-installed.

Tool Requirements

Our automation stack requires several tools working together. These are split between your local machine (where you’ll run the deployment) and the management VM (which will be deployed in your Proxmox environment):

Required on Your Local Machine:

  - Terraform (to provision the VMs through the Proxmox API)
  - An SSH client and key pair (for the Proxmox host and the management VM)
  - scp or similar (to upload the cloud-init snippet to Proxmox)

Automatically Installed on Management VM (via cloud-init):

  - Docker (container runtime for local tooling)
  - kubectl (Kubernetes CLI)
  - talosctl (Talos management CLI)
  - Helm (Kubernetes package manager)

The management VM comes pre-configured with all Kubernetes-related tools through our cloud-init script, eliminating manual setup complexity. You’ll SSH into this VM to perform cluster operations after the initial infrastructure deployment.
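Once the VM is deployed later in this guide, a quick loop run over SSH confirms that cloud-init finished installing the stack; it only checks that each binary is on PATH:

```shell
# Sanity check on the management VM: confirm the cloud-init tool installs completed
for tool in docker kubectl talosctl helm; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: OK"
  else
    echo "$tool: MISSING (cloud-init may still be running; check 'cloud-init status')"
  fi
done
```

If anything reports MISSING shortly after first boot, give cloud-init a few minutes and re-run the loop.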

Creating VM Templates in Proxmox

Before we can use Terraform to deploy VMs, we need to create reusable templates in Proxmox. Templates allow us to rapidly deploy consistent VM configurations and are essential for our automation approach.

Ubuntu Template for Management VM

Download Ubuntu Cloud Image:

# SSH to your Proxmox host and download the Ubuntu cloud image
wget https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img

Create VM and Convert to Template:

# Create a new VM (ID 9000 in this example)
qm create 9000 --name ubuntu-22.04-template --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Import the disk image
qm importdisk 9000 ubuntu-22.04-server-cloudimg-amd64.img local-lvm

# Attach the disk to the VM
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0

# Add cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Configure boot order
qm set 9000 --boot c --bootdisk scsi0

# Add serial console
qm set 9000 --serial0 socket --vga serial0

# Convert to template
qm template 9000

Talos Template

For the Talos template, we’ll use the generic Talos ISO rather than our custom factory image. While we could create a template with the custom image (which includes Longhorn extensions), keeping the template generic provides more flexibility. We’ll apply the custom image during the actual cluster configuration phase, allowing us to easily switch between different custom images or Talos versions without recreating templates.

Download Talos ISO:

# Download a Talos ISO to Proxmox local storage.
# Match the Talos version you plan to install (v1.10.4 in this series);
# the release asset is named metal-amd64.iso, so we save it under the
# name our Terraform module expects.
cd /var/lib/vz/template/iso
wget -O talos-amd64.iso \
  https://github.com/siderolabs/talos/releases/download/v1.10.4/metal-amd64.iso

Create Talos Template:

# Create VM for Talos template (ID 9001 in this example)
qm create 9001 --name talos-template --memory 4096 --cores 2 --net0 virtio,bridge=vmbr0

# Add disk for Talos installation
qm set 9001 --scsi0 local-lvm:20

# Set SCSI controller
qm set 9001 --scsihw virtio-scsi-pci

# Configure boot and other settings
qm set 9001 --boot c --bootdisk scsi0
qm set 9001 --ostype l26
qm set 9001 --agent 0

# Convert to template
qm template 9001

Key Template Considerations:

  - The Talos template boots from the generic ISO; the custom factory image is applied later through the machine configuration.
  - The QEMU guest agent is disabled on the Talos template (--agent 0) because stock Talos does not ship the agent.
  - Keep the template names (ubuntu-22.04-template, talos-template) consistent with the values you pass to Terraform.

These templates serve as the foundation for our Terraform automation, allowing rapid deployment of consistent VM configurations.

Building the Terraform Infrastructure Modules

Our infrastructure automation is built around modular Terraform code that can provision both Talos nodes and management VMs. This modular approach allows us to reuse components across different environments (dev, staging, production) while maintaining consistency.

Talos Node Module

The Talos node module is the core building block for our Kubernetes cluster. It creates VMs optimized for running Talos Linux with specific configurations for both control plane and worker nodes.

# terraform/modules/talos-node/main.tf
resource "proxmox_vm_qemu" "talos_node" {
  name        = var.node_prefix
  target_node = var.target_node
  clone       = var.template_name
  full_clone  = true
  boot        = "c"
  bootdisk    = "scsi0"

  # Storage Configuration
  disk {
    slot    = "scsi0"
    type    = "disk"
    storage = var.disk_storage
    size    = var.disk_size
    discard = true
  }

  # Talos Installation ISO
  disk {
    slot = "ide2"
    type = "cdrom"
    iso  = "local:iso/talos-amd64.iso"
  }

  # VM Configuration
  os_type = "cloud-init"
  ciuser  = "talos"
  memory  = var.memory
  scsihw  = "virtio-scsi-pci"
  agent   = 0
  onboot  = true

  # CPU Configuration
  cpu {
    type    = "host"
    cores   = var.cores
    sockets = var.sockets
  }

  # Network Configuration
  network {
    id     = 0
    model  = "virtio"
    bridge = var.network_bridge
  }
}

Key Design Decisions:

Storage Strategy: We allocate 20GB for each node, which provides sufficient space for the Talos OS, container images, and temporary storage. The discard option enables TRIM support for better SSD performance.

CPU Configuration: Using type = "host" passes through the host CPU features to the VM, providing optimal performance. This is particularly important for Talos as it leverages hardware-specific optimizations.

Network Setup: The virtio network model offers the best performance for virtualized environments. The network bridge (vmbr0) connects VMs to the physical network.

Talos ISO Attachment: The ISO is attached to facilitate the initial Talos installation process. This gets replaced by the actual Talos configuration during deployment.

Complete Talos Node Module Files

terraform/modules/talos-node/variables.tf:

variable "node_prefix" {
  type        = string
  description = "Prefix for the VM name"
}

variable "target_node" {
  type        = string
  description = "Proxmox node where the VM will be created"
}

variable "template_name" {
  type        = string
  description = "Name of the Talos template to clone from"
}

variable "memory" {
  type        = number
  description = "Amount of memory in MB"
  default     = 4096
}

variable "cores" {
  type        = number
  description = "Number of CPU cores"
  default     = 2
}

variable "sockets" {
  type        = number
  description = "Number of CPU sockets"
  default     = 1
}

variable "network_bridge" {
  type        = string
  description = "Network bridge to connect to"
  default     = "vmbr0"
}

variable "disk_size" {
  type        = string
  description = "Size of the boot disk"
  default     = "20G"
}

variable "disk_storage" {
  type        = string
  description = "Storage location for disks"
  default     = "local-lvm"
}

terraform/modules/talos-node/outputs.tf:

output "vm_id" {
  description = "ID of the created VM"
  value       = proxmox_vm_qemu.talos_node.vmid
}

output "vm_name" {
  description = "Name of the created VM"
  value       = proxmox_vm_qemu.talos_node.name
}

output "vm_ipv4_address" {
  description = "IPv4 address of the VM (DHCP assigned)"
  value       = proxmox_vm_qemu.talos_node.default_ipv4_address
}

output "vm_network_interfaces" {
  description = "Network interface configuration"
  value       = proxmox_vm_qemu.talos_node.network
}

Ubuntu Cloud-Init Module

The Ubuntu cloud-init module creates our management VM, which serves as the operational command center for the entire cluster.

# terraform/modules/ubuntu-cloud-init-vm/main.tf
resource "proxmox_vm_qemu" "management_vm" {
  name        = var.vm_name
  target_node = var.target_node
  clone       = var.template_name
  full_clone  = true

  # VM Resources
  memory   = var.memory
  scsihw   = "virtio-scsi-pci"
  agent    = 1
  onboot   = true
  boot     = "c"
  bootdisk = "scsi0"

  cpu {
    type    = "host"
    cores   = var.cores
    sockets = var.sockets
  }

  # Storage Configuration
  disk {
    slot    = "scsi0"
    type    = "disk"
    storage = var.disk_storage
    size    = var.disk_size
    discard = true
  }

  # Cloud-init drive
  disk {
    slot    = "ide2"
    type    = "cloudinit"
    storage = var.disk_storage
  }

  # Network configuration with static IP
  network {
    id     = 0
    model  = "virtio"
    bridge = var.network_bridge
  }

  # Cloud-init configuration
  os_type   = "cloud-init"
  ipconfig0 = "ip=${var.ip_address}/24,gw=${var.gateway_ip}"
  
  # Reference to our cloud-init snippet (uploaded to /var/lib/vz/snippets/)
  cicustom = "user=local:snippets/${var.cloud_init_snippet}"

  # Serial console for debugging
  serial {
    id   = 0
    type = "socket"
  }
  
  vga {
    type = "serial0"
  }
}

Key Features:

Static IP Assignment: Unlike Talos nodes that start with DHCP, the management VM gets a static IP immediately through cloud-init configuration.

Cloud-Init Integration: The cicustom parameter references our cloud-init snippet stored in Proxmox’s local storage at /var/lib/vz/snippets/. This enables automated tool installation and user setup.

QEMU Guest Agent: Enabled (agent = 1) for better VM management and monitoring capabilities from Proxmox.

Serial Console: Configured for troubleshooting scenarios where network access might not be available.

Complete Ubuntu Cloud-Init Module Files

terraform/modules/ubuntu-cloud-init-vm/variables.tf:

variable "vm_name" {
  type        = string
  description = "Name of the VM"
}

variable "target_node" {
  type        = string
  description = "Proxmox node where the VM will be created"
}

variable "template_name" {
  type        = string
  description = "Name of the Ubuntu template to clone from"
}

variable "memory" {
  type        = number
  description = "Amount of memory in MB"
  default     = 4096
}

variable "cores" {
  type        = number
  description = "Number of CPU cores"
  default     = 2
}

variable "sockets" {
  type        = number
  description = "Number of CPU sockets"
  default     = 1
}

variable "disk_size" {
  type        = string
  description = "Size of the boot disk"
  default     = "20G"
}

variable "ip_address" {
  type        = string
  description = "Static IP address for the VM"
}

variable "gateway_ip" {
  type        = string
  description = "Gateway IP address"
}

variable "network_bridge" {
  type        = string
  description = "Network bridge to connect to"
  default     = "vmbr0"
}

variable "cloud_init_snippet" {
  type        = string
  description = "Name of the cloud-init snippet file"
  default     = "management-vm-cloud-init.yml"
}

variable "disk_storage" {
  type        = string
  description = "Storage location for disks"
  default     = "local-lvm"
}

terraform/modules/ubuntu-cloud-init-vm/outputs.tf:

output "vm_id" {
  description = "ID of the created VM"
  value       = proxmox_vm_qemu.management_vm.vmid
}

output "vm_name" {
  description = "Name of the created VM"
  value       = proxmox_vm_qemu.management_vm.name
}

output "vm_ipv4_address" {
  description = "IPv4 address of the VM"
  value       = var.ip_address
}

output "ssh_connection" {
  description = "SSH connection command"
  value       = "ssh ${var.vm_name}@${var.ip_address}"
}

Deploying Your Infrastructure

Setting Up the Terraform Project Structure

First, let’s outline the directory structure for your Talos cluster project:

talos-cluster-tutorial/
├── terraform/
│   ├── modules/
│   │   ├── talos-node/              # Reusable Talos VM module
│   │   │   ├── main.tf              # VM resource definition (shown earlier)
│   │   │   ├── variables.tf         # Module inputs
│   │   │   └── outputs.tf           # Module outputs
│   │   └── ubuntu-cloud-init-vm/    # Reusable management VM module (shown earlier)
│   │       ├── main.tf              # Ubuntu VM with cloud-init
│   │       ├── variables.tf         # Module inputs
│   │       └── outputs.tf           # Module outputs
│   └── environments/
│       └── tutorial/                # Complete tutorial deployment
│           ├── main.tf              # Management VM + 5 Talos VMs
│           ├── providers.tf         # Proxmox provider configuration
│           ├── variables.tf         # All configuration variables
│           ├── outputs.tf           # All deployment outputs
│           └── terraform.tfvars     # Your settings (you'll create)
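The whole skeleton can be scaffolded in one command before you start filling in the files:

```shell
# Create the directory layout shown above (files are added in the next steps)
mkdir -p \
  talos-cluster-tutorial/terraform/modules/talos-node \
  talos-cluster-tutorial/terraform/modules/ubuntu-cloud-init-vm \
  talos-cluster-tutorial/terraform/environments/tutorial
```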

Complete Tutorial Environment Configuration

Now let’s create all the Terraform files for the tutorial environment that will deploy both the management VM and all 5 Talos cluster nodes.

terraform/environments/tutorial/providers.tf:

terraform {
  required_providers {
    proxmox = {
      source  = "telmate/proxmox"
      version = "3.0.2-rc01"
    }
  }
}

variable "pm_api_url" {
  type        = string
  description = "Proxmox API URL"
}

variable "pm_user" {
  type        = string
  description = "Proxmox username"
}

variable "pm_password" {
  type        = string
  description = "Proxmox password"
  sensitive   = true
}

provider "proxmox" {
  pm_api_url      = var.pm_api_url
  pm_user         = var.pm_user
  pm_password     = var.pm_password
  pm_tls_insecure = true
}

terraform/environments/tutorial/variables.tf:

# Proxmox Configuration
variable "target_node" {
  type        = string
  description = "Proxmox node name where VMs will be created"
}

variable "ubuntu_template" {
  type        = string
  description = "Name of the Ubuntu cloud-init template"
  default     = "ubuntu-22.04-template"
}

variable "talos_template" {
  type        = string
  description = "Name of the Talos template"
  default     = "talos-template"
}

# Network Configuration
variable "management_ip" {
  type        = string
  description = "Static IP for management VM"
  default     = "192.168.1.200"
}

variable "gateway_ip" {
  type        = string
  description = "Network gateway IP"
  default     = "192.168.1.254"
}

variable "network_bridge" {
  type        = string
  description = "Proxmox network bridge"
  default     = "vmbr0"
}

# Cluster Sizing
variable "controlplane_count" {
  type        = number
  description = "Number of control plane nodes"
  default     = 3
  validation {
    condition     = var.controlplane_count >= 1 && var.controlplane_count % 2 == 1
    error_message = "Control plane count must be an odd number (1, 3, 5, etc.) for proper etcd quorum."
  }
}

variable "worker_count" {
  type        = number
  description = "Number of worker nodes"
  default     = 2
}

# Resource Configuration
variable "management_vm_memory" {
  type        = number
  description = "Memory for management VM in MB"
  default     = 4096
}

variable "management_vm_cores" {
  type        = number
  description = "CPU cores for management VM"
  default     = 2
}

variable "controlplane_memory" {
  type        = number
  description = "Memory for control plane nodes in MB"
  default     = 4096
}

variable "controlplane_cores" {
  type        = number
  description = "CPU cores for control plane nodes"
  default     = 2
}

variable "worker_memory" {
  type        = number
  description = "Memory for worker nodes in MB"
  default     = 8192
}

variable "worker_cores" {
  type        = number
  description = "CPU cores for worker nodes"
  default     = 4
}

# Storage Configuration
variable "disk_storage" {
  type        = string
  description = "Proxmox storage location for VM disks"
  default     = "local-lvm"
}
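The odd-number validation above reflects etcd quorum arithmetic: a cluster of n members tolerates floor((n - 1) / 2) member failures, so an even member count adds cost without adding fault tolerance. A quick loop makes this concrete:

```shell
# etcd fault tolerance: floor((n - 1) / 2) members can fail and quorum survives
for n in 1 2 3 4 5; do
  echo "members=$n tolerated_failures=$(( (n - 1) / 2 ))"
done
```

Note that 4 members tolerate no more failures than 3, which is exactly why the validation rejects even counts.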

terraform/environments/tutorial/main.tf:

# Management VM
module "management_vm" {
  source = "../../modules/ubuntu-cloud-init-vm"

  vm_name           = "cluster-mgmt"
  target_node       = var.target_node
  template_name     = var.ubuntu_template
  memory            = var.management_vm_memory
  cores             = var.management_vm_cores
  ip_address        = var.management_ip
  gateway_ip        = var.gateway_ip
  network_bridge    = var.network_bridge
  cloud_init_snippet = "management-vm-cloud-init.yml"
  disk_storage      = var.disk_storage
}

# Control Plane Nodes
module "control_plane_nodes" {
  count  = var.controlplane_count
  source = "../../modules/talos-node"

  node_prefix    = "tutorial-cp-${count.index + 1}"
  target_node    = var.target_node
  template_name  = var.talos_template
  memory         = var.controlplane_memory
  cores          = var.controlplane_cores
  network_bridge = var.network_bridge
  disk_storage   = var.disk_storage
}

# Worker Nodes
module "worker_nodes" {
  count  = var.worker_count
  source = "../../modules/talos-node"

  node_prefix    = "tutorial-worker-${count.index + 1}"
  target_node    = var.target_node
  template_name  = var.talos_template
  memory         = var.worker_memory
  cores          = var.worker_cores
  network_bridge = var.network_bridge
  disk_storage   = var.disk_storage
}

terraform/environments/tutorial/outputs.tf:

# Management VM Outputs
output "management_vm" {
  description = "Management VM information"
  value = {
    name       = module.management_vm.vm_name
    ip_address = module.management_vm.vm_ipv4_address
    ssh_command = module.management_vm.ssh_connection
  }
}

# Control Plane Outputs
output "control_plane_nodes" {
  description = "Control plane node information"
  value = {
    for i, node in module.control_plane_nodes : node.vm_name => {
      vm_id     = node.vm_id
      dhcp_ip   = node.vm_ipv4_address
      node_type = "controlplane"
    }
  }
}

# Worker Node Outputs
output "worker_nodes" {
  description = "Worker node information" 
  value = {
    for i, node in module.worker_nodes : node.vm_name => {
      vm_id     = node.vm_id
      dhcp_ip   = node.vm_ipv4_address
      node_type = "worker"
    }
  }
}

# Combined cluster output for easy reference
output "talos_cluster_nodes" {
  description = "All Talos cluster nodes"
  value = merge(
    {
      for i, node in module.control_plane_nodes : node.vm_name => {
        vm_id     = node.vm_id
        dhcp_ip   = node.vm_ipv4_address
        node_type = "controlplane"
      }
    },
    {
      for i, node in module.worker_nodes : node.vm_name => {
        vm_id     = node.vm_id
        dhcp_ip   = node.vm_ipv4_address
        node_type = "worker"
      }
    }
  )
}

Prerequisites and Setup

Before deploying, ensure you have the following prepared in your Proxmox environment:

VM Templates Required:

  - ubuntu-22.04-template (VM 9000, created earlier)
  - talos-template (VM 9001, created earlier)

Cloud-Init Snippet Upload: Upload the management VM cloud-init configuration to Proxmox:

# Copy the cloud-init file to your Proxmox host
scp configs/cloud-init/management-vm-cloud-init.yml root@<proxmox-ip>:/var/lib/vz/snippets/

# Verify the file is accessible
ssh root@<proxmox-ip> "ls -la /var/lib/vz/snippets/management-vm-cloud-init.yml"

Proxmox User Setup: Ensure you have a Terraform user created (as covered in the previous article):

# Create user with API access
pveum user add terraform@pve --password 'your-secure-password'
pveum aclmod / -user terraform@pve -role PVEAdmin

Deploy Your Complete Infrastructure

Now we’ll deploy everything in one go: the management VM plus all 5 Talos cluster nodes.

Step 1: Navigate to Tutorial Environment

cd terraform/environments/tutorial

Step 2: Configure Your Deployment

Create a terraform.tfvars file with your specific settings:

cat > terraform.tfvars << 'EOF'
# Proxmox connection settings
pm_api_url = "https://your-proxmox-ip:8006/api2/json"
pm_user = "terraform@pve"
pm_password = "your-secure-password"

# Infrastructure settings
target_node = "your-proxmox-node-name"  # e.g., "pve", "proxmox-01"
ubuntu_template = "ubuntu-22.04-template"  # Your Ubuntu template name
talos_template = "talos-template"       # Your Talos template name

# Network configuration
management_ip = "192.168.1.200"
gateway_ip = "192.168.1.254"

# Cluster sizing
controlplane_count = 3
worker_count = 2
EOF

Step 3: Initialize and Deploy Everything

# Initialize Terraform
terraform init

# Review what will be created (1 management VM + 5 cluster VMs)
terraform plan

# Deploy complete infrastructure
terraform apply

Expected Output:

Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

Outputs:
management_vm = {
  "ip_address" = "192.168.1.200"
  "name" = "cluster-mgmt"
}

talos_cluster_nodes = {
  "tutorial-cp-1" = {
    "dhcp_ip" = "192.168.1.xxx"
    "node_type" = "controlplane"
  }
  "tutorial-cp-2" = {
    "dhcp_ip" = "192.168.1.xxx"
    "node_type" = "controlplane"
  }
  "tutorial-cp-3" = {
    "dhcp_ip" = "192.168.1.xxx"
    "node_type" = "controlplane"
  }
  "tutorial-worker-1" = {
    "dhcp_ip" = "192.168.1.xxx"
    "node_type" = "worker"
  }
  "tutorial-worker-2" = {
    "dhcp_ip" = "192.168.1.xxx"
    "node_type" = "worker"
  }
}

Step 4: Verify Your Infrastructure

Check that all VMs are created and running:

# Test management VM access (wait 2-3 minutes for cloud-init)
ssh <your-username>@192.168.1.200

# Verify tools are installed
kubectl version --client
talosctl version --client
helm version

# From Proxmox web interface, verify all 6 VMs are running:
# - cluster-mgmt (management VM)
# - tutorial-cp-1, tutorial-cp-2, tutorial-cp-3 (control planes)
# - tutorial-worker-1, tutorial-worker-2 (workers)

Record DHCP IP Addresses: Note the DHCP IP addresses assigned to each Talos VM from the Terraform output; you’ll need these for the bootstrap steps below.
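Rather than copying addresses by hand, the JSON form of the Terraform output can be parsed into a name-to-IP list. A sketch (the sample nodes.json below stands in for real output, with illustrative IPs; in practice produce it with terraform output -json talos_cluster_nodes > nodes.json):

```shell
# Sample of what the JSON output looks like (illustrative values)
cat > nodes.json << 'EOF'
{
  "tutorial-cp-1":     {"vm_id": 101, "dhcp_ip": "192.168.1.51", "node_type": "controlplane"},
  "tutorial-worker-1": {"vm_id": 104, "dhcp_ip": "192.168.1.54", "node_type": "worker"}
}
EOF

# Print "name: dhcp_ip (role)" for every node
python3 << 'EOF'
import json

with open("nodes.json") as f:
    nodes = json.load(f)

for name, info in sorted(nodes.items()):
    print(f"{name}: {info['dhcp_ip']} ({info['node_type']})")
EOF
```

The printed lines can be pasted straight into the nodes.txt reference file used during bootstrap.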

Next Steps

With your infrastructure deployed, you’re ready to move on to:

  1. Configure Network Settings: Transform DHCP IPs to static configurations
  2. Bootstrap Kubernetes: Initialize the control plane and join nodes
  3. Install CNI: Deploy Cilium for pod networking
  4. Validate Cluster: Ensure everything is working correctly

The Talos VMs are currently running but not yet configured as a Kubernetes cluster.

Manual Cluster Bootstrap

Now we’ll manually bootstrap your running VMs into a functional Kubernetes cluster. This approach gives you complete control over each step and helps you understand the Talos configuration process in detail.

Step 1: Prepare Your Environment

SSH to your management VM to perform the bootstrap operations:

# SSH to your management VM  
ssh <your-username>@192.168.1.200

# Create a working directory
mkdir -p ~/talos-cluster && cd ~/talos-cluster

Step 2: Gather Node Information

First, collect the DHCP IP addresses assigned to your Talos VMs. You can find these from your Terraform output:

# Review your terraform output (run this on your local machine)
cd terraform/environments/tutorial
terraform output talos_cluster_nodes

Create a reference file with your node mappings:

# Create node mapping file
cat > nodes.txt << 'EOF'
# Control Plane Nodes (replace xxx with actual DHCP IPs)
tutorial-cp-1: 192.168.1.xxx -> 192.168.1.24
tutorial-cp-2: 192.168.1.xxx -> 192.168.1.25  
tutorial-cp-3: 192.168.1.xxx -> 192.168.1.26

# Worker Nodes
tutorial-worker-1: 192.168.1.xxx -> 192.168.1.72
tutorial-worker-2: 192.168.1.xxx -> 192.168.1.73

# VIP: 192.168.1.10
# Gateway: 192.168.1.254
EOF

Step 3: Generate Base Talos Configuration

Generate the initial cluster configuration using the custom Talos image with Longhorn extensions. This command creates the foundational configurations that we’ll customize for each node:

# Generate base configuration
talosctl gen config tutorial-cluster https://192.168.1.10:6443 \
  --output-dir . \
  --install-image factory.talos.dev/installer/ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515:v1.10.4

What this does:

  - Generates cluster-wide secrets and certificates for a cluster named tutorial-cluster
  - Sets the cluster endpoint to the VIP (https://192.168.1.10:6443)
  - Embeds the custom installer image so nodes install the factory build with the Longhorn extensions

This creates several files:

  - controlplane.yaml: base configuration for control plane nodes
  - worker.yaml: base configuration for worker nodes
  - talosconfig: client configuration for talosctl

About the Custom Image: The custom image URL (factory.talos.dev/installer/ce4c980...) was created using Talos Image Factory, which builds custom Talos OS images with additional extensions. This specific image includes the system extensions Longhorn needs, such as siderolabs/iscsi-tools and siderolabs/util-linux-tools.
To create your own custom image, visit factory.talos.dev, select your Talos version, choose extensions (like siderolabs/util-linux-tools), and generate the image URLs.

Step 4: Create Network Configuration Patches

Create network patches for each node to configure static IPs and the VIP. These patches modify the base configurations to set node-specific network settings, transforming nodes from DHCP to static IP configurations:

# Control plane 1 patch (example)
cat > tutorial-cp-1-patch.yaml << 'EOF'
machine:
  network:
    hostname: tutorial-cp-1
    interfaces:
      - interface: ens18
        addresses:
          - 192.168.1.24/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.254
        dhcp: false
        vip:
          ip: 192.168.1.10
EOF

What this patch does:

  - Sets a static hostname (tutorial-cp-1)
  - Assigns the static address 192.168.1.24/24 and a default route via the gateway
  - Disables DHCP on the interface
  - Configures the shared VIP (192.168.1.10) that floats between control plane nodes

Repeat this process for all nodes, creating patches with the appropriate hostnames and IP addresses:

Worker patch example (without VIP, with Longhorn mounts):

# Worker patch template
cat > tutorial-worker-1-patch.yaml << 'EOF'
machine:
  network:
    hostname: tutorial-worker-1
    interfaces:
      - interface: ens18
        addresses:
          - 192.168.1.72/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.254
        dhcp: false
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw
EOF

Key differences for worker nodes:

  - No vip block: the VIP is shared only among control plane nodes
  - An extra kubelet bind mount of /var/lib/longhorn, which Longhorn will use for replica storage in the next part
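Writing five near-identical patch files invites typos. As a sketch, a small shell helper (gen_cp_patch is our own name; the interface, gateway, and VIP values come from the example plan) can stamp out the remaining control plane patches; a worker variant would drop the vip block and add the Longhorn mount:

```shell
# Generate a control plane patch from a hostname and static IP
gen_cp_patch() {
  name=$1
  ip=$2
  cat > "${name}-patch.yaml" << EOF
machine:
  network:
    hostname: ${name}
    interfaces:
      - interface: ens18
        addresses:
          - ${ip}/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.254
        dhcp: false
        vip:
          ip: 192.168.1.10
EOF
}

gen_cp_patch tutorial-cp-2 192.168.1.25
gen_cp_patch tutorial-cp-3 192.168.1.26
```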

Step 5: Apply Network Patches

Generate the final configuration files for each node by applying the patches. This process merges the base configurations with node-specific network settings to create complete, ready-to-deploy configurations:

# Generate patched configuration for control plane 1 (example)
talosctl machineconfig patch controlplane.yaml \
  --patch @tutorial-cp-1-patch.yaml \
  > tutorial-cp-1.yaml

What this does:

  - Merges the shared controlplane.yaml base with the node-specific network patch
  - Produces tutorial-cp-1.yaml, a complete machine configuration ready to apply to that node

Repeat this process for all nodes:

Example commands for all nodes:

# Control planes (use controlplane.yaml as base)
talosctl machineconfig patch controlplane.yaml --patch @tutorial-cp-2-patch.yaml > tutorial-cp-2.yaml
talosctl machineconfig patch controlplane.yaml --patch @tutorial-cp-3-patch.yaml > tutorial-cp-3.yaml

# Workers (use worker.yaml as base)
talosctl machineconfig patch worker.yaml --patch @tutorial-worker-1-patch.yaml > tutorial-worker-1.yaml
talosctl machineconfig patch worker.yaml --patch @tutorial-worker-2-patch.yaml > tutorial-worker-2.yaml

Step 6: Apply Configurations to Nodes

Apply the configurations to each node using their current DHCP IPs. This pushes the complete configurations to the nodes, triggering them to reboot and adopt their new static IP settings:

# Apply configuration to control plane 1 (example - replace xxx with actual DHCP IP)
talosctl apply-config --endpoints 192.168.1.xxx \
  --nodes 192.168.1.xxx \
  --file tutorial-cp-1.yaml --insecure

What this does:

  - Connects to the node at its current DHCP address using the insecure maintenance API (only available before a configuration has been applied)
  - Uploads the full machine configuration; the node then installs Talos to disk and reboots with its static IP

Repeat this process for all nodes, replacing 192.168.1.xxx with the actual DHCP IP from your Terraform output:

# Apply to remaining control planes
talosctl apply-config --endpoints 192.168.1.xxx --nodes 192.168.1.xxx --file tutorial-cp-2.yaml --insecure
talosctl apply-config --endpoints 192.168.1.xxx --nodes 192.168.1.xxx --file tutorial-cp-3.yaml --insecure

# Apply to workers
talosctl apply-config --endpoints 192.168.1.xxx --nodes 192.168.1.xxx --file tutorial-worker-1.yaml --insecure
talosctl apply-config --endpoints 192.168.1.xxx --nodes 192.168.1.xxx --file tutorial-worker-2.yaml --insecure
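Once the DHCP addresses are recorded, the five apply-config invocations collapse into a loop. Shown here as a dry run (remove the leading "echo" to actually apply; the IPs below are illustrative):

```shell
# Dry run: print the apply-config command for every node
while read -r name ip; do
  echo talosctl apply-config --endpoints "$ip" --nodes "$ip" \
    --file "${name}.yaml" --insecure
done << 'EOF'
tutorial-cp-2 192.168.1.52
tutorial-cp-3 192.168.1.53
tutorial-worker-1 192.168.1.54
tutorial-worker-2 192.168.1.55
EOF
```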

Wait for all nodes to reboot with their new static IP configurations:

# Wait for nodes to reboot with static IPs (about 2-3 minutes)
echo "Waiting for nodes to reboot and configure static IPs..."
sleep 120
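A fixed sleep either wastes time or fires too early. A small polling helper (a sketch; wait_for_node is our own name, and nc is assumed to be available on the management VM) waits until a node's Talos API answers instead:

```shell
# Poll the Talos API port (50000) instead of sleeping a fixed time
wait_for_node() {
  ip=$1
  tries=${2:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    nc -z -w 2 "$ip" 50000 2> /dev/null && return 0
    sleep 2
    i=$((i + 1))
  done
  echo "timed out waiting for $ip" >&2
  return 1
}

# Example: wait_for_node 192.168.1.24 && echo "tutorial-cp-1 is up"
```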

What happens during this wait:

  - Each node writes the installer image to disk, reboots, and comes back up on its static IP
  - The insecure maintenance API is replaced by the authenticated Talos API (port 50000)

Step 7: Bootstrap the Cluster

Bootstrap the first control plane node to initialize Kubernetes. This creates the initial etcd cluster and starts the Kubernetes control plane:

# Set the talosconfig
export TALOSCONFIG=./talosconfig

# Bootstrap the first control plane node (using static IP)
talosctl bootstrap --endpoints 192.168.1.24 --nodes 192.168.1.24

# Wait for the cluster to initialize and VIP to become active
echo "Waiting for cluster initialization and VIP activation..."
sleep 90

What this does:

  - Initializes etcd on tutorial-cp-1, which becomes the first member of the cluster
  - The other control plane nodes discover it and join, forming a three-member etcd quorum
  - Once the control plane is healthy, the Talos-managed VIP (192.168.1.10) comes online

Step 8: Generate and Configure kubectl Access

Generate the kubeconfig file and verify cluster connectivity. This creates the credentials needed to manage your Kubernetes cluster:

# Generate kubeconfig using the VIP endpoint
talosctl kubeconfig --endpoints 192.168.1.10 --nodes 192.168.1.10

# Set the kubeconfig for current session
export KUBECONFIG=./kubeconfig

# Verify cluster is running (this may take a few minutes)
kubectl get nodes

What this does:

  - Retrieves admin credentials from the cluster and writes them to ./kubeconfig
  - Uses the VIP endpoint, so kubectl keeps working even if an individual control plane node goes down

At this point, you should see all your nodes, but they will be in “NotReady” status because we haven’t installed a CNI yet.

Step 9: Install Cilium CNI

Install Cilium for pod networking and observability. This provides the network fabric that allows pods to communicate across the cluster:

# Add Cilium Helm repository
helm repo add cilium https://helm.cilium.io/
helm repo update

# Install Cilium with eBPF networking
helm install cilium cilium/cilium --version 1.16.5 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=192.168.1.10 \
  --set k8sServicePort=6443 \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set operator.replicas=1 \
  --set securityContext.privileged=true \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup

# Wait for Cilium to be ready
echo "Waiting for Cilium to initialize..."
kubectl wait --for=condition=ready pod -l k8s-app=cilium -n kube-system --timeout=300s

What this does:

  - Replaces kube-proxy with Cilium’s eBPF datapath (kubeProxyReplacement=true)
  - Points Cilium at the API server through the VIP (k8sServiceHost/k8sServicePort)
  - Enables Hubble (relay and UI) for network observability
  - Sets the cgroup options Talos requires, since Talos already mounts the cgroup filesystem

Step 10: Verify Your Cluster

Check that everything is working correctly. These commands validate that your cluster is fully operational:

# Check node status
kubectl get nodes

# Check system pods
kubectl get pods -n kube-system

# Check Cilium status (if cilium CLI is installed)
cilium status --wait

What these commands verify:

  - All five nodes have registered and report Ready status
  - Core system pods (etcd, kube-apiserver, Cilium) are Running
  - Cilium reports a healthy datapath on every node

Expected Result:

NAME               STATUS   ROLES           AGE   VERSION
tutorial-cp-1      Ready    control-plane   8m    v1.33.x
tutorial-cp-2      Ready    control-plane   8m    v1.33.x
tutorial-cp-3      Ready    control-plane   8m    v1.33.x
tutorial-worker-1  Ready    <none>          7m    v1.33.x
tutorial-worker-2  Ready    <none>          7m    v1.33.x

Key indicators of success:

  - All nodes show Ready, which means the CNI is working
  - Control plane nodes carry the control-plane role; workers show <none>
  - The VERSION column reports the Kubernetes (kubelet) version, not the Talos version, and should match across all nodes

Conclusion

Though the process was long, we now have a fully functional, basic Kubernetes cluster running on Talos Linux. We covered everything from writing the Terraform modules that deploy the nodes to running every configuration step needed to bootstrap the cluster.

There’s still more to do: the manual steps above should be automated, and the cluster still needs a load balancer, certificate management, and the other components that make it production-ready. We’ll explore these next steps in upcoming posts; there’s plenty more to learn and even more fun to have.