Cost Engineering and FinOps

Name: Digital FTEs: Engineering — Achieving 10× Productivity
Author: Muhammad Usman Akbar

Your observability stack is running. Prometheus collects metrics, Grafana renders dashboards, and your SLO alerts fire when error budgets burn too fast. But there is one question your current stack cannot answer: How much does this cost?

You receive the monthly cloud bill. $47,000. Your finance team asks: "Which team is responsible for this spend? Why did costs jump 35% from last month? Are we paying for resources nobody uses?" Without cost visibility, you are debugging finances the same way you once debugged production — in the dark, hoping patterns emerge from spreadsheets.

This lesson teaches you to answer those questions with the same precision you now apply to latency percentiles. You will install OpenCost to monitor Kubernetes spending in real time, apply cost allocation labels so every dollar traces to an owner, identify idle resources wasting budget, use VPA recommendations to right-size workloads, and schedule non-production environments to run only when needed.

FinOps: The Discipline of Cloud Cost Management

FinOps is not about cutting costs. FinOps is about building discipline and transparency so engineering speed and financial control work together instead of against each other.

The FinOps Foundation defines three lifecycle phases:

Phase	Focus	Questions Answered
Inform	Visibility	Where is money being spent? Who owns each cost?
Optimize	Efficiency	Are resources right-sized? Can we use discounts?
Operate	Governance	Are budgets respected? Do processes prevent waste?

Organizations that implement FinOps practices discover significant untapped savings. The goal is not spending less — it is spending intentionally.

The Six FinOps Principles

FinOps practices are built on six governing principles:

Collaboration — Finance, engineering, and product teams work together in near real-time. Silos destroy cost visibility.
Ownership — Each team owns its cloud usage and manages that usage against budget. No ownership means no accountability.
Business Value Drives Decisions — Decisions optimize for value, not just raw spending. A $10,000 service generating $100,000 revenue differs from a $10,000 service generating $5,000.
Centralized FinOps Team — A dedicated team handles rate optimization and negotiations, freeing engineers to focus on their work.
Embrace Variable Costs — Cloud flexibility is an opportunity for innovation, not just a liability to manage.
Real-Time Reporting — Accurate, real-time visibility drives all other practices. Monthly bills arrive too late to change behavior.

These principles guide every practice in this lesson.

Installing OpenCost for Kubernetes Cost Monitoring

OpenCost is a vendor-neutral open source project for measuring and allocating cloud infrastructure costs in real time. It is a CNCF Incubating project supported by AWS, Microsoft, Google, and the broader cloud-native community.

Prerequisites

OpenCost requires Prometheus for metrics storage. If you followed Lesson 2, you already have the kube-prometheus-stack installed:

bash

# Verify Prometheus is running
kubectl get pods -n monitoring | grep prometheus

Output:

Specification

prometheus-kube-prometheus-prometheus-0   2/2     Running   0   12d

Install OpenCost with Helm

bash

# Add the OpenCost Helm repository
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

# Install OpenCost
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus \
  --set prometheus.internal.port=9090

Output:

Specification

NAME: opencost
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1

Verify Installation

bash

kubectl get pods -n monitoring | grep opencost

Output:

Specification

opencost-5f7d8b6c4d-x2k9p    1/1     Running   0   45s

Access the OpenCost UI

Forward the port to access the OpenCost dashboard:

bash

kubectl port-forward -n monitoring svc/opencost 9090:9090

Open http://localhost:9090 in your browser. You will see real-time cost breakdowns by namespace, deployment, and pod.

Cost Allocation: Tracing Every Dollar to an Owner

Without labels, your cost data shows spending by namespace — but namespaces do not pay bills. Teams do. Products do. Business units do.

Cost allocation labels map Kubernetes resources to organizational structures.

Required Labels for Cost Allocation

Add these labels to every Deployment, StatefulSet, and DaemonSet:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    # Cost allocation labels
    cost-center: "platform"
    team: "agents"
    product: "task-manager"
    environment: "production"
    business-unit: "engineering"
spec:
  template:
    metadata:
      labels:
        app: task-api
        # Labels must also be on pods for OpenCost to aggregate
        cost-center: "platform"
        team: "agents"
        product: "task-manager"
        environment: "production"

Namespace-Level Labels

For simpler allocation, label namespaces. All pods inherit the namespace's cost center:

yaml

apiVersion: v1
kind: Namespace
metadata:
  name: agents-prod
  labels:
    cost-center: "platform"
    team: "agents"
    environment: "production"

Querying Cost by Namespace

OpenCost exposes cost data as Prometheus metrics. Query daily cost by namespace:

promql

sum(
  (
    sum(container_cpu_allocation{namespace!=""}) by (namespace)
    * on(node) group_left() node_cpu_hourly_cost
  ) +
  (
    sum(container_memory_allocation_bytes{namespace!=""}) by (namespace) / 1024 / 1024 / 1024
    * on(node) group_left() node_ram_hourly_cost
  )
) by (namespace) * 24

Output:

Specification

{namespace="agents-prod"}    8.45
{namespace="monitoring"}    12.30
{namespace="default"}        2.15
{namespace="kube-system"}    6.80

This shows daily cost in dollars for each namespace.

Querying Cost by Team

If pods have team labels, query cost by team:

promql

sum(
  container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24
) by (label_team)

Output:

Specification

{label_team="agents"}     15.60
{label_team="platform"}   22.40
{label_team="data"}       18.90

Now you can answer: "The agents team spent $15.60 yesterday on Kubernetes resources."

Identifying Waste: Idle and Over-Provisioned Resources

Cost visibility reveals where money goes. Waste identification reveals where it should not go.

Common Sources of Kubernetes Waste

Waste Type	Description	Impact
Over-provisioned requests	Pods request more CPU/memory than they use	Blocks capacity for other workloads
Idle namespaces	Dev environments running 24/7 but used 40 hours/week	75% wasted spend
Zombie deployments	Forgotten test deployments never cleaned up	Continuous drain
Oversized node pools	Cluster capacity far exceeds actual demand	Pay for unused nodes

CPU Waste Query

Find CPU requested but not used:

promql

# CPU waste (requested minus used)
sum(
  container_cpu_allocation - rate(container_cpu_usage_seconds_total[5m])
) by (namespace)

Output:

Specification

{namespace="dev"}           2.8
{namespace="staging"}       1.2
{namespace="production"}    0.3

The dev namespace requests 2.8 cores more than it uses. That is wasted capacity.

Waste as a Percentage

Calculate efficiency to identify the worst offenders:

promql

(1 - (
  sum(rate(container_cpu_usage_seconds_total[1h])) by (namespace) /
  sum(container_cpu_allocation) by (namespace)
)) * 100

Output:

Specification

{namespace="dev"}           85.2
{namespace="staging"}       62.4
{namespace="production"}    15.8

The dev namespace has 85% waste — only 15% of requested CPU is used.

Pods Ready for Right-Sizing

Find pods where actual usage is less than 50% of requests:

promql

(
  sum(rate(container_cpu_usage_seconds_total[7d])) by (namespace, pod) /
  sum(container_cpu_allocation) by (namespace, pod)
) < 0.5

This returns pods that are good candidates for right-sizing.

Right-Sizing with Vertical Pod Autoscaler (VPA)

Guessing resource requests leads to either waste (over-provisioning) or instability (under-provisioning). The Vertical Pod Autoscaler removes guesswork by analyzing actual usage and recommending optimal values.

VPA Components

VPA consists of three parts:

Recommender — Watches usage metrics and computes recommended resources
Updater — Evicts pods needing resource updates (when enabled)
Admission Controller — Sets resources on new pods based on recommendations

Install VPA

bash

# Clone the autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Deploy VPA components
./hack/vpa-up.sh

Output:

Specification

customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalers.autoscaling.k8s.io created
deployment.apps/vpa-admission-controller created
deployment.apps/vpa-recommender created
deployment.apps/vpa-updater created

Create VPA in Recommendation Mode

Start with recommendation mode to see suggestions without automatic changes:

yaml

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: task-api-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-api
  updatePolicy:
    updateMode: "Off"  # Only recommend, don't auto-update
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

Apply the VPA:

bash

kubectl apply -f task-api-vpa.yaml

Output:

Specification

verticalpodautoscaler.autoscaling.k8s.io/task-api-vpa created

Read VPA Recommendations

After VPA collects usage data (typically 24-48 hours), check recommendations:

bash

kubectl describe vpa task-api-vpa

Output:

Specification

Recommendation:
  Container Recommendations:
    Container Name: task-api
    Lower Bound:
      Cpu:     50m
      Memory:  128Mi
    Target:
      Cpu:     125m
      Memory:  256Mi
    Uncapped Target:
      Cpu:     125m
      Memory:  256Mi
    Upper Bound:
      Cpu:     500m
      Memory:  512Mi

VPA recommends:

Target: 125m CPU, 256Mi memory (optimal based on usage)
Lower Bound: Minimum safe values
Upper Bound: Maximum the workload has needed

If your current requests are 500m CPU and 512Mi memory, VPA just identified 75% CPU over-provisioning and 50% memory over-provisioning.

Apply Right-Sizing Changes

Update your Deployment to match VPA recommendations:

yaml

resources:
  requests:
    cpu: "125m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # Keep limit higher for bursts
    memory: "512Mi"

Cost Dashboards and Budget Alerts

Visibility is not useful if nobody looks at it. Dashboards and alerts make cost data actionable.

Grafana Cost Dashboard

Create a dashboard showing total cost, cost by namespace, and efficiency score:

json

{
  "title": "Kubernetes Cost Dashboard",
  "panels": [
    {
      "title": "Total Monthly Cost Estimate",
      "type": "stat",
      "targets": [{
        "expr": "sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost + container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost) * 24 * 30"
      }],
      "fieldConfig": {
        "defaults": { "unit": "currencyUSD" }
      }
    },
    {
      "title": "Cost by Namespace",
      "type": "piechart",
      "targets": [{
        "expr": "sum((container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost + container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost) * 24) by (namespace)"
      }]
    },
    {
      "title": "Efficiency Score",
      "type": "gauge",
      "description": "CPU utilization vs allocation",
      "targets": [{
        "expr": "sum(rate(container_cpu_usage_seconds_total[1h])) / sum(container_cpu_allocation) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              {"color": "red",    "value": 0},
              {"color": "yellow", "value": 40},
              {"color": "green",  "value": 70}
            ]
          }
        }
      }
    }
  ]
}

The efficiency gauge shows red below 40%, yellow 40–70%, and green above 70%.

Budget Alerts with Prometheus

Create alerts when spending exceeds thresholds:

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
  namespace: monitoring
spec:
  groups:
  - name: cost.alerts
    rules:
    # Daily cost threshold
    - alert: HighDailyCost
      expr: |
        sum(
          container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost +
          container_memory_allocation_bytes / 1024 / 1024 / 1024 * on(node) group_left() node_ram_hourly_cost
        ) * 24 > 100
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Daily cost exceeds $100"
        description: "Current daily cost estimate: ${{ $value | printf \"%.2f\" }}"

    # Cost spike detection
    - alert: CostSpike
      expr: |
        (
          sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost)
          /
          sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost offset 1d)
        ) > 1.5
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Cost increased 50% vs yesterday"

    # Low efficiency alert
    - alert: LowResourceEfficiency
      expr: |
        sum(rate(container_cpu_usage_seconds_total[1h])) / sum(container_cpu_allocation) < 0.3
      for: 6h
      labels:
        severity: info
      annotations:
        summary: "Cluster CPU efficiency below 30%"
        description: "Consider right-sizing workloads"

Apply the rules:

bash

kubectl apply -f cost-alerts.yaml

Output:

Specification

prometheusrule.monitoring.coreos.com/cost-alerts created

Scheduling Non-Production for Significant Savings

Development and staging environments often run 24/7 but are used only during business hours. Running workloads for 40 hours instead of 168 hours per week reduces their cost by approximately 75%.

Scale Down Dev at Night

Create CronJobs to scale down dev namespaces outside business hours:

yaml

# Scale down at 7 PM on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
  namespace: dev
spec:
  schedule: "0 19 * * 1-5"  # 7 PM Monday-Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - scale
            - deployment
            - --all
            - --replicas=0
            - -n
            - dev
          restartPolicy: OnFailure
---
# Scale up at 7 AM on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-dev
  namespace: dev
spec:
  schedule: "0 7 * * 1-5"  # 7 AM Monday-Friday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - scale
            - deployment
            - --all
            - --replicas=1
            - -n
            - dev
          restartPolicy: OnFailure

Create RBAC for the Scaler

The CronJob needs permission to scale deployments:

yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: scaler
  namespace: dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: dev
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scaler-binding
  namespace: dev
subjects:
- kind: ServiceAccount
  name: scaler
  namespace: dev
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io

Apply both files:

bash

kubectl apply -f scaler-rbac.yaml
kubectl apply -f scale-cronjobs.yaml

Output:

Specification

serviceaccount/scaler created
role.rbac.authorization.k8s.io/deployment-scaler created
rolebinding.rbac.authorization.k8s.io/scaler-binding created
cronjob.batch/scale-down-dev created
cronjob.batch/scale-up-dev created

Calculate Savings

If dev namespace costs $30/day running 24/7:

Current: $30/day × 30 days = $900/month
With scheduling: $30/day × (40h / 168h) = $7.14/day × 30 = ~$215/month
Savings: ~$685/month (76%)

Reflect on Your Skill

Your observability skill now needs cost engineering patterns. Review what you have learned:

Test Your Skill

Ask your observability skill:

Specification

Write a PromQL query that shows the top 5 namespaces by daily cost,
including both CPU and memory costs.

If the skill returns a correct aggregation query with topk(5, ...), it handles cost queries. If not, it needs the OpenCost patterns.

Identify Gaps

Review your skill for these capabilities:

OpenCost installation and configuration
Cost allocation label conventions
PromQL queries for cost by namespace/team/product
Waste detection queries (CPU/memory utilization vs allocation)
VPA configuration for right-sizing recommendations
CronJob patterns for scheduled scaling
Budget alert PrometheusRule templates

Improve Your Skill

Add missing patterns. For example, if VPA patterns are missing:

Specification

Add VPA right-sizing patterns to my observability skill. Include:
VPA resource definition with recommendation mode

How to read VPA recommendations from kubectl

Resource policy configuration with min/max bounds

Best practice: start with updateMode "Off" in production

Your skill should now cover the full FinOps lifecycle: Inform (cost visibility), Optimize (right-sizing), and Operate (budget alerts and scheduling).

Try With AI

Prompt 1: Analyze Your Cost Structure

Specification

I have installed OpenCost on my Kubernetes cluster. Help me understand my
cost structure. Ask me about my namespace organization, team structure, and
current labeling strategy. Then suggest specific queries I should run to
understand where my money is going.

What you're learning: Cost analysis starts with understanding organizational structure. The queries that matter depend on how your organization allocates responsibility. This dialogue helps you map your organization to cost queries.

Prompt 2: Design a Right-Sizing Strategy

Specification

My cluster shows 65% CPU waste according to OpenCost. Help me design a
right-sizing strategy. Ask about my workload types (stateless web apps,
stateful databases, batch jobs) and risk tolerance. Then recommend which
workloads to right-size first and which to leave alone.

What you're learning: Not all waste should be eliminated. Production databases might intentionally over-provision for safety margins. Stateless web services are safer to right-size. This conversation teaches you to prioritize based on risk.

Prompt 3: Build Your Cost Governance Model

Specification

My finance team wants monthly cost reports broken down by team and product.
Help me design a cost governance model. I'll describe our team structure and
you suggest a labeling standard, required dashboards, and alert thresholds.
Challenge me on assumptions I'm making about cost ownership.

What you're learning: Cost governance is organizational, not just technical. Labels must reflect actual accountability structures. This exercise forces you to think about who pays for shared infrastructure, how to handle platform costs, and what thresholds trigger action.

Safety note: Cost data reveals business information — which products receive investment, which teams are growing, and organizational priorities. Treat OpenCost dashboards with the same access controls as financial reports.