Deployments keep your AI agent running forever. But what about tasks that should run once and stop? Or tasks that run on a schedule?
These are batch workloads—finite tasks that complete and exit. Kubernetes provides two primitives for this: Jobs (run once) and CronJobs (run on a schedule).
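You've learned that Deployments manage Pods that should run continuously. But not all workloads are long-running: a one-time model migration or a nightly embedding refresh should finish its work and exit.
Key insight: Deployments use restartPolicy: Always, so containers are restarted whenever they exit, even after a successful run. Jobs use restartPolicy: Never or OnFailure, so Pods are not restarted after successful completion.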
Create a Job that simulates an AI agent maintenance task—processing data and exiting:
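A minimal sketch of such a manifest, built from the fields explained below. The Job name embedding-refresh, the busybox image, and the echo/sleep command are assumptions standing in for a real maintenance task:

```yaml
# embedding-refresh-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: embedding-refresh        # assumed name, chosen to match the filename
spec:
  backoffLimit: 4                # retry up to 4 times if the container fails
  template:
    spec:
      restartPolicy: Never       # a finished container is not restarted
      containers:
        - name: refresh
          image: busybox:1.36    # placeholder image for the simulation
          command:
            - sh
            - -c
            - |
              echo "Starting embedding refresh..."
              sleep 5
              echo "Embedding refresh complete"
```

The container prints two messages, sleeps briefly, and exits with code 0, which is what allows the Job to reach the Completed state.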
Output: (This is the manifest structure; we'll apply it next)
apiVersion: batch/v1 Jobs use the batch API group, not apps like Deployments.
kind: Job Tells Kubernetes this is a finite workload.
spec.template The Pod template—identical to what you'd put in a Deployment's template. The Job creates one or more Pods using this template.
restartPolicy: Never Critical difference from Deployments. When the container exits with code 0 (success), the Pod stays Completed and doesn't restart.
backoffLimit: 4 If the container fails (non-zero exit code), Kubernetes retries up to 4 times before marking the Job as failed.
Save the manifest as embedding-refresh-job.yaml and apply it with kubectl apply -f embedding-refresh-job.yaml:
Output:
Watch the Job progress with kubectl get jobs --watch:
Output:
Check the Pod status with kubectl get pods:
Output:
Notice STATUS: Completed—the Pod finished successfully and stopped. Unlike a Deployment Pod (which would show Running), this Pod is done.
View the logs with kubectl logs job/<job-name> to see what happened:
Output:
The Job ran, completed its task, and stopped. The Pod remains in Completed state for inspection (logs, debugging) until you delete it.
Unlike Deployments (which use ReplicaSets as intermediaries), Jobs directly manage their Pods. The naming follows the pattern: {job-name}-{random-suffix}.
Delete the Job with kubectl delete job <job-name> (this also deletes its Pods):
Output:
What if you need to process 10,000 documents for embedding refresh? Running sequentially takes too long. Jobs support parallelism:
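As a sketch, a Job that needs 5 successful completions and runs at most 2 Pods at a time could look like the following. The name, image, and command are placeholders; only completions and parallelism matter here:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-embedding       # hypothetical name
spec:
  completions: 5                 # the Job finishes after 5 Pods succeed
  parallelism: 2                 # at most 2 Pods run at the same time
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36    # placeholder image
          command:
            - sh
            - -c
            - |
              echo "Processing one chunk of documents"
              sleep 10
```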
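Key parameters:
completions Number of successful Pod completions required for the Job to finish.
parallelism Maximum number of Pods that can run simultaneously.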
Apply and watch:
Output:
Kubernetes keeps up to 2 Pods running at a time until 5 Pods have completed successfully. Near the end, fewer Pods may run, because only the remaining completions are started.
Check Job status:
Output:
For AI workloads, a parallel Job with a fixed completion count is the most common choice: split a large dataset into chunks and process the chunks in parallel.
CronJobs create Jobs on a schedule. Every execution creates a new Job, which creates new Pod(s).
Common patterns:
Create a CronJob that cleans up old agent logs every minute (for demonstration—in production, use a longer schedule):
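A minimal sketch of such a CronJob, assuming a cluster where CronJob is served from batch/v1 (Kubernetes 1.21 or later). The name log-cleanup, the busybox image, and the command are placeholders for a real cleanup task:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-cleanup                  # hypothetical name
spec:
  # Cron fields: minute hour day-of-month month day-of-week
  schedule: "* * * * *"              # every minute (demo only)
  successfulJobsHistoryLimit: 3      # keep the last 3 successful Jobs
  failedJobsHistoryLimit: 1          # keep only the last failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: busybox:1.36    # placeholder image
              command:
                - sh
                - -c
                - date; echo "Cleaning up old agent logs"
```

For a production schedule, a value such as "0 2 * * *" (02:00 every day) or "*/15 * * * *" (every 15 minutes) uses the same five-field syntax.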
New fields:
schedule Cron expression defining when to create Jobs.
jobTemplate The Job template—notice it's the same structure as a Job spec, wrapped in jobTemplate.spec.
successfulJobsHistoryLimit: 3 Keep the last 3 successful Jobs (and their Pods) for inspection. Older ones are auto-deleted.
failedJobsHistoryLimit: 1 Keep only the last failed Job for debugging.
Apply and watch:
Output:
Wait a minute and check again:
Output:
List Jobs created by the CronJob:
Output:
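Each Job name includes a timestamp-based suffix (28504821, 28504822). The suffix is the scheduled time expressed in minutes since the Unix epoch, which is why consecutive one-minute runs differ by 1.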
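What if a Job is still running when the next schedule triggers? Configure this with concurrencyPolicy:
Allow (default) Lets Jobs run concurrently.
Forbid Skips the new run if the previous Job is still running.
Replace Cancels the running Job and starts the new one.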
For AI workloads (like embedding refresh), use Forbid to prevent overlapping:
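A sketch of where the field sits in a CronJob spec. This is a fragment, not a complete manifest, and the hourly schedule is an assumption:

```yaml
# Fragment of a CronJob spec (metadata and Job template omitted)
spec:
  schedule: "0 * * * *"          # assumed: at the top of every hour
  concurrencyPolicy: Forbid      # skip a run while the previous Job is still active
```

With Forbid, a slow refresh never overlaps with the next one; the skipped run is simply not started.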
Your RAG agent needs fresh embeddings from an updated knowledge base:
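One way this could look is a nightly CronJob. The name, the 02:00 schedule, and the busybox placeholder are assumptions; in practice the container would run your own embedding code:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-embedding-refresh    # hypothetical name
spec:
  schedule: "0 2 * * *"              # assumed: 02:00 every day
  concurrencyPolicy: Forbid          # never run two refreshes at once
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2                # assumed retry budget
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: refresh
              image: busybox:1.36    # placeholder; replace with your refresh image
              command:
                - sh
                - -c
                - echo "Re-embedding updated knowledge base documents"
```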
When upgrading your agent's model, run a migration Job:
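A sketch of a one-time migration Job that cleans itself up. The name, the one-hour TTL, and the placeholder command are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-migration              # hypothetical name
spec:
  ttlSecondsAfterFinished: 3600      # assumed: delete the Job 1 hour after it finishes
  backoffLimit: 1                    # assumed: allow a single retry
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox:1.36        # placeholder; replace with your migration image
          command:
            - sh
            - -c
            - echo "Migrating agent data to the new model version"
```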
ttlSecondsAfterFinished: Automatically deletes the Job and its Pods the specified number of seconds after the Job finishes. Useful for one-time migrations you don't need to keep.
Process 1000 documents for a new knowledge base:
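A sketch using an Indexed Job with 100 completions, so each Pod handles 10 of the 1000 documents. The name, parallelism: 10, and the shell arithmetic are assumptions; a real worker would read the index in its own code:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: document-batch               # hypothetical name
spec:
  completions: 100                   # one completion per batch of 10 documents
  parallelism: 10                    # assumed: 10 batches in flight at once
  completionMode: Indexed            # gives each Pod an index from 0 to 99
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36        # placeholder image
          command:
            - sh
            - -c
            - |
              # Map this Pod's index to a range of document numbers
              START=$((JOB_COMPLETION_INDEX * 10))
              END=$((START + 9))
              echo "Index $JOB_COMPLETION_INDEX: processing documents $START to $END"
```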
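JOB_COMPLETION_INDEX: When the Job sets completionMode: Indexed, Kubernetes injects a unique index (0-99 for 100 completions) into each Pod as this environment variable. Your code uses it to determine which batch of documents to process; with 1000 documents and 100 indexes, each Pod handles 10.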
Job: Kubernetes primitive for running a task to completion. Creates Pods that stop after successful execution.
CronJob: Creates Jobs on a schedule using cron expressions. Manages Job history automatically.
completions: Number of successful Pod completions required for the Job to finish.
parallelism: Maximum number of Pods that can run simultaneously.
restartPolicy: Must be Never or OnFailure for Jobs (not Always).
backoffLimit: Number of retries before marking a Job as failed.
concurrencyPolicy: How CronJobs handle overlapping executions (Allow, Forbid, Replace).
ttlSecondsAfterFinished: Auto-cleanup of completed Jobs after a time period.
Open a terminal and work through these scenarios:
Your task: Create a Job that backs up your agent's conversation history to an S3 bucket.
Ask AI: "Create a Kubernetes Job manifest that runs an S3 backup using the AWS CLI. It should copy files from /data/conversations to s3://my-bucket/backups/."
Review AI's response:
Tell AI: "The Job should mount a PersistentVolumeClaim named 'agent-data' to access the conversation files."
Reflection:
Your task: Your nightly CronJob hasn't run successfully in 3 days. Diagnose the issue.
Ask AI: "My CronJob named 'nightly-sync' shows LAST SCHEDULE was 3 days ago but ACTIVE is 0. What commands should I run to diagnose this?"
AI should suggest:
Ask: "The Job Pod shows ImagePullBackOff. What does this mean and how do I fix it?"
Reflection:
Your task: You have a Job processing 1000 items with completions: 1000 and parallelism: 50. It's consuming too many cluster resources.
Ask AI: "How can I run a Kubernetes Job that processes 1000 items but limits resource consumption? Currently using parallelism: 50 but it's overwhelming the cluster."
AI might suggest:
Ask: "Show me how to use a work queue pattern instead of one Pod per item."
Reflection:
You built a kubernetes-deployment skill in Lesson 0. Test and improve it based on what you learned.
Ask yourself:
If you found gaps: