
From Jupyter to Kubernetes: The Infrastructure Leap AI Teams Always Underestimate

Why AI teams fail moving from Jupyter notebooks to production. Learn the infrastructure patterns, cost models, and governance strategies that actually work.

By Brightlume Team

The Gap Nobody Talks About

You've got a working model in Jupyter. It predicts customer churn with 87% accuracy. Your data scientist ran it on their laptop last Tuesday. Everyone's excited. Then someone asks: "How do we run this at scale?"

That question kills more AI projects than bad models do.

The leap from Jupyter to Kubernetes isn't a modelling problem—it's an infrastructure problem masquerading as one. Your model works. Your data pipeline works. What breaks is the assumption that you can just "containerise it" and ship it. In reality, you're moving from an environment where a single person controls everything to one where latency, cost, governance, and resource contention all become hard constraints.

At Brightlume, we ship production-ready AI solutions in 90 days. That means we move models from Jupyter to Kubernetes constantly. And we've learned that teams consistently underestimate three things: the infrastructure complexity, the cost of not planning for scale, and the governance overhead that production demands.

This guide walks you through the infrastructure leap. We'll cover what breaks, why it breaks, and the concrete patterns that actually work.

Why Jupyter Feels Easy (And Kubernetes Feels Hard)

Jupyter notebooks are brilliant for exploration. You write code, you see results immediately, you iterate. The environment is implicit—your laptop has 16GB RAM, a GPU if you're lucky, and one person (you) controlling the entire execution context. State lives in memory. Secrets live in environment variables. Dependencies are whatever you pip installed.

This works until it doesn't.

Kubernetes is the opposite. It's explicit about everything. You define resource requests (CPU, memory, GPU). You declare dependencies in container images. You specify retry logic, health checks, and graceful shutdown behaviour. You think about network latency, persistent storage, and cost per inference.

The cognitive shift is brutal. In Jupyter, you optimise for iteration speed. In Kubernetes, you optimise for reliability, cost, and observability. These are sometimes at odds.

Let's be concrete. A typical Jupyter workflow:

  • Data scientist loads 500MB dataset into memory
  • Trains a model (30 minutes, GPU)
  • Runs inference on test set (2 seconds)
  • Iterates on feature engineering
  • Ships the .pkl file to engineering

A production Kubernetes workflow:

  • Model runs in a containerised service with explicit resource limits
  • Inference requests come from multiple clients simultaneously
  • Each request must complete within a latency SLA (say, 100ms)
  • The service must handle 1000 requests/second without OOMing
  • Failed requests must retry with exponential backoff
  • Costs must be tracked per inference and optimised
  • Model versions must be tracked, audited, and rolled back if accuracy degrades

The gap between these two worlds is where most AI projects break. Kubernetes for AI workloads has evolved specifically to address this, but teams still underestimate the transition cost.

The Infrastructure Stack You Actually Need

Let's define what we're building. A production AI inference system has these layers:

The Model Layer

Your Jupyter model is stateless code. In production, it becomes a containerised service. That container must:

  • Include the model weights (or a mechanism to download them)
  • Include all dependencies (PyTorch, transformers, etc.)
  • Expose an API endpoint (typically HTTP/gRPC)
  • Handle concurrent requests
  • Log inference requests for auditing

You're no longer thinking about "running a model." You're thinking about "running a service that serves model predictions."

The Orchestration Layer

Kubernetes manages the service. It handles:

  • Pod scheduling across nodes
  • Resource allocation (CPU, memory, GPU)
  • Health checks and automatic restarts
  • Load balancing across replicas
  • Rolling updates (deploying new model versions without downtime)

This is where cost and latency live. If you request 8GB RAM per pod and run 10 replicas, you're paying for 80GB of RAM whether you use it or not.

The Observability Layer

In Jupyter, you see errors immediately. In Kubernetes, errors happen in pods you don't directly control. You need:

  • Structured logging (which pod served which request?)
  • Metrics (latency, throughput, error rate per model version)
  • Tracing (which service called which, and how long did it take?)
  • Alerting (when accuracy degrades or latency spikes)

The Data Layer

Your Jupyter notebook loaded data from a CSV or database. In production:

  • Data arrives continuously (streaming or batch)
  • You need versioning (which training data produced which model?)
  • You need caching (avoid re-downloading the same features)
  • You need governance (audit which data was used for which prediction)

This is where AI model governance becomes non-negotiable. You can't just "run inference." You must know exactly which data, which model version, and which code produced each prediction.

The Cost Trap

Here's where teams get blindsided. A Jupyter notebook running inference on your laptop costs nothing (you've already paid for the laptop). A Kubernetes cluster running inference costs money every second it's alive.

Let's do the math. You want to serve 100 inference requests per day. Your model takes 2 seconds per inference on a GPU. A naive approach:

  • Run a Kubernetes pod with an A100 GPU (40GB memory)
  • Keep it alive 24/7
  • Cost: ~$3 per hour = ~$2,200 per month
  • Actual usage: 100 requests × 2 seconds = 200 seconds per day = 0.23% utilisation

You're paying for 99.77% idle capacity.
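The arithmetic works out like this. A sketch in Python; the $3/hour A100 rate and a 730-hour month are assumptions you should replace with your provider's actual pricing:

```python
# Back-of-envelope cost model for an always-on GPU pod.
# The $3/hour rate and 730-hour month are assumptions, not quotes.
HOURLY_GPU_COST = 3.00
HOURS_PER_MONTH = 730
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = 100
seconds_per_inference = 2

monthly_cost = HOURLY_GPU_COST * HOURS_PER_MONTH          # pod alive 24/7
busy_seconds = requests_per_day * seconds_per_inference   # 200 s of real work
utilisation = busy_seconds / SECONDS_PER_DAY              # ~0.23%
cost_per_inference = monthly_cost / (requests_per_day * 30)

print(f"monthly cost:       ${monthly_cost:,.0f}")
print(f"utilisation:        {utilisation:.2%}")
print(f"cost per inference: ${cost_per_inference:.2f}")
```

Run this with your own numbers before you provision anything; the cost-per-inference figure is the one to compare against serverless alternatives.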

The production approach:

  • Run inference on a CPU (slower, but cheaper)
  • Or use a serverless GPU service (pay per inference)
  • Or batch requests (run inference every 10 minutes on accumulated requests)
  • Or cache aggressively (if the same input appears twice, don't re-run inference)

Each of these trades off latency, cost, or complexity. The key is measuring the tradeoff. You need:

  • Cost per inference (dollars)
  • Latency per inference (milliseconds)
  • Throughput (requests per second)
  • Accuracy (does the model still work?)

Teams that skip this step end up with a Kubernetes cluster that's either too expensive or too slow. And because they didn't measure it in Jupyter, they don't know which one.

The Latency Problem

In Jupyter, latency is invisible. Your model runs on your machine. You see the result in 2 seconds. Done.

In Kubernetes, latency has multiple sources:

  • Network latency: Request travels from client to Kubernetes cluster (10-100ms)
  • Queue latency: Request waits for a free pod to handle it (0-1000ms, depending on load)
  • Model latency: Model actually runs (2000ms for your 2-second inference)
  • Serialisation latency: Request and response are serialised/deserialised (10-100ms)

Total: 2020-2200ms under light load, not 2000ms. And if 1000 requests arrive at once and only 10 pods are serving, queue latency balloons into the minutes. Customers notice.
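A quick sanity check of that budget, using the component ranges above (queue latency set to zero for the light-load case):

```python
# Latency components in milliseconds, from the list above. Queue latency is
# zero under light load; the model itself dominates at 2000ms.
low = 10 + 0 + 2000 + 10     # best case: fast network, idle queue
high = 100 + 0 + 2000 + 100  # worst case without queueing

# Under load, queueing dominates everything else: 1000 requests spread over
# 10 single-threaded pods means a queue ~100 deep per pod, at 2s per request.
queue_wait_s = (1000 / 10) * 2  # rough wait for the unlucky last request

print(f"end-to-end: {low}-{high} ms; worst-case queue wait: {queue_wait_s:.0f} s")
```

The lesson: once the queue is non-empty, network and serialisation overhead become rounding errors.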

The production approach requires:

  • Profiling your model in Kubernetes (not on your laptop)
  • Setting latency SLAs (e.g., "95th percentile latency < 100ms")
  • Capacity planning based on peak load, not average load
  • Autoscaling rules (add pods when latency exceeds threshold)
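The last item can be expressed as a HorizontalPodAutoscaler. This is a sketch that scales on CPU utilisation, the simplest signal Kubernetes supports out of the box; scaling on a latency metric instead requires a custom or external metrics adapter, which is beyond this example:

```yaml
# Illustrative HPA: keep average CPU below 70%, between 3 and 20 replicas.
# Scaling on p95 latency requires a custom metrics adapter instead.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Note that GPU-bound inference often saturates the GPU long before CPU climbs, so CPU-based scaling is a starting point, not an endpoint.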

Running agents on Kubernetes introduces additional complexity because agents are stateful. They remember conversation history, maintain tool state, and take longer to execute. The infrastructure must account for this.

Moving from Notebooks to Containers

Containerisation is the first step, and it's where most teams stumble. A container is a snapshot of your environment: code, dependencies, model weights, everything.

The Jupyter-to-container journey:

Step 1: Extract Your Model from Jupyter

Your Jupyter notebook is a linear script. Production code is modular. You need:

model/
  __init__.py
  inference.py      # Load model, run prediction
  preprocessing.py  # Feature engineering
  postprocessing.py # Format output
requirements.txt    # Dependencies
Dockerfile          # Container definition

This is more work than it sounds. Jupyter notebooks mix data exploration, feature engineering, and model training. Production code separates these concerns. Your data scientist might have 200 lines of exploration code that produces 3 lines of actual feature engineering. You need to extract those 3 lines.

Step 2: Version Your Dependencies

In Jupyter, you run pip install torch. It installs whatever the latest version is. In production, you need exact versions:

torch==2.1.0
transformers==4.36.0
numpy==1.24.3

Why? Because a minor version bump can change inference latency or accuracy. You need reproducibility. Open-source tools such as kubetorch aim to bridge this gap—letting you iterate in Jupyter-like environments while building toward Kubernetes deployment.

Step 3: Handle Model Weights

Your Jupyter notebook loads weights from a local file. In production:

  • Weights live in object storage (S3, GCS)
  • The container downloads them at startup
  • Downloading a 5GB model takes 30 seconds
  • If the download fails, the pod fails to start
  • Kubernetes restarts the pod
  • The pod tries to download again
  • This loops until the download succeeds

You need a startup script that:

  • Downloads weights with retries
  • Validates checksums (did the download corrupt the file?)
  • Logs progress (so you can debug if it fails)
  • Fails fast if something's wrong

Step 4: Expose an API

In Jupyter, you call model.predict(input) directly. In production, requests come over HTTP. You need a web server:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = load_model()

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    output = model.predict(request.features)
    return {"prediction": output}

This is straightforward, but it introduces new failure modes:

  • What if the request is malformed?
  • What if the model crashes during inference?
  • What if inference takes longer than the request timeout?

You need error handling, logging, and timeouts. AI agents that write and execute code add another layer—they need to handle tool execution failures, timeout retries, and state management across API calls.

Kubernetes Deployment Patterns

Once your model is containerised, you deploy it to Kubernetes. The simplest pattern:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
    spec:
      containers:
      - name: inference
        image: myregistry/model:v1.2.3
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10

This says: "Run 3 replicas of my model, each requesting 2 CPU cores, 4GiB of RAM, and 1 GPU, with limits of 4 cores and 8GiB."

But this is where the real decisions happen:

How many replicas? If each replica handles 100 requests/second and you expect 250 requests/second peak, you need at least 3 replicas. But you also need headroom for failures and rolling updates. So maybe 5.
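That replica arithmetic can be captured in a helper. The spare of 2 replicas (covering node failures and rolling updates) is an assumption, not a rule; tune it to your availability target:

```python
import math

# Replicas = ceil(peak load / per-replica throughput) + spare capacity.
# The default spare of 2 is an assumed headroom for failures and rollouts.
def replicas_needed(peak_rps: float, rps_per_replica: float, spare: int = 2) -> int:
    return math.ceil(peak_rps / rps_per_replica) + spare

print(replicas_needed(250, 100))  # 3 replicas of raw capacity + 2 spare = 5
```

Feed it peak load, not average load; averaging is how teams end up under-provisioned at exactly the moment it matters.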

What resource requests? Too high, and you waste money. Too low, and pods get OOM-killed when they exceed their memory limits. You need to profile your model under realistic load.

What about GPU sharing? Can multiple inference requests share one GPU? Yes, but they'll contend for memory and compute. You need benchmarks.

This is where research on multi-tenant Kubernetes clusters becomes relevant. Running multiple AI workloads on shared infrastructure requires careful resource isolation and monitoring.

The Governance and Auditing Layer

Production AI requires governance. You need to know:

  • Which model version served which prediction?
  • Which data was used?
  • Did the model's accuracy degrade over time?
  • Can you roll back to a previous version if something breaks?

This is AI model governance in practice. In Jupyter, you might have one model file. In production, you have:

  • Model registry (which versions exist?)
  • Model metadata (accuracy, latency, training data)
  • Deployment tracking (which version is running where?)
  • Inference logging (which version made which prediction?)
  • Rollback procedures (how do we revert to the previous version?)
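The inference-logging item above boils down to emitting one record per prediction that ties input, output, and model version together. A minimal sketch (field names are illustrative, not a standard schema):

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

# One audit record per prediction: enough to answer "which model version
# made this prediction, on what input, and when?"
@dataclass
class InferenceRecord:
    model_name: str
    model_version: str
    features: list
    prediction: float
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def to_log_line(self) -> str:
        return json.dumps(asdict(self))

record = InferenceRecord("churn", "v1.2.3", [0.1, 0.7], 0.83)
print(record.to_log_line())
```

Written as JSON lines, these records flow into whatever log aggregator you already run, and the `model_version` field is what makes rollback investigations tractable.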

Tools like MLflow or Weights & Biases help, but they add operational overhead. Someone needs to manage the model registry. Someone needs to monitor accuracy. Someone needs to own rollback procedures.

At Brightlume, this is built into our 90-day delivery process. We don't just deploy a model—we deploy a versioned, auditable, rollback-capable system. That's what separates "deployed" from "production-ready."

Security and Data Protection

Jupyter notebooks often contain secrets (database passwords, API keys) in plain text. Kubernetes can't work that way.

Production requirements:

  • Secrets (passwords, API keys) stored in a secrets manager, not in code
  • Encryption in transit (HTTPS for all API calls)
  • Encryption at rest (model weights encrypted in storage)
  • Access control (who can deploy new models? who can see inference logs?)
  • Audit logging (every deployment, every model change, every inference)

AI agent security becomes critical when agents have access to tools, databases, or external APIs. A compromised agent can leak data or execute unintended actions. You need:

  • Input validation (does the request look legitimate?)
  • Rate limiting (prevent brute-force attacks)
  • Monitoring (detect unusual access patterns)
  • Isolation (agents can only access resources they're authorised for)

Cost Optimisation in Kubernetes

Teams often deploy to Kubernetes without understanding the cost implications. Surveys of Kubernetes adopters suggest roughly half of teams still manage costs manually, despite executive pressure to control expenses.

Cost optimisation requires:

Right-sizing: Profile your workload and request exactly the resources you need, not more.

Autoscaling: Add pods when load increases, remove them when load decreases. A model that handles 100 requests/second at 2 seconds per request needs 200 pods if requests are sequential, but maybe 5 pods if requests are batched.

Spot instances: Use cheaper, preemptible instances for non-critical workloads. But be prepared for sudden termination.

Model optimisation: A smaller, faster model might cost 10x less to run. Quantisation (reducing precision from float32 to int8) can reduce latency by 4x with minimal accuracy loss.

Caching: If the same input appears twice, cache the result. Avoid redundant inference.

At Brightlume, cost optimisation is part of the 90-day process. We don't just deploy—we optimise for the specific constraints of your business (latency, cost, accuracy).

Observability: Seeing What's Actually Happening

In Jupyter, you see errors immediately. In Kubernetes, errors happen in pods you don't control. You need visibility.

The observability stack:

Logging: Every inference request should be logged. What input did it receive? What output did it produce? How long did it take? If something goes wrong, you need the logs to debug it.

Metrics: Latency, throughput, error rate, GPU utilisation. You need graphs showing these over time. When latency spiked yesterday, what else happened?

Tracing: If your system has multiple services (data preprocessing, model inference, post-processing), you need to see how a request flows through them. Where is the bottleneck?

Alerting: When metrics exceed thresholds (latency > 100ms, error rate > 1%), someone needs to know immediately.

Tools like Prometheus (metrics), Grafana (dashboards), and Jaeger (tracing) are industry standard. But they add operational complexity. Someone needs to maintain them, tune alert thresholds, and respond to alerts.
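The logging piece of that stack starts with structure. A sketch using only the standard library, emitting one JSON line per request so logs can be filtered by pod, model version, or latency (field names are illustrative):

```python
import json
import logging
import sys
import time

# One JSON line per inference request; any aggregator that ingests
# JSON lines (Loki, CloudWatch, ELK) can index these fields.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("inference")

def log_request(pod: str, model_version: str, latency_ms: float, status: str) -> str:
    line = json.dumps({
        "ts": time.time(),
        "pod": pod,
        "model_version": model_version,
        "latency_ms": round(latency_ms, 1),
        "status": status,
    })
    log.info(line)
    return line

entry = log_request("inference-7d9f", "v1.2.3", 87.4, "ok")
```

Structured fields are what turn "grep the logs" into "graph the error rate per model version".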

The Transition Process

Moving from Jupyter to Kubernetes isn't a big bang. It's a sequence of steps:

Phase 1: Extract and containerise (2-3 weeks)

  • Extract model code from notebook
  • Write requirements.txt
  • Build Docker image
  • Test locally

Phase 2: Single-pod deployment (1-2 weeks)

  • Deploy to Kubernetes (1 replica)
  • Set up basic monitoring
  • Load test to find bottlenecks
  • Profile latency and resource usage

Phase 3: Scale and optimise (2-3 weeks)

  • Add replicas based on load
  • Implement autoscaling
  • Optimise resource requests
  • Reduce costs

Phase 4: Add governance and observability (2-3 weeks)

  • Set up model registry
  • Implement inference logging
  • Add alerting
  • Document runbooks for common failures

This is roughly 90 days. Which is why at Brightlume, we use this timeline as our standard delivery window. We've done it enough times to know what works.

Common Pitfalls

Teams consistently make the same mistakes:

Pitfall 1: Assuming Kubernetes solves the problem

Kubernetes is an orchestration platform. It doesn't make your model faster, cheaper, or more accurate. It makes it easier to run at scale. But you still need to optimise the model itself.

Pitfall 2: Not measuring in production

You measured latency in Jupyter (2 seconds). In Kubernetes, it's 2.2 seconds (with network overhead). But you didn't measure it, so you assumed it was still 2 seconds. Now you're surprised that you need 10% more pods than expected.

Pitfall 3: Forgetting about data

Your Jupyter notebook had clean training data. Production data is messy. Missing values, outliers, distribution shift. Your model was trained on 2023 data. It's now 2025. Customer behaviour has changed. The model's accuracy degrades. You didn't notice because you weren't monitoring it.

Pitfall 4: Skipping security

Your Jupyter notebook had database credentials in a comment. In production, that's a security breach. You need secrets management, encryption, and access control.

Pitfall 5: Underestimating operational overhead

You deployed a model. Now you need to monitor it, update it, debug it, roll it back, and explain it to auditors. This is 50% of the work. Teams often underestimate this and end up with unmaintainable systems.

Building for Production from the Start

The best time to think about Kubernetes is before you write your Jupyter notebook. But that's not realistic. So the next best time is as soon as you have a working model.

Think about:

  • Reproducibility: Can someone else run your code and get the same result?
  • Modularity: Is your code split into reusable components?
  • Testability: Can you test each component independently?
  • Observability: Can you measure performance?
  • Versioning: Can you track which version of code produced which result?

These are software engineering practices that happen to make the Jupyter-to-Kubernetes transition much smoother.

At Brightlume, we help teams think about this from day one. Our AI-native engineering approach means we're not just shipping models—we're shipping systems that are built for production from the start.

The Role of AI Agents in Modern Infrastructure

AI agents add a new dimension to this infrastructure story. Agents are digital coworkers, and they are stateful—they remember context, maintain tool state, and take longer to execute than simple inference.

Deploying agents to Kubernetes requires:

  • State management: Where does conversation history live? How do you handle pod restarts?
  • Tool execution: Agents call external tools (databases, APIs). These calls must be logged, monitored, and retried on failure.
  • Timeout handling: Agents might take 30 seconds to complete a task. Your API timeout needs to accommodate this.
  • Cost tracking: Agent execution is expensive (multiple LLM calls per request). You need to track and optimise cost per agent execution.

For operations teams, AI agents for IT operations can handle ticket triage, incident response, and monitoring—but only if the infrastructure supports them.

Measuring Success

How do you know if your Kubernetes deployment is working? You need metrics:

  • Latency: 95th percentile latency (not average—peak matters)
  • Throughput: Requests per second
  • Availability: Percentage of requests that succeed (target: 99.9%)
  • Cost: Dollars per 1000 inferences
  • Accuracy: Does the model still work? (measure on production data)
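The percentile point deserves a worked example: the mean can look healthy while the tail is terrible. A sketch with an illustrative sample of latencies:

```python
import math

# Tail-latency sketch: why p95, not the mean, should drive capacity decisions.
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for dashboards."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[rank]

# 90 fast requests and 10 slow ones: the mean hides the tail, p95 does not.
latencies = [20.0] * 90 + [900.0] * 10
mean = sum(latencies) / len(latencies)  # 108.0 ms - looks acceptable
p95 = percentile(latencies, 95)         # 900.0 ms - the real user experience
```

One in ten users waiting nearly a second is a problem the average will never show you.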

You should measure these in Jupyter too, so you have a baseline. Then measure them in Kubernetes and compare. If Kubernetes is slower or more expensive, you need to optimise.

At Brightlume, we use these metrics to validate that a deployment is truly production-ready. It's not enough to ship code—it has to meet the SLAs that matter to your business.

Next Steps: From Kubernetes to Agentic Workflows

Once you've mastered Kubernetes for simple inference, the next frontier is AI agent orchestration: coordinating multiple agents, each with different capabilities, running concurrently.

This requires:

  • Service mesh: How do agents communicate with each other?
  • State coordination: How do agents share context?
  • Error handling: What happens if one agent fails?
  • Cost control: How do you prevent agents from calling each other in loops?

This is where infrastructure becomes strategy. The teams that master this will have a significant competitive advantage.

Conclusion: The Infrastructure Leap Is Non-Negotiable

Moving from Jupyter to Kubernetes isn't optional if you want to run AI at scale. It's the infrastructure leap that separates pilots from production systems.

The key insight: this leap is about much more than containers and orchestration. It's about building systems that are reliable, observable, auditable, and cost-effective. It's about thinking in terms of SLAs, not just accuracy. It's about governance, security, and operational readiness.

Teams that underestimate this leap end up with expensive, unmaintainable systems. Teams that plan for it from the start end up with systems that scale, that are easy to operate, and that deliver real business value.

At Brightlume, we've built this process into our 90-day delivery model because we know from experience that this is where most AI projects fail. We deliver production-ready AI solutions that are built for Kubernetes from the start—not retrofitted afterward.

If you're planning an AI deployment, start thinking about infrastructure now. Profile your model. Estimate costs. Plan for scale. The teams that do this end up with systems that work. The teams that skip this step end up with expensive lessons.

Learn more about how Brightlume delivers AI solutions in 90 days, and explore our practical guides on shipping production AI to understand the full journey from pilot to production.