AI & Kubernetes

As it has for so many in the tech industry, Artificial Intelligence (AI) has come to the forefront of my day-to-day work. I've been starting to learn about how "AI" fits into the world of Kubernetes, and vice versa. This post starts a series where I explore what I'm learning about AI and Kubernetes.

Types of AI Workloads on Kubernetes

To describe the AI workloads engineers are running on Kubernetes, we need some terminology. I've found it useful to describe three major types of workloads: training, inference, and serving. Each of these terms describes a different aspect of the work Platform Engineers do to bring AI workloads to life. Platform Engineers bridge the gap between the Data Scientists who design models and the end users who interact with trained implementations of those models.

A woman (Data Scientist) sits at a drafting table working on blueprints for a robot.
A man (Platform Engineer) works on building the robot from the Data Scientist's blueprint.

Data Scientists design models while Platform Engineers have an important role to play in making them run on hardware.

There’s a lot of work that happens before we get to the stage of running an AI model in production. Data scientists choose the model type, implement the model (the structure of the “brain” of the program), choose the objectives for the model, and likely gather training data. Infrastructure engineers manage the large amounts of compute resources needed to train the model and to run it for end users. The first step between designing a model and getting it to users is training.

Note: AI workloads are generally a type of stateful workload, which you can learn more about in my post on the topic.

Training Workloads

“Training” a model is the process of creating or improving the model for its intended use. It’s essentially the learning phase of the model’s lifecycle. During training, the model is fed massive amounts of data. Through this process, the AI “learns” patterns and relationships within the training data through algorithmic adjustment of the model’s parameters. This is the main workload folks are usually talking about when discussing the massive computational and energy requirements of AI.

An AI model in the training phase is still learning. A robot stands poised to drop a book into a fish bowl, with books scattered haphazardly on the floor and on the bookshelf behind it. A platform engineer facepalms while a data scientist looks on with concern.

During training, the AI model is fed massive amounts of data, which it “learns” from, algorithmically adjusting its own parameters.

Why Kubernetes for Training

Kubernetes makes a lot of sense as a platform for AI training workloads. As a distributed system, Kubernetes is designed to manage a huge amount of distributed infrastructure and the networking challenges that come with it. Training workloads have significant hardware requirements, which Kubernetes can support with GPUs, TPUs, and other specialized hardware. The scale of a model can vary greatly, from fairly simple to very complex and resource-intensive. Scaling is one of Kubernetes’ core competencies, so it can manage the variability of training workloads’ needs as well.
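As a sketch of what this looks like in practice, a training Job can request GPUs through Kubernetes' extended-resource mechanism. The Job name, image, and GPU count below are illustrative, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed on the cluster:

```yaml
# Hypothetical training Job requesting GPUs.
# Assumes the NVIDIA device plugin exposes nvidia.com/gpu on nodes.
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train            # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: example.com/llm-trainer:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 4   # scheduler places this Pod on a node with 4 free GPUs
```

The scheduler treats the GPU request like any other resource, so the same mechanism extends to TPUs and other accelerators exposed by device plugins.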

Kubernetes is also very extensible, meaning it can integrate with additional useful tools, such as observability and monitoring tools for massive training workloads. A whole ecosystem has emerged, full of useful tools for AI/Batch/HPC workloads on Kubernetes. Kueue is one such tool: a Kubernetes-native open source project for managing the queueing of batch workloads on Kubernetes.
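To give a feel for how Kueue fits in, a batch Job is submitted to a Kueue queue by labeling it with the queue name and creating it suspended; Kueue then unsuspends it when quota is available. The Job and queue names here are illustrative:

```yaml
# Hypothetical Job submitted to a Kueue LocalQueue named "team-a-queue".
# Kueue holds the Job suspended until the queue's quota allows it to run.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-run-42       # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  suspend: true               # Kueue flips this to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: example.com/trainer:latest  # illustrative image
```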

Inference Workloads

You could say that training makes the AI into as much of an “expert” as it’s going to be. At this stage, it’s time for the AI to start serving user requests. Running a pre-trained model is its own type of workload. These “inference” workloads are generally much less resource-intensive than “training” workloads, but the resource needs of inference workloads can vary significantly.

A skilled robot shows off its work: a perfectly organized bookshelf. The data scientist and platform engineer who created it express their approval with warm expressions, thumbs up, and clapping.

An “inference workload” describes running a trained model, which should be able to do its expected tasks relatively well. Note that describing a workload as “inference” doesn’t tell you whether it serves end users; it may or may not.

Inference workloads can range from fairly simple, lightweight implementations to much more complex and resource-intensive ones. The term “inference workload” can describe a standalone, actively running implementation of a pre-trained AI model. Or, it can describe an AI model that functions essentially as a backend service within a larger application, often in a microservice-style architecture.

Why Kubernetes for Inference

Inference workloads can have diverse resource needs. Some might be lightweight and run on CPUs, while others might require powerful GPUs for maximum performance. Kubernetes excels at managing heterogeneous hardware, allowing you to assign the right resources to each inference workload for optimal efficiency.
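As an illustration of assigning the right resources, an inference Deployment can declare exactly what it needs, and the scheduler matches it to suitable hardware. All names and numbers below are hypothetical:

```yaml
# Hypothetical inference Deployment. A lightweight model runs on
# CPU-only requests; a heavier model would add a GPU limit instead.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-inference   # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-inference
  template:
    metadata:
      labels:
        app: sentiment-inference
    spec:
      containers:
        - name: model-server
          image: example.com/sentiment-model:latest  # illustrative image
          resources:
            requests:
              cpu: "2"        # CPU-only for a lightweight model
              memory: 4Gi
            # For a GPU-bound model, you might instead add:
            # limits:
            #   nvidia.com/gpu: 1
```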

Kubernetes provides flexibility in the way an inference workload is used. Users may interact with it directly as a standalone application, or they may go through a separate frontend as part of a microservice architecture. Whether the workload is standalone or one part of a whole, we call the workload that runs an AI model “inference.”
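When the inference workload acts as a backend within a larger application, a standard ClusterIP Service is one way to expose it to other services, such as a separate frontend. The names here are hypothetical:

```yaml
# Hypothetical Service exposing an inference Deployment
# (Pods labeled app: model-backend) to other services in the cluster.
apiVersion: v1
kind: Service
metadata:
  name: model-backend         # illustrative name
spec:
  selector:
    app: model-backend
  ports:
    - port: 80                # cluster-internal port
      targetPort: 8000        # port the model server listens on
```

A frontend service could then reach the model at `http://model-backend` without knowing anything about the Pods behind it.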

Serving Workloads

While we use “inference” to talk about the AI part of an application specifically, “serving” describes the application workload as a whole. Inference workloads can exist without serving users, but a serving workload can’t exist without an inference workload behind it. The term “serving workload” can describe everything from packaging the AI model along with necessary code and dependencies, to delivering it to end-users in whatever form it may take. If someone refers to their AI workload as a “serving workload,” you can at least be sure that it’s serving end users.

A trained librarian bot stands welcomingly in front of an organized bookshelf. A user requests a specific book.

A “serving workload” is one that’s running a trained AI model to serve users.

It can be hard to grasp the difference between “inference” and “serving” AI workloads. Essentially, “serving” refers to the full microservice application, whereas “inference” refers to just the AI service within that larger application. So a “serving” workload generally involves an “inference” workload too, though the opposite is not necessarily true.

Why Kubernetes for Serving

Over the last decade, Kubernetes has developed a reputation as a great platform for running microservice-style applications. Naturally, many of the reasons you would use Kubernetes for a microservice-style application apply to serving AI workloads.

Kubernetes’ scaling, self-healing, and networking capabilities work well for workloads split into multiple services. Thanks to containers, Kubernetes is able to manage workloads that are diverse in terms of hardware requirements, programming languages, and more. Kubernetes provides a unifying structure for managing all the parts of the app, even if those parts function very differently.
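As one concrete example of the scaling capability, a HorizontalPodAutoscaler can grow and shrink the inference backend of a serving stack as load changes. The target Deployment name and thresholds below are illustrative:

```yaml
# Hypothetical autoscaler for a serving stack's inference backend:
# scales the "model-backend" Deployment between 2 and 10 replicas
# based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-backend-hpa     # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-backend       # illustrative target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add replicas above ~70% average CPU
```

GPU-bound backends would typically scale on custom metrics (such as request latency or queue depth) instead of CPU, but the mechanism is the same.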

Fine-Tuning AI: It’s All About Context

I’m enjoying learning about the ways “AI” fits into the world of “Kubernetes,” and there’s a lot more to learn! In this post, we explored AI training, inference, and serving workloads and why to run them on Kubernetes. These workload types are great for understanding what it means to run AI models on Kubernetes. But the real value of AI is in its ability to understand and convey context. To make a generic AI model useful, it needs to be made aware of the context it’s operating in and the role it’s fulfilling. “Fine-tuning” refers to the techniques for adding context to a generic model. In a future post, I’ll dive into Retrieval-Augmented Generation (RAG) and how it allows us to customize a pre-trained model.