Running Machine Learning Pods on GPU Nodes in EKS

One of our clients uses EKS and had a problem: they needed their machine learning pods to run on a dedicated Kubernetes node group with GPU nodes, and nowhere else.

We solved this problem using Kubernetes taints, tolerations, labels, and node selectors.


This is how to configure it:


Choose an Amazon EKS-optimized accelerated Amazon Linux AMI (which ships with the NVIDIA drivers) and install the NVIDIA device plugin for Kubernetes.

The link below describes how to configure this part properly:
https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html

Before proceeding to the next section, make sure you have:

  • nodes using the EKS-optimized accelerated Amazon Linux AMI;
  • a node group that contains these nodes; and
  • the NVIDIA device plugin for Kubernetes installed.
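Assuming kubectl is already pointed at your cluster, you can sanity-check these prerequisites before moving on. The daemonset name below matches the upstream NVIDIA device plugin manifest; adjust it if you installed the plugin differently:

```shell
# Check that the NVIDIA device plugin daemonset is running
# (name assumes the upstream manifest; yours may differ)
kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset

# Confirm GPU nodes advertise an allocatable nvidia.com/gpu resource
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```

If the second command shows an empty value for your GPU nodes, the device plugin is not registering the GPUs and the later scheduling steps will not work.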


Prevent pods from running on the GPU nodes

To achieve this, add a taint to the node group that has the GPU nodes.

If you are using the AWS Console, follow these instructions:

  • Go to EKS
  • Select the cluster
  • Click on the Compute tab
  • Click on the target Node Group
  • Click on the Kubernetes Taints tab
  • Add a Kubernetes taint like this:
    • Key: GPU
    • Value: true
    • Effect: NoSchedule

Once this taint is added, no pod can be scheduled onto any node of this node group unless the pod tolerates the taint.


Allow the target pods to run on GPU nodes

Add a toleration to the target Kubernetes pod or deployment. This is a sample toleration in a Kubernetes deployment YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      tolerations:
        - key: "GPU"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"

With the tolerations field added, this pod ‘tolerates’ the taint and may be scheduled onto nodes of the node group above.

But how can we prevent these pods from being scheduled in other node groups?


Add the label to the node group

If you are using the AWS Console, follow these instructions:

  • Go to EKS
  • Select the cluster
  • Click on the Compute tab
  • Click on the target Node Group
  • Click on the Kubernetes Labels tab
  • Add a new label with the following values:
    • Key = GPU
    • Value = true

This step does not change scheduling behaviour on its own, but it is a prerequisite for the next step.


Ensure these pods can only be scheduled on this node group

To do this, it’s necessary to add a node selector.

This is a sample of nodeSelector and tolerations together in a Kubernetes deployment YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      nodeSelector:
        GPU: "true"
      tolerations:
        - key: "GPU"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"

Since the node group carries the label GPU: "true" (applied to each of its nodes), the nodeSelector guarantees that pods created by this deployment can only be scheduled on the GPU node group.
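Putting it all together, a more complete deployment might look like the sketch below. The selector/labels, container name, image, and GPU request are illustrative placeholders, and requesting nvidia.com/gpu assumes the NVIDIA device plugin from the first section is running:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        GPU: "true"                  # matches the node group label
      tolerations:
        - key: "GPU"                 # matches the node group taint
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: ml-worker            # illustrative container name
          image: my-ml-image:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1      # requires the NVIDIA device plugin
```

Requesting the GPU explicitly also lets the scheduler account for how many GPUs each node has, rather than relying on the taint and label alone.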

Wrapping up


In this blog post, we discussed how to control which node groups pods are scheduled on.

First, we saw that taints and tolerations can repel other pods from a node group; then we used labels and node selectors to ensure our pods run only on a specific node group.

The combination of these two techniques allows us to control where pods are scheduled in a very flexible way and, more importantly, solve our client’s problem.

At DNX Solutions, we work to bring a better cloud and application experience for digital-native companies in Australia.

Our current focus areas are AWS, Well-Architected Solutions, Containers, ECS, Kubernetes, Continuous Integration/Continuous Delivery, and Service Mesh and Data Solutions (movement, transformation, lakes, warehouses and analytics).

We are always hiring cloud engineers for our Sydney office, focusing on cloud-native concepts.

Check our open-source projects at https://github.com/DNXLabs and follow us on LinkedIn or Facebook.
