Introduction
When dealing with multiple node pools, you usually want to configure node affinity so that pods stick to nodes with a specific characteristic.
The reasons for this can be manifold. For example, you may want to take advantage of specialized hardware or resources on specific nodes, such as GPUs or high-memory nodes. It can also help increase security by running sensitive workloads on separate nodes. Another use case is saving money, e.g., by separating application environments (DEV, QA, UAT, PROD) onto different types of node pools.
Whatever your reasons are, this post will show you two methods to bind pods to node pools. The first is to use nodeSelector, and the second is called node affinity, which is conceptually similar but more expressive and also allows specifying soft rules. Let's dive in!
Option 1: Using the nodeSelector
This is the simplest way to bind pods to nodes. All you need to do is optionally add labels to your nodes and then add the nodeSelector field to your pod specification.
💡 Depending on your use case, you might want to consider using the labels auto-created by Azure, for example agentpool=foobarpool!
Labeling the nodes (optional)
You'd have to issue the Azure CLI command below to label an existing node pool.
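A minimal sketch of such a command, assuming a cluster named myAKSCluster in the resource group myResourceGroup, the node pool foobarpool, and the custom label tier=general-purpose used later in this post (adjust the names and label to your environment):

# Add a custom label to an existing AKS node pool
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name foobarpool \
  --labels tier=general-purpose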
All nodes in the pool will inherit this label. You can check the result with kubectl get nodes --show-labels or, alternatively, use the Azure CLI.
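For example (the resource group, cluster, and node pool names are the placeholders from above):

# Show all nodes together with their labels
kubectl get nodes --show-labels

# Or inspect the node pool's labels via the Azure CLI
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name foobarpool \
  --query nodeLabels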
Set the nodeSelector
The nodeSelector field belongs to the PodSpec and is the simplest recommended form of node selection constraint. It expects a map, i.e. a collection of key-value pairs, that you'd usually set in the PodTemplateSpec of your deployment manifest.

nodeSelector (map[string]string)
According to the syntax, both of the manifests below are valid.
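Here is a sketch of what they could look like, assuming the agentpool=foobarpool label from above plus a hypothetical second label, env=production, for illustration:

apiVersion: v1
kind: Pod
metadata:
  namespace: demo
  name: myapp
  labels:
    name: myapp
spec:
  containers:
    - name: myapp
      image: nginx:latest
  nodeSelector:
    agentpool: foobarpool

apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  nodeSelector:
    agentpool: foobarpool
    env: production   # hypothetical second label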
However, Kubernetes only schedules pods onto nodes that have each of the specified labels (AND condition). So in our example, the second pod won't be scheduled on the foobarpool node pool.
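That is because the nodes in that pool only carry the agentpool=foobarpool label, not the additional env=production label from the sketch above.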
If the condition defined by the nodeSelector cannot be fulfilled, the pod won't get scheduled and will be stuck in the Pending state.
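You can check this with kubectl, using the namespace and pod name from the manifests in this post:

# The pod stays in Pending if no node satisfies the nodeSelector
kubectl get pod myapp -n demo

# The events at the end of the output explain why
# (e.g. a FailedScheduling event about an unmatched node selector)
kubectl describe pod myapp -n demo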
Verify the result
Lastly, let's verify that the pods end up on the expected nodes/node pool.
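For example, assuming the pod from above (node names will differ in your cluster):

# The NODE column shows which node the pod was scheduled on
kubectl get pods -n demo -o wide

# Alternatively, print just the node name of the pod
kubectl get pod myapp -n demo -o jsonpath='{.spec.nodeName}'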
As depicted in the output, the pod runs on a node belonging to the foobarpool node pool.
Option 2: Using affinity
As mentioned in the introduction, nodeSelector is a simple and quick way to configure node affinity. However, a second option provides more granular control... the object's name is affinity. Let's have a look at it.
According to the Kubernetes API, the affinity object can take three different types of constraints, which are nodeAffinity, podAffinity, and podAntiAffinity.
NodeAffinity allows binding pods to nodes, whereas podAffinity and podAntiAffinity allow grouping multiple pods together on a single node or keeping them apart from each other, respectively (this blog post will only deal with nodeAffinity).
Skimming further through the API documentation, we can see that nodeAffinity can take two types of affinity scheduling rules, which are:
- preferredDuringSchedulingIgnoredDuringExecution
- requiredDuringSchedulingIgnoredDuringExecution
The first rule, preferredDuringSchedulingIgnoredDuringExecution, is a soft rule. It indicates a preference for nodes with the specified label values for the pod to be scheduled on. However, the scheduler may also choose a node that violates one or more of the expressions if no matching node is available.
The second rule, requiredDuringSchedulingIgnoredDuringExecution, is a hard rule, meaning that if the node selection expression doesn't resolve to a node, the pod won't get scheduled. This is the behavior we've already seen with option 1 when using the simple nodeSelector.
The hard rule
Let's start with requiredDuringSchedulingIgnoredDuringExecution and mimic the nodeSelector behavior from the first option. The pod defined below will only get scheduled if a node with the label agentpool=foobarpool is available.
apiVersion: v1
kind: Pod
metadata:
  namespace: demo
  name: myapp
  labels:
    name: myapp
spec:
  containers:
    - name: myapp
      image: nginx:latest
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
      ports:
        - containerPort: 80
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - foobarpool
This type of node selector rule provides a lot of additional flexibility. For example, you can add multiple matchExpressions blocks to form OR conditions. The pod manifest below will get scheduled on nodes having one OR the other label.
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - foobarpool
          - matchExpressions:
              - key: tier
                operator: In
                values:
                  - memory-optimized
To form AND conditions, you'd add multiple keys, e.g. like so. Here, only nodes that have both labels set will be selected.
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - foobarpool
              - key: tier
                operator: In
                values:
                  - memory-optimized
🔎 It's worth noting that the operator allows for additional flexibility and can take the following arguments: DoesNotExist, Exists, Gt, Lt, In, NotIn.
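As a small sketch, a hard rule using some of these operators could look like the snippet below. The tier key is the label from earlier examples; the cpu-count label is a hypothetical one, added purely for illustration:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Only consider nodes that carry a tier label at all
            - key: tier
              operator: Exists
            # ...and whose (hypothetical) cpu-count label value is greater than 4
            - key: cpu-count
              operator: Gt
              values:
                - "4"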
The soft rule
As already mentioned, the soft rule defines a preference. The affinity definition below will give priority to a node with the label tier=general-purpose. If that preference can't be fulfilled, a different node in the Ready state will be selected.
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: tier
                operator: In
                values:
                  - general-purpose
          weight: 100
You can even define weights, which act as a tiebreaker in case multiple conditions are fulfilled. The definition below will first select a node with the label tier=general-purpose; if such a node is not available, it will look for one with agentpool=foobarpool; and if such a node is also not available, it will choose whichever node is in the Ready state.
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: tier
                operator: In
                values:
                  - general-purpose
          weight: 100
        - preference:
            matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - foobarpool
          weight: 90
The preferences above are evaluated by an OR condition. But nothing stops us from combining them. Below, the evaluation term becomes something like:
1 * (tier=general-purpose && agentpool=memory-optimized) || 0.9 * agentpool=foobarpool || any node in state ready
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: tier
                operator: In
                values:
                  - general-purpose
              - key: agentpool
                operator: In
                values:
                  - memory-optimized
          weight: 100
        - preference:
            matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - foobarpool
          weight: 90
So far, we have only matched against labels by using matchExpressions. But there is another node selector term that can be used for selecting by fields, called, well, matchFields. Here is an example:
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - aks-foobarpool-37985905-vmss000000
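The node name above is just an example from this cluster; to find the node names in your own cluster, simply list the nodes:

# List all nodes; the first column shows the name to use in metadata.name
kubectl get nodes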
That's it! I hope you enjoyed reading this article. In case of any questions or comments, please leave a message! Happy scheduling! 🤓
Summary
- A label added to an AKS node pool will get inherited by all nodes
- Azure auto-creates a label for each node pool, e.g., agentpool=foobarpool
- nodeSelector is a hard constraint that, if unfulfillable, can lead to unscheduled pods
- nodeSelector can take multiple labels, which all need to be fulfilled
- affinity.nodeAffinity allows binding pods to nodes, whereas affinity.podAffinity and affinity.podAntiAffinity allow grouping multiple pods together on a single node or keeping them apart from each other, respectively
- requiredDuringSchedulingIgnoredDuringExecution is a hard affinity rule similar to nodeSelector. If the conditions are not fulfilled, the pod won't get scheduled
- preferredDuringSchedulingIgnoredDuringExecution is a soft rule defining preferences. If the expressions don't match, the pod can still get scheduled on another node
- To match against labels, use matchExpressions; to match against fields, use matchFields