Balance your Kubernetes cluster



I have been working with Kubernetes for many years. In my opinion it is a super versatile and flexible system that allows us to deploy and manage applications in the cloud. You could say that the adoption of the cloud can be divided into two eras, before and after Kubernetes.

Since Google released Kubernetes as an open source project, innumerable related projects (open source as well) have been born, and when I say innumerable, I am very serious. The Kubernetes landscape is gigantic: new projects appear every day, and it is quite difficult to keep up with them. This is due to the great reception Kubernetes has had and its adoption by all types of companies, which have begun to use it as the core of their cloud applications.

One of the main challenges I’ve encountered over the years is getting the default Kubernetes scheduler, kube-scheduler, to work in a more practical way. This is difficult to achieve, as each project, company or application has its own requirements, its own workflows, and its own lifecycle.

For the sake of theory, the Kubernetes scheduler watches for newly created Pods that have no Node assigned. For every Pod that it discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on.

The scheduler’s decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, you may want to move already running pods to other nodes for various reasons:

• Some nodes are underutilized or overutilized.

• The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.

• Some nodes failed and their pods moved to other nodes.

• New nodes are added to clusters. For instance, you upgrade the cluster and the last node has no pods running in it.
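
To see whether any of these situations applies to your cluster, it helps to look at how many pods each node is running. The following is a minimal sketch, not something the descheduler ships: the helper name pods_per_node is hypothetical, and it assumes you feed it one node name per line, as kubectl prints with a custom-columns output.

```shell
# Count how many pods run on each node.
# Reads one node name per line on stdin and prints "<count> <node>",
# with the busiest node first. The function name is illustrative.
pods_per_node() {
  awk 'NF { count[$1]++ } END { for (node in count) print count[node], node }' |
    sort -rn
}

# Usage against a live cluster (pods not yet scheduled show up as <none>):
#   kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

A node that appears with a very low count (or does not appear at all) next to nodes with a high count is a hint that the cluster may be unbalanced.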

kube-scheduler is the default Kubernetes scheduler and is part of the Kubernetes control plane. Of course, and this is part of the versatility of Kubernetes, you can write your own scheduler to suit your needs if you want.

But let’s say you don’t have a developer with Go experience and knowledge of the Kubernetes API and machinery. In fact, very few professionals have these skills at this time. However, there are some options that can help us correct and modify, in some way, the distribution of pods across the cluster.

A very simple tool is the descheduler. This is a controller that can be deployed to a Kubernetes cluster and, based on its policy, finds pods that can be moved to other nodes in order to distribute the workload. It does not place pods itself: it evicts them, and the scheduler then places the replacement pods.

Let’s see how it works. As a cluster administrator you have two alternatives:

  • Run the descheduler as a Job (one-time run)
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/base/rbac.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/base/configmap.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/job/job.yaml
  • Run the descheduler as a CronJob (runs periodically)
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/base/rbac.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/base/configmap.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/cronjob/cronjob.yaml

That’s it. Once it runs, you will notice that pods are recreated from one node to another if the descheduler finds that the cluster is not balanced. BE AWARE: your pods will be recreated, so make sure you plan for a maintenance window, or run it in a non-production environment first for testing purposes.

In order to get the output of the descheduler job, let’s find the pods first:

kubectl get pods -A | grep -i descheduler-job

Then follow the logs of that pod. Example output:

kubectl logs -f descheduler-job-2b45f -n kube-system
I1031 20:19:55.444174       1 node.go:45] node lister returned empty list, now fetch directly
I1031 20:19:55.450398       1 duplicates.go:73] Processing node: "ip-x-x-x-x.us-east-1.compute.internal"
I1031 20:19:55.469022       1 duplicates.go:73] Processing node: "ip-x-x-x-x.us-east-1.compute.internal"
I1031 20:19:55.483218       1 duplicates.go:73] Processing node: "ip-x-x-x-x.us-east-1.compute.internal"
I1031 20:19:55.522985       1 evictions.go:117] Evicted pod: "nginx-57d9c4485b-c6ql9" in namespace "agic" (RemoveDuplicatePods)
I1031 20:19:55.523228       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"agic", Name:"nginx-57d9c4485b-c6ql9", UID:"7f83e183-1bb2-11eb-83ef-0a827969ac4c", APIVersion:"v1", ResourceVersion:"83477556", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)
I1031 20:19:55.537734       1 evictions.go:117] Evicted pod: "nginx-57d9c4485b-gb7gh" in namespace "agic" (RemoveDuplicatePods)
I1031 20:19:55.537959       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"agic", Name:"nginx-57d9c4485b-gb7gh", UID:"80ba5394-1bb2-11eb-83ef-0a827969ac4c", APIVersion:"v1", ResourceVersion:"83477981", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)
I1031 20:19:55.559010       1 evictions.go:117] Evicted pod: "nginx-57d9c4485b-mk7pm" in namespace "agic" (RemoveDuplicatePods)
I1031 20:19:55.559380       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"agic", Name:"nginx-57d9c4485b-mk7pm", UID:"850ca2f4-1bb2-11eb-83ef-0a827969ac4c", APIVersion:"v1", ResourceVersion:"83478928", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)
I1031 20:19:55.586833       1 evictions.go:117] Evicted pod: "nginx-57d9c4485b-ppr2j" in namespace "agic" (RemoveDuplicatePods)
I1031 20:19:55.587047       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"agic", Name:"nginx-57d9c4485b-ppr2j", UID:"84e6751b-1bb2-11eb-83ef-0a827969ac4c", APIVersion:"v1", ResourceVersion:"83479389", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler (RemoveDuplicatePods)
I1031 20:20:15.673297       1 lownodeutilization.go:101] Criteria for a node under utilization: CPU: 20, Mem: 20, Pods: 100
I1031 20:20:15.673307       1 lownodeutilization.go:108] Total number of underutilized nodes: 6
I1031 20:20:15.673313       1 lownodeutilization.go:116] all nodes are underutilized, nothing to do here
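
On a busy cluster this log gets long, so a quick summary of how many pods each strategy evicted can be handy. Here is a small sketch under one assumption: that the eviction lines look like the `Evicted pod: "..." in namespace "..." (Strategy)` lines above. Other descheduler versions may log a different format, and the helper name evictions_by_strategy is made up for this example.

```shell
# Summarize descheduler evictions per strategy.
# Reads descheduler log lines on stdin and prints "<count> <strategy>".
# Only lines containing 'Evicted pod: "<pod>" in namespace "<ns>" (<strategy>)'
# are counted; the Event(...) lines and everything else are ignored.
evictions_by_strategy() {
  sed -n 's/.*Evicted pod: "[^"]*" in namespace "[^"]*" (\([^)]*\)).*/\1/p' |
    awk '{ count[$1]++ } END { for (s in count) print count[s], s }' |
    sort -rn
}

# Usage (the pod name is the one from the example above):
#   kubectl logs descheduler-job-2b45f -n kube-system | evictions_by_strategy
```

For the log shown above, this would print a single line, 4 RemoveDuplicatePods, since four nginx pods were evicted by that strategy and LowNodeUtilization evicted none.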

The descheduler is far more versatile than this, and you can define your own policy and strategies. For more information, please see the README in the kubernetes-sigs/descheduler repository.
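
To give an idea of what such a policy looks like, here is a hedged sketch that writes one to a local file. It assumes the v1alpha1 policy format, which matches the era of the log output in this post; newer descheduler releases may use a different schema, so check the README for the version you deploy. The file name and all threshold values are illustrative assumptions, not the values used in my cluster.

```shell
# Write an illustrative v1alpha1 DeschedulerPolicy to a local file.
# The file name and every threshold value below are assumptions for illustration.
POLICY_FILE="${POLICY_FILE:-descheduler-policy.yaml}"

cat > "$POLICY_FILE" <<'EOF'
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # Evict extra replicas of the same workload that landed on one node.
  "RemoveDuplicates":
    enabled: true
  # Move pods off busy nodes when other nodes sit below the thresholds.
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # A node below all of these percentages counts as underutilized.
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        # A node above any of these percentages counts as overutilized.
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
EOF

echo "wrote $POLICY_FILE"
```

In the manifests used earlier, the policy is delivered through the ConfigMap created from base/configmap.yaml, so a policy like this one would replace the content of that ConfigMap rather than be applied on its own.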

As usual, if you have any questions, send me a message at contact@wecloudpro.com
