1
00:00:00,850 --> 00:00:04,120
Hello and welcome to this lecture. In this lecture

2
00:00:04,120 --> 00:00:11,560
we will discuss scenarios where you might have to take down a node in your cluster, say

3
00:00:11,560 --> 00:00:19,820
for maintenance purposes, such as upgrading the base software or applying patches like security patches,

4
00:00:19,960 --> 00:00:23,300
on your cluster. In this lecture,

5
00:00:23,330 --> 00:00:27,190
we will see the options available to handle such cases.

6
00:00:27,410 --> 00:00:31,900
So you have a cluster with a few nodes and pods serving applications.

7
00:00:31,910 --> 00:00:35,700
What happens when one of these nodes goes down?

8
00:00:35,750 --> 00:00:43,340
Of course, the pods on that node are not accessible. Now, depending on how you deployed those pods, your users

9
00:00:43,400 --> 00:00:44,420
may be impacted.

10
00:00:44,420 --> 00:00:50,780
For example, since you have multiple replicas of the blue pod, the users accessing the blue application

11
00:00:50,840 --> 00:00:56,020
are not impacted, as they are being served through the other blue pod that's online.

12
00:00:56,030 --> 00:01:03,600
However, users accessing the green pod are impacted, as that was the only pod running the green application.

13
00:01:03,740 --> 00:01:07,040
Now, what does Kubernetes do in this case?

14
00:01:07,310 --> 00:01:14,120
If the node came back online immediately, then the kubelet process starts and the pods come back

15
00:01:14,120 --> 00:01:14,630
online.

16
00:01:15,170 --> 00:01:22,010
However, if the node was down for more than five minutes, then the pods are terminated from that node.

17
00:01:22,690 --> 00:01:25,570
Well, Kubernetes considers them dead.

18
00:01:25,730 --> 00:01:31,550
If the pods were part of a ReplicaSet, then they are recreated on other nodes.

19
00:01:31,880 --> 00:01:38,420
The time Kubernetes waits for a node to come back online before evicting its pods is known as the pod eviction timeout and is set on

20
00:01:38,420 --> 00:01:42,200
the controller manager with a default value of five minutes.

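A side note for reference: on a kubeadm-style cluster, the controller manager runs as a static pod, so one way to check this setting is to look at its manifest. A minimal sketch, assuming the default manifest path; note that newer Kubernetes versions may deprecate this flag in favor of taint-based evictions:

    # look for the flag; if absent, the 5m0s default applies
    grep pod-eviction-timeout /etc/kubernetes/manifests/kube-controller-manager.yaml
    # it would appear among the command arguments as:
    #   --pod-eviction-timeout=5m0s
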
21
00:01:42,530 --> 00:01:43,520
So whenever a node

22
00:01:43,550 --> 00:01:51,100
goes offline, the master node waits for up to five minutes before considering the node dead.

23
00:01:51,320 --> 00:01:58,640
When the node comes back online after the pod eviction timeout, it comes up blank, without any pods scheduled

24
00:01:58,640 --> 00:01:59,390
on it.

25
00:01:59,390 --> 00:02:03,620
Since the blue pod was part of a ReplicaSet, it had a new pod created

26
00:02:03,650 --> 00:02:09,860
on another node. However, since the green pod was not part of a ReplicaSet, it's just gone.

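To see this for yourself, kubectl can show which node each pod landed on; the -o wide output adds a NODE column:

    kubectl get pods -o wide
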
27
00:02:11,090 --> 00:02:16,910
Thus, if you have maintenance tasks to be performed on a node, if you know that the workloads running

28
00:02:16,970 --> 00:02:22,390
on the node have other replicas, if it's okay that they go down for a short period of time,

29
00:02:22,430 --> 00:02:28,370
and if you're sure the node will come back online within five minutes, you can make a quick upgrade

30
00:02:28,460 --> 00:02:29,480
and reboot.

31
00:02:29,570 --> 00:02:35,830
However, you do not know for sure if the node is going to be back online in five minutes.

32
00:02:35,870 --> 00:02:40,840
In fact, you cannot say for sure that it is going to come back at all.

33
00:02:40,850 --> 00:02:43,400
So there is a safer way to do it.

34
00:02:43,430 --> 00:02:49,670
You can purposefully drain the node of all the workloads so that the workloads are moved to other nodes

35
00:02:49,730 --> 00:02:51,010
in the cluster.

36
00:02:51,020 --> 00:02:54,950
Well, technically they are not moved. When you drain the node,

37
00:02:55,010 --> 00:03:01,190
the pods are gracefully terminated from the node that they're on and recreated on another.

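A minimal sketch of the drain command, assuming a node named node-1 (the name is illustrative); --ignore-daemonsets is commonly required because DaemonSet pods cannot be evicted to another node:

    kubectl drain node-1 --ignore-daemonsets
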
38
00:03:01,190 --> 00:03:09,560
The node is also cordoned, or marked as unschedulable, meaning no pods can be scheduled on this node until

39
00:03:09,620 --> 00:03:12,940
you specifically remove the restriction.

40
00:03:12,980 --> 00:03:19,200
Now that the pods are safe on the other nodes, you can reboot the first node. When it comes back online,

41
00:03:19,220 --> 00:03:21,170
it is still unschedulable.

42
00:03:21,320 --> 00:03:26,170
You then need to uncordon it so that pods can be scheduled on it again.

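Again with the illustrative node name node-1:

    kubectl uncordon node-1
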
43
00:03:26,170 --> 00:03:32,420
Now, remember: the pods that were moved to the other nodes don't automatically fall back. If any of those

44
00:03:32,420 --> 00:03:36,440
pods were deleted or if new pods were created in the cluster,

45
00:03:36,590 --> 00:03:38,000
then they would be created

46
00:03:38,000 --> 00:03:46,550
on this node. Apart from drain and uncordon, there is also another command called cordon. Cordon simply

47
00:03:46,550 --> 00:03:48,300
marks a node unschedulable.

48
00:03:48,720 --> 00:03:49,830
Unlike drain

49
00:03:49,910 --> 00:03:53,820
it does not terminate or move the pods on the node.

50
00:03:53,900 --> 00:03:57,470
It simply makes sure that new pods are not scheduled

51
00:03:57,470 --> 00:03:58,160
on that node.

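A sketch with the same hypothetical node name; after cordoning, the node's STATUS in kubectl get nodes shows SchedulingDisabled:

    kubectl cordon node-1
    kubectl get nodes
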
52
00:03:59,150 --> 00:04:01,130
Well that's it for this lecture.

53
00:04:01,130 --> 00:04:06,150
Head over to the practice test and practice draining, cordoning and uncordoning a node.

54
00:04:06,470 --> 00:04:08,300
I will see you in the next lecture.
