1
00:00:02,440 --> 00:00:07,360
In this lecture we will see different ways of troubleshooting worker node failures.

2
00:00:07,600 --> 00:00:13,630
Again, we start by checking the status of the nodes in the cluster. Are they reported as Ready or

3
00:00:13,630 --> 00:00:14,580
NotReady?
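As a quick first check, the node statuses can be listed with:

```shell
# List all nodes and their status (Ready / NotReady)
kubectl get nodes
```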

4
00:00:14,890 --> 00:00:20,740
If they are reported as Not Ready, check details about the nodes using the kubectl describe node

5
00:00:20,770 --> 00:00:27,290
command. Each node has a set of conditions that can point us in a direction as to why a node might have

6
00:00:27,290 --> 00:00:34,790
failed. Depending on the status, they are set to either true, false, or unknown. When the node is out

7
00:00:34,790 --> 00:00:35,810
of disk space,

8
00:00:35,900 --> 00:00:41,840
the OutOfDisk flag is set to true. When a node is out of memory, the MemoryPressure flag is set to

9
00:00:41,840 --> 00:00:47,650
true. When the disk capacity is low, the DiskPressure flag is set to true.

10
00:00:47,660 --> 00:00:52,870
Similarly, when there are too many processes, the PIDPressure flag is set to true.

11
00:00:53,000 --> 00:00:57,620
And finally, if the node as a whole is healthy, the Ready flag is set to true.
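These conditions appear in the Conditions section of kubectl describe node, or they can be pulled out directly with a jsonpath query (node01 is a placeholder node name):

```shell
# Full detail, including the Conditions table
kubectl describe node node01

# Just the condition types and their current statuses
kubectl get node node01 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
```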

12
00:00:57,680 --> 00:01:04,190
When a worker node stops communicating with the master, possibly due to a crash, these statuses are set to

13
00:01:04,190 --> 00:01:05,450
unknown.

14
00:01:05,480 --> 00:01:08,570
This can indicate a possible loss of a node.

15
00:01:08,690 --> 00:01:16,090
Check the LastHeartbeatTime field to find out the time when the node might have crashed. In such cases,

16
00:01:16,240 --> 00:01:18,970
proceed to checking the status of the node itself.
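The heartbeat times can also be listed per condition with a jsonpath query (node01 is a placeholder node name):

```shell
# Each condition type alongside its last heartbeat time
kubectl get node node01 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.lastHeartbeatTime}{"\n"}{end}'
```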

17
00:01:19,150 --> 00:01:24,340
Check if the node is online at all or has crashed. If it has crashed, bring it back up.
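A rough sketch of that check, assuming SSH access and that node01 resolves to the worker node:

```shell
# Is the node reachable at all?
ping -c 2 node01

# If it responds, log in to inspect it further
ssh node01
```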

18
00:01:24,670 --> 00:01:28,320
Check for possible CPU, memory, and disk space issues on the node.
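On the node itself, the usual suspects can be checked with standard Linux tools:

```shell
# CPU and memory usage, one non-interactive snapshot
top -b -n 1 | head -n 20

# Disk space per filesystem
df -h

# Memory summary in megabytes
free -m
```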

19
00:01:29,560 --> 00:01:31,030
Check the status of the kubelet.
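On a systemd-based node, for example:

```shell
# Is the kubelet service active, and if not, why did it exit?
systemctl status kubelet

# Start it again if it is down
sudo systemctl start kubelet
```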

20
00:01:33,730 --> 00:01:36,250
Check the kubelet logs for possible issues.
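On a systemd-based node the kubelet logs live in the journal:

```shell
# Recent kubelet log entries; -f follows new entries as they arrive
sudo journalctl -u kubelet -f
```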

21
00:01:39,460 --> 00:01:45,880
Check the kubelet certificates. Ensure they are not expired and that they are part of the right group, and that

22
00:01:45,880 --> 00:01:48,260
the certificates are issued by the right CA.
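One way to inspect a kubelet certificate, assuming the typical kubeadm default path (adjust for your setup):

```shell
# Decode the kubelet client certificate and check:
#  - Validity: the Not After date has not passed
#  - Issuer: the cluster CA (e.g. CN = kubernetes)
#  - Subject: O = system:nodes group, CN = system:node:<node-name>
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
```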

23
00:01:48,820 --> 00:01:50,710
Well that's it for this lecture.

24
00:01:50,710 --> 00:01:55,150
Head over to the practice test and practice fixing broken clusters.
