How to fix: The connection to the server :6443 was refused – did you specify the right host or port?

Problem description

After a power outage, the Kubernetes cluster went down. Every kubectl command fails with: The connection to the server ip:6443 was refused – did you specify the right host or port. Restarting kubelet does not help; etcd fails while reading its data because the data file is corrupted.

Troubleshooting

First, check whether the kubelet service is running and whether it reports any errors.

[root@pengfei-master1 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2023-07-19 17:47:06 CST; 35min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 3087 (kubelet)
    Tasks: 14
   Memory: 46.8M
   CGroup: /system.slice/kubelet.service
           └─3087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/con...

Jul 19 18:22:36 pengfei-master1 kubelet[3087]: E0719 18:22:36.296785    3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
Jul 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.105683    3087 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
Jul 19 18:22:37 pengfei-master1 kubelet[3087]: E0719 18:22:37.149596    3087 eviction_manager.go:256] "Eviction manager: failed to get summary stats" err="fa...not found"

There are errors: the node "pengfei-master1" cannot be found. Checking the kubelet journal shows the same errors:

[root@pengfei-master1 ~]# journalctl -u kubelet
Jul 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.711540    3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
Jul 19 17:34:00 pengfei-master1 kubelet[3480]: E0719 17:34:00.813511    3480 kubelet.go:2448] "Error getting node" err="node \"pengfei-master1\" not found"
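Before digging into the containers, a quick sanity check is to see whether anything is listening on the apiserver port at all. A minimal sketch, assuming the kubeadm default port 6443:

# Is anything listening on the apiserver port?
ss -lntp | grep 6443
# Probe the apiserver health endpoint directly (this will fail if it is down)
curl -k https://127.0.0.1:6443/healthz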

Next, confirm whether the api-server itself is down; if it is, check its logs.

Check whether the apiserver is down

Note: if you are using Docker, run docker ps -a | grep kube-apiserver instead.

[root@pengfei-master1 ~]# crictl ps -a | grep kube-apiserver
I0719 18:02:28.083463    4551 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
e6284e624f40a       4d2edfd10d3e3       2 minutes ago       Exited              kube-apiserver            62                  0b9b24371d25f       kube-apiserver-pengfei-master1

As shown, the apiserver container has exited (62 restarts so far).
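Even though the container has exited, its logs can still be read with crictl. A quick sketch, using the container ID from the output above (with Docker, docker logs works the same way):

# Show the last lines of the exited apiserver container's log
crictl logs --tail 50 e6284e624f40a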

Check whether etcd is down

Note: if you are using Docker, run docker ps -a | grep etcd instead.

[root@pengfei-master1 ~]# crictl ps -a | grep etcd
I0719 18:03:10.350597    4563 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
9bc45fb63f604       a8a176a5d5d69       5 minutes ago       Exited              etcd                      90                  614e62c0c8ed0       etcd-pengfei-master1

etcd has exited as well.

Next, check the etcd logs to find out why it exited.

The logs live under /var/log/pods/kube-system_etcd… (use tab completion here, since the directory name includes the pod UID; open the directory and inspect the error log, then handle the problem it points to). In my case the path is /var/log/pods/kube-system_etcd-pengfei-master1_df91a367268810494a84463207726090/etcd
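If you don't want to tab-complete the UID, a shell glob works too. A small sketch, assuming the default kubeadm log layout:

# List the etcd pod log directories (the directory name includes the pod UID)
ls -d /var/log/pods/kube-system_etcd*
# Tail the most recent etcd container log
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log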

Error details

2023-07-19T18:03:17.046456705+08:00 stderr F panic: freepages: failed to get all reachable pages (page 700: multiple references)
2023-07-19T18:03:17.046481181+08:00 stderr F
2023-07-19T18:03:17.046483923+08:00 stderr F goroutine 78 [running]:
2023-07-19T18:03:17.046485818+08:00 stderr F go.etcd.io/bbolt.(*DB).freepages.func2(0xc00007c5a0)
2023-07-19T18:03:17.046487431+08:00 stderr F    /go/pkg/mod/go.etcd.io//db.go:1056 +0xe9
2023-07-19T18:03:17.046524506+08:00 stderr F created by go.etcd.io/bbolt.(*DB).freepages
2023-07-19T18:03:17.046528444+08:00 stderr F    /go/pkg/mod/go.etcd.io//db.go:1054 +0x1cd

etcd hit an error while reading its data, so it fails to start; as a result the api-server cannot start either.
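The panic comes from bbolt, the storage engine underneath etcd, so the corruption can be confirmed directly against the database file. A sketch, assuming Go is available to install the bbolt CLI and that etcd uses the default kubeadm data directory:

# Install the bbolt CLI (assumption: Go toolchain is installed; adjust to your setup)
go install go.etcd.io/bbolt/cmd/bbolt@latest
# Run a consistency check against etcd's bbolt database file
bbolt check /var/lib/etcd/member/snap/db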

The etcd data file is corrupted, so the data needs to be restored. This is a lab environment and I never set up etcd backups, so the only option left is to reset the cluster.

Note: in production, always run etcd with high availability and take regular backups, or you will be in serious trouble.
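For reference, a single snapshot can be taken with etcdctl. A minimal sketch, assuming a kubeadm cluster with the default certificate paths; schedule something like this with cron for regular backups:

# Take an etcd snapshot (kubeadm's default cert locations are assumed)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key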

Reset the k8s cluster

This needs to be run on every machine:

kubeadm reset

Delete $HOME/.kube:

rm -rf $HOME/.kube
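kubeadm reset itself warns that it does not clean up CNI configuration or iptables/IPVS rules, so a more thorough cleanup might look like the sketch below (run with care: these commands flush all iptables rules on the node):

# Remove leftover CNI configuration
rm -rf /etc/cni/net.d
# Flush iptables rules left behind by kube-proxy/CNI
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# If kube-proxy ran in IPVS mode, clear the IPVS tables too
ipvsadm --clear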

Initialize the cluster


Run this on the master node only:

kubeadm init --config=kubeadm.yaml --ignore-preflight-errors=SystemVerification
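Reuse the original kubeadm.yaml from the first install if you still have it. In case it is gone, here is a minimal sketch of what it might contain; every value is an assumption to adapt (Kubernetes version and pod CIDR must match your environment and your CNI plugin):

cat > kubeadm.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.5.132   # master IP, taken from the join commands below
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0          # assumption: set your actual version
networking:
  podSubnet: 10.244.0.0/16          # assumption: must match your CNI's pod CIDR
  serviceSubnet: 10.96.0.0/12
EOF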

Create the required files

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Rejoin the worker nodes to the cluster

Print the command for joining worker nodes to the cluster:

kubeadm token create --print-join-command

On worker node node1:

kubeadm join 192.168.5.132:6443 --token abcdef.0123456789abcdef \
	--discovery-token-ca-cert-hash sha256:843aae90033aa0f5a3bff1fc8fc977aeea2f423e50b5b991dfb0f5c9971a3c1b

On worker node node2:

kubeadm join 192.168.5.132:6443 --token abcdef.0123456789abcdef \
	--discovery-token-ca-cert-hash sha256:843aae90033aa0f5a3bff1fc8fc977aeea2f423e50b5b991dfb0f5c9971a3c1b

Check node status

kubectl get node

At this point the cluster nodes are back to normal.

However, checking the kube-system components shows that calico is missing and coredns is not ready (coredns pods stay Pending until a CNI plugin is installed).


Reinstall calico

# Download the calico yaml file; it can be used as-is, no modification needed
wget https://docs.projectcalico.org/manifests/calico.yaml
# Install calico
kubectl apply -f calico.yaml
# Check the system components again
kubectl get pod -n kube-system
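To confirm that both calico and coredns come up, you can wait on their pods. A short sketch, assuming the standard labels from the official manifests (k8s-app=calico-node and k8s-app=kube-dns):

# Wait for the calico node agents to become Ready
kubectl -n kube-system wait --for=condition=Ready pod -l k8s-app=calico-node --timeout=300s
# Check that the coredns pods are Running
kubectl -n kube-system get pod -l k8s-app=kube-dns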

Test calico and coredns

Run the following commands:

# Create a busybox pod for testing
kubectl run busybox --image docker.io/library/busybox:1.28 --image-pull-policy=IfNotPresent --restart=Never --rm -it -- sh
# Inside the pod: test the calico network
ping www.baidu.com
# Test DNS resolution
nslookup kubernetes.default.svc.cluster.local

Only at this point is the cluster truly back to normal.

If you want every node to be able to run kubectl commands, just copy the master node's $HOME/.kube directory to the other nodes.
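For example, from the master node (node1 and node2 are the worker hostnames used above; adjust to yours):

# Copy the kubeconfig directory to each worker node
scp -r $HOME/.kube node1:$HOME/
scp -r $HOME/.kube node2:$HOME/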

etcd backup and restore

This will be covered in detail in the next article.