How to delete an rc, deployment, or service stuck in an inconsistent state
In some cases the kubectl process hangs, and a subsequent get shows that only part of the resources were deleted while the rest cannot be removed:
[root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
NAME                          DESIRED   CURRENT   READY   AGE
rc/elasticsearch-logging-v1   0         2         2       15h
NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/kibana-logging   0         1         1            1           15h
Error from server (NotFound): services "elasticsearch-logging" not found
Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
Error from server (NotFound): services "kibana-logging" not found
The commands to delete these deployments, services, or rcs are as follows:
kubectl delete deployment kibana-logging -n kube-system --cascade=false
kubectl delete deployment kibana-logging -n kube-system --ignore-not-found
kubectl delete rc elasticsearch-logging-v1 -n kube-system --force --grace-period=0
How to reset etcd when resources cannot be deleted
rm -rf /var/lib/etcd/*
After deleting the data, reboot the master node.
After resetting etcd, the overlay network configuration must be set up again:
etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'
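To confirm the key was actually written, it can be read back (a sketch; assumes the same etcdctl v2 API used above and an etcd reachable on the default endpoint):

```shell
# Read back the flannel network config written above (etcdctl v2 API)
etcdctl get /atomic.io/network/config
```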
apiserver fails to start
Every start attempt reports the following:
start request repeated too quickly for kube-apiserver.service
But this is not actually a start-frequency problem; you need to check /var/log/messages for the real cause. In my case, after enabling ServiceAccount the ca.crt and related files could not be found, which caused the startup to fail:
May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
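To surface the real error rather than the "start request repeated too quickly" message, it helps to look at the last lines the unit logged (a sketch; the unit name assumes the kube-apiserver.service shown above):

```shell
# Show the most recent startup attempts of the apiserver unit
journalctl -u kube-apiserver --no-pager | tail -n 50
# Or grep the syslog file mentioned above
grep kube-apiserver /var/log/messages | tail -n 50
```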
When deploying logging components such as fluentd, many of the problems come down to the security setup required once the ServiceAccount option is enabled, so in the end it is still a matter of configuring ServiceAccount properly.
Permission denied errors
When configuring fluentd, the error cannot create /var/log/fluentd.log: Permission denied appears. This happens because SELinux has not been disabled. In /etc/selinux/config, change SELINUX=enforcing to SELINUX=disabled, then reboot.
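The steps above can be sketched as follows (assumes CentOS 7 paths; note that setenforce 0 only switches SELinux to permissive for the running system, while the config edit plus reboot makes the change permanent):

```shell
getenforce                    # prints "Enforcing" if SELinux is currently active
setenforce 0                  # switch to permissive immediately, for testing
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
reboot                        # required for the config change to take full effect
```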
ServiceAccount-based configuration
First generate the various keys that are needed; replace k8s-master with your master's hostname.
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
openssl genrsa -out server.key 2048
# The IP is determined by the following command:
# kubectl get services --all-namespaces |grep 'default'|grep 'kubernetes'|grep '443'|awk '{print $3}'
echo subjectAltName=IP:10.254.0.1 > extfile.cnf
openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000
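To check that the subjectAltName from extfile.cnf actually made it into the server certificate, the result can be inspected (a sketch operating on the server.crt generated above):

```shell
# Print the SAN section of the freshly signed certificate
openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'
```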
If instead you change the parameters in the /etc/kubernetes/apiserver configuration file, starting via systemctl start kube-apiserver fails with:
Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied
However, the API Server can be started from the command line:
/usr/bin/kube-apiserver --logtostderr=true --v=0 \
  --etcd-servers=http://k8s-master:2379 \
  --address=0.0.0.0 --port=8080 --kubelet-port=10250 \
  --allow-privileged=true \
  --service-cluster-ip-range=10.254.0.0/16 \
  --admission-control=ServiceAccount \
  --insecure-bind-address=0.0.0.0 \
  --client-ca-file=/root/keys/ca.crt \
  --tls-cert-file=/root/keys/server.crt \
  --tls-private-key-file=/root/keys/server.key \
  --basic-auth-file=/root/keys/basic_auth.csv \
  --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &
Start the controller-manager from the command line:
/usr/bin/kube-controller-manager --logtostderr=true --v=0 \
  --master=http://k8s-master:8080 \
  --root-ca-file=/root/keys/ca.crt \
  --service-account-private-key-file=/root/keys/server.key \
  &>> /var/log/kubernetes/kube-controller-manage.log &
etcd fails to start — issue <1>
etcd is to a Kubernetes cluster what ZooKeeper is to other systems: almost every service depends on etcd being up, e.g. flanneld, apiserver, docker, and so on.
When starting etcd, the error log is as follows:
May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
May 24 13:39:28 k8s-master etcd: name = master
May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
May 24 13:39:28 k8s-master etcd: election = 1000ms
May 24 13:39:28 k8s-master etcd: snapshot count = 10000
May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
May 24 13:39:29 k8s-master systemd: etcd.service failed.
May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.
The key line is:
raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
Enter that directory, delete 0.tmp, and etcd will start again.
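Put together, the fix is just (paths taken from the log line above):

```shell
# 0.tmp was created as a directory, which breaks the WAL; remove it
rm -rf /var/lib/etcd/default.etcd/member/wal/0.tmp
systemctl restart etcd
systemctl status etcd   # confirm the service is running again
```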
etcd fails to start — timeout, issue <2>
Background: three etcd nodes were deployed, and one day all three lost power and went down at once. After rebooting, the K8S cluster turned out to be usable, but a check of the components showed that etcd on one node would not start.
Investigation showed the system time was wrong. After correcting it with ntpdate ntp.aliyun.com and restarting etcd, it still would not come up, with the following errors:
Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member ...
Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pem, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file =
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag
Solution:
The logs showed no obviously fatal error. In my experience, losing one etcd node does not have a major impact on the cluster, and indeed the cluster was already usable, but the broken etcd member still would not start. The fix is as follows:
- Go into etcd's data directory and back up the existing data:

cd /var/lib/etcd/default.etcd/member/
cp * /data/bak/
- Delete all data files under this directory:

rm -rf /var/lib/etcd/default.etcd/member/*
- Stop etcd on the other two nodes as well, because the etcd members need to be started together; once startup succeeds the cluster can be used again.

master node:
systemctl stop etcd
systemctl restart etcd

node1:
systemctl stop etcd
systemctl restart etcd

node2:
systemctl stop etcd
systemctl restart etcd
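After restarting all members, the cluster state can be verified (a sketch using the etcdctl v2 API; the endpoint and certificate paths are taken from the log above and must match your deployment):

```shell
etcdctl --endpoints=https://192.168.25.226:2379 \
  --ca-file=/opt/etcd/ssl/ca.pem \
  --cert-file=/opt/etcd/ssl/server.pem \
  --key-file=/opt/etcd/ssl/server-key.pem \
  cluster-health
```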
Configuring mutual host trust on CentOS
On each server, under the account that needs the trust relationship, run the following command to generate the public/private key pair; simply pressing Enter to accept the defaults is fine:
ssh-keygen -t rsa
You will see the public key file being generated.
Exchange public keys between the hosts; the first copy asks for a password, after that it works without one:
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)
-p specifies the port; leave it out for the default port, but if the port has been changed you must add it. You will see an authorized_keys file created under .ssh/, which records the public keys of the other servers allowed to log in to this one.
Test that you can log in:
ssh 192.168.199.132 (-p 2222)
Changing the hostname on CentOS
hostnamectl set-hostname k8s-master1
Enabling copy and paste for CentOS in VirtualBox
If a package does not install or the command produces no output, change update to install and run it again:
yum install update
yum update kernel
yum update kernel-devel
yum install kernel-headers
yum install gcc
yum install gcc make
When these finish, run sh VBoxLinuxAdditions.run.
Deleting a Pod stuck in the Terminating state
It can be force-deleted with the following command:
kubectl delete pod NAME --grace-period=0 --force
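If the pod still will not go away, clearing its finalizers sometimes helps (a sketch; NAME is a placeholder for your pod, and the namespace flag may be needed):

```shell
# Remove any finalizers blocking deletion, then force-delete
kubectl patch pod NAME -p '{"metadata":{"finalizers":null}}'
kubectl delete pod NAME --grace-period=0 --force
```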
Deleting a namespace stuck in the Terminating state
It can be force-deleted with the following script:
[root@k8s-master1 k8s]# cat delete-ns.sh
#!/bin/bash
set -e

usage(){
    echo "usage:"
    echo " delete-ns.sh NAMESPACE"
}

if [ $# -lt 1 ];then
    usage
    exit
fi

NAMESPACE=$1
JSONFILE=${NAMESPACE}.json
kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
# In the editor, remove the entries under spec.finalizers, then save
vi "${JSONFILE}"
curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFILE}" http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize
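Note that the curl in the script talks to 127.0.0.1:8001, so an API proxy must be running first. Usage then looks like this (stuck-ns is a placeholder namespace name):

```shell
kubectl proxy --port=8001 &    # expose the API server on localhost
bash delete-ns.sh stuck-ns     # edit the JSON in vi: remove the items under spec.finalizers
```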
What problems can arise when a container has valid CPU/memory requests but no limits?
Below we create such a container, with only requests set and no limits:
- name: busybox-cnt02
  image: busybox
  command: ["/bin/sh"]
  args: ["-c", "while true; do echo hello from cnt02; sleep 10; done"]
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"
What problems can this container cause once created?
In a normal environment, none. But on a node under resource pressure, a container without a limit can have its resources taken over by other pods, which may cause the application in the container to fail. This can be handled with a LimitRange policy: pods that omit limits will pick up defaults automatically, provided the LimitRange rules have been configured in advance.
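A minimal LimitRange sketch that gives such containers default limits automatically (the name and values here are illustrative assumptions, not from the source; the object applies per namespace):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits        # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    default:                  # applied as limits when a container omits them
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:           # applied as requests when omitted
      cpu: "100m"
      memory: "100Mi"
```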