A Summary of Common Kubernetes Problems

How to delete rc, deployment, and service objects stuck in an inconsistent state

In some situations the kubectl process appears to hang, and a subsequent get shows that only part of the resources were deleted while the rest cannot be removed:

[root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
NAME DESIRED CURRENT READY AGE
rc/elasticsearch-logging-v1 0 2 2 15h
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/kibana-logging 0 1 1 1 15h
Error from server (NotFound): services "elasticsearch-logging" not found
Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
Error from server (NotFound): services "kibana-logging" not found

The commands to delete these deployments, services, or rcs are as follows:

kubectl delete deployment kibana-logging -n kube-system --cascade=false
kubectl delete deployment kibana-logging -n kube-system  --ignore-not-found
kubectl delete rc elasticsearch-logging-v1 -n kube-system --force --grace-period=0

 

How to reset etcd when resources still cannot be deleted

rm -rf /var/lib/etcd/*

After deleting the data, reboot the master node.

After resetting etcd, the network configuration must be set up again:

etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'
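
After recreating the key, the network daemons have to pick up the new config. A minimal sketch, assuming flanneld and docker run under systemd and flanneld was started with /atomic.io/network as its etcd prefix:

# Restart flanneld so it leases a subnet from the new network config,
# then docker so its bridge is rebuilt from flannel's settings
systemctl restart flanneld
systemctl restart docker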

 

apiserver fails to start

Every start attempt reports the following:

start request repeated too quickly for kube-apiserver.service

But this is not actually a start-frequency problem; check /var/log/messages. In my case, enabling ServiceAccount caused the startup failure because ca.crt and related files could not be found:

May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.

When deploying fluentd and other logging components, many problems arise because enabling the ServiceAccount option requires the security settings to be configured; so ultimately it comes down to configuring ServiceAccount properly.

 

Permission denied errors

When configuring fluentd, the error cannot create /var/log/fluentd.log: Permission denied appears. This is caused by SELinux being enabled. Change SELINUX=enforcing to disabled in /etc/selinux/config, then reboot.
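
As a sketch, assuming a stock CentOS 7 system, the same change can also take effect without waiting for the reboot:

# Switch SELinux to permissive mode for the running system (immediate effect)
setenforce 0

# Persist the change across reboots
sed -i 's/^SELINUX=enforcing$/SELINUX=disabled/' /etc/selinux/config

# Verify the current mode
getenforce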

 

ServiceAccount-based configuration

First generate the required keys; replace k8s-master with the master's hostname.

openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
openssl genrsa -out server.key 2048
echo subjectAltName=IP:10.254.0.1 > extfile.cnf
# The IP is determined by the command below:
#kubectl get services --all-namespaces |grep 'default'|grep 'kubernetes'|grep '443'|awk '{print $3}'
openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000
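
A quick sanity check on the generated certificates can save a failed apiserver start later; both commands below are standard openssl usage, run in the directory holding the files above:

# Verify the server certificate chains to the CA
openssl verify -CAfile ca.crt server.crt

# Confirm the SAN carries the kubernetes service cluster IP (10.254.0.1 here)
openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'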

If you change the parameters in the /etc/kubernetes/apiserver config file, starting via systemctl start kube-apiserver fails with:

Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied

But the API Server can be started from the command line:

/usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd-servers=http://k8s-master:2379 --address=0.0.0.0 --port=8080 --kubelet-port=10250 --allow-privileged=true --service-cluster-ip-range=10.254.0.0/16 --admission-control=ServiceAccount --insecure-bind-address=0.0.0.0 --client-ca-file=/root/keys/ca.crt --tls-cert-file=/root/keys/server.crt --tls-private-key-file=/root/keys/server.key --basic-auth-file=/root/keys/basic_auth.csv --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &
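
The command above references --basic-auth-file, which the key-generation steps never created. A minimal sketch of creating it, assuming the password,user,uid CSV format that kube-apiserver's basic-auth support expects:

# kube-apiserver basic-auth file: one "password,user,uid" entry per line
# (placeholder credentials - replace before use)
echo 'admin_password,admin,1' > /root/keys/basic_auth.csv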

Start the Controller-manager from the command line:

/usr/bin/kube-controller-manager --logtostderr=true --v=0 --master=http://k8s-master:8080 --root-ca-file=/root/keys/ca.crt --service-account-private-key-file=/root/keys/server.key &>> /var/log/kubernetes/kube-controller-manager.log &

 

etcd fails to start - problem <1>

etcd plays the same role for a Kubernetes cluster that ZooKeeper plays elsewhere: almost every service depends on etcd starting first, e.g. flanneld, apiserver, docker, and so on.

The error log when starting etcd is as follows:

May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
May 24 13:39:28 k8s-master etcd: name = master
May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
May 24 13:39:28 k8s-master etcd: election = 1000ms
May 24 13:39:28 k8s-master etcd: snapshot count = 10000
May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
May 24 13:39:29 k8s-master systemd: etcd.service failed.
May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.

The key line:

raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory

Go into the directory in question, delete 0.tmp, and etcd will start.
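
A minimal sketch of the fix, assuming the default data dir shown in the log (the error says 0.tmp is a directory, hence rm -rf):

# Remove the leftover temp entry blocking the WAL, then start etcd again
rm -rf /var/lib/etcd/default.etcd/member/wal/0.tmp
systemctl start etcd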

 

etcd fails to start - timeout problem <2>

Background: three etcd nodes were deployed, and one day all three machines lost power and went down. After restarting, the Kubernetes cluster worked normally, but a check of the components showed that etcd on one node would not start.

Investigation showed the system time was inaccurate. After correcting it with ntpdate ntp.aliyun.com and restarting etcd, it still failed to start, with the errors below:

Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member...
Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pem, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file =
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag

Solution:

The log shows no obviously fatal error. In practice, losing one etcd node does not have a large impact on the cluster, and the cluster was already usable; but the broken etcd node still would not start. The fix:

  1. Go into etcd's data directory and back up the existing data:

    cd /var/lib/etcd/default.etcd/member/

    cp * /data/bak/

  2. Delete all data files under this directory:

    rm -rf /var/lib/etcd/default.etcd/member/*

  3. Stop and restart etcd on the other two nodes as well, because the etcd members need to start up together; once they are up the cluster is usable again (a health-check sketch follows below).

# master node
systemctl stop etcd
systemctl restart etcd
# node1
systemctl stop etcd
systemctl restart etcd
# node2
systemctl stop etcd
systemctl restart etcd
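
Once all three members are back, it is worth confirming cluster health; a hedged sketch, assuming the v2 etcdctl API (the default for etcd 3.3) and the TLS file paths from the log above:

# Confirm all three members rejoined and report healthy
etcdctl --ca-file=/opt/etcd/ssl/ca.pem \
        --cert-file=/opt/etcd/ssl/server.pem \
        --key-file=/opt/etcd/ssl/server-key.pem \
        --endpoints=https://192.168.25.226:2379 cluster-health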

 

Configuring mutual SSH trust between hosts on CentOS

On each server, run the following as the user that needs mutual trust to generate the public/private key pair; just press Enter through the prompts:

ssh-keygen -t rsa

You will see the generated public key file.

Exchange public keys; you need to enter the password the first time, after which it is no longer required:

ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)

The -p flag specifies the port; omit it for the default port, and add it if the port has been changed. You can see that an authorized_keys file is created under .ssh/, recording the public keys of the other servers allowed to log in to this server.

Test whether login works:

ssh 192.168.199.132 (-p 2222)

 

Changing the hostname on CentOS

hostnamectl set-hostname k8s-master1

 

Enabling copy and paste in CentOS under VirtualBox

If a package does not install or produces no output, change update to install (or vice versa) and run again:

yum update
yum update kernel
yum update kernel-devel
yum install kernel-headers
yum install gcc
yum install gcc make

When these finish, run sh VBoxLinuxAdditions.run (see the sketch below).
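
VBoxLinuxAdditions.run lives on the Guest Additions ISO, so it has to be mounted first; a minimal sketch, assuming the ISO was inserted via the VirtualBox Devices menu:

# Mount the Guest Additions CD and run the installer
mkdir -p /mnt/cdrom
mount /dev/cdrom /mnt/cdrom
sh /mnt/cdrom/VBoxLinuxAdditions.run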

 

Deleting a Pod stuck in the Terminating state

It can be force-deleted with the following command:

kubectl delete pod NAME --grace-period=0 --force
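
If the force delete still hangs, the pod is usually pinned by a finalizer; a hedged sketch of clearing it (NAME is a placeholder, as above):

# Clear any finalizers holding the pod, after which it is removed immediately
kubectl patch pod NAME -p '{"metadata":{"finalizers":null}}'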

 

Deleting a namespace stuck in the Terminating state

It can be force-deleted with the following script:

[root@k8s-master1 k8s]# cat delete-ns.sh
#!/bin/bash
set -e
usage(){
    echo "usage:"
    echo " delete-ns.sh NAMESPACE"
}
if [ $# -lt 1 ];then
    usage
    exit 1
fi
NAMESPACE=$1
JSONFILE=${NAMESPACE}.json
kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
# Delete the entries under spec.finalizers in the editor, then save and quit
vi "${JSONFILE}"
curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFILE}" \
    http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize
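
The script PUTs to 127.0.0.1:8001, so it assumes an API proxy is running locally; a minimal usage sketch (stuck-namespace is a placeholder):

# In one terminal, expose the API server on localhost:8001
kubectl proxy &

# Force-finalize the stuck namespace
./delete-ns.sh stuck-namespace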

 

What problems can occur when a container has valid CPU/memory requests but no limits specified?

Below we create such a container, with only requests set and no limits:

- name: busybox-cnt02
  image: busybox
  command: ["/bin/sh"]
  args: ["-c", "while true; do echo hello from cnt02; sleep 10; done"]
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"

What problem can this container cause?

In a normal environment this is not a problem, but in a resource-constrained cluster, a container without limits can have its resources taken over by other pods, which may cause the containerized application to fail. This can be handled with a LimitRange policy so that such pods get limits assigned automatically, provided the LimitRange rules are configured in advance; a sketch follows below.
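
A hedged sketch of such a LimitRange, assuming a target namespace named demo; containers created there without explicit limits inherit these defaults:

# Apply a LimitRange that injects default requests/limits in the demo namespace
kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: demo
spec:
  limits:
  - type: Container
    default:           # limits applied when a container omits them
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:    # requests applied when a container omits them
      cpu: "100m"
      memory: "100Mi"
EOF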

 
