1. Principle
Kubernetes affinity and anti-affinity settings control how Pods are scheduled across nodes.
Affinity: places multiple Pods on the same node, so traffic between their containers does not need to be routed across nodes, reducing network overhead.
Anti-affinity: spreads multiple Pods across different nodes to satisfy disaster-recovery and redundancy requirements; for example, if node 1 goes down, the Pods on node 2 can keep serving traffic. (Kubernetes will also automatically find a node and reschedule a workload when its Pod dies.)
Affinity and anti-affinity rules come in two flavors:
Must be satisfied: a hard constraint, expressed with requiredDuringSchedulingIgnoredDuringExecution.
Satisfied if possible: a soft constraint, expressed with preferredDuringSchedulingIgnoredDuringExecution.
Beyond that, there are a few key concepts (a combined example follows this list):
● Topology domain: i.e. topologyKey. A topology domain is defined by labels on worker nodes (default or custom labels) and determines the scope a rule applies to at scheduling time. For example, topologyKey: failure-domain.beta.kubernetes.io/zone scopes the rule to an availability zone (AZ), while topologyKey: kubernetes.io/hostname scopes it to a single host.
● Selector: corresponds to matchExpressions. Multiple expressions can be added; they are combined with AND, i.e. all of them must match for the rule to take effect.
● Label key: the label on the application workload; you can use the default app label or a custom label.
● Operator: four operators are supported (In, NotIn, Exists, DoesNotExist). In and NotIn take one or more values (multiple values are separated with ;), while Exists and DoesNotExist only check whether a label is present and take no values.
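Putting these pieces together, the following is a minimal sketch of a hard anti-affinity rule inside a Pod template that spreads Pods across hosts (the label value my-app is a placeholder, not one of the names used later in this article):

affinity:
  podAntiAffinity:
    # Hard constraint: the scheduler must satisfy this rule.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:            # multiple expressions would be AND-ed
            - key: app                 # label key on the target Pods
              operator: In             # In / NotIn / Exists / DoesNotExist
              values:
                - my-app               # placeholder label value
        topologyKey: kubernetes.io/hostname   # topology domain: one rule scope per host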
2. Solution
2.1 Prerequisites
Make sure the nodes managed by the Kubernetes cluster are spread across different availability zones (at least 2 AZs). For example, with 4 nodes, place 2 in AZ1 and the other 2 in AZ2.
After a node has been added to the cluster, you can inspect its labels:
➜ kubectl describe node
Name:    10.x.154.106
Roles:   <none>
Labels:  beta.kubernetes.io/arch=amd64
         beta.kubernetes.io/instance-type=Si3.xlarge.2
         beta.kubernetes.io/os=linux
         failure-domain.beta.kubernetes.io/region=cn-southwest-1
         failure-domain.beta.kubernetes.io/zone=cn-southwest-1a
         kubernetes.io/arch=amd64
         kubernetes.io/hostname=10.x.154.106
         kubernetes.io/os=linux
         node.kubernetes.io/baremetal=false
         node.kubernetes.io/container-engine=docker
         node.kubernetes.io/instance-type=Si3.xlarge.2
         node.kubernetes.io/subnetid=0b809883-9c99-4b67-a6ca-5c1a948a5def
         os.architecture=amd64
         os.name=EulerOS_2.0_SP9x86_64
         os.version=4.18.0-147.5.1.6.h541.eulerosv2r9.x86_64
         topology.kubernetes.io/region=cn-southwest-1
         topology.kubernetes.io/zone=cn-southwest-1a
Among these labels:
Node region:
Key: failure-domain.beta.kubernetes.io/region
Value: cn-southwest-1
Node availability zone:
Key: failure-domain.beta.kubernetes.io/zone
Value: cn-southwest-1a (the suffix a means AZ1, b means AZ2, c means AZ3)
Node hostname:
Key: kubernetes.io/hostname
Value: 10.x.154.106
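To check at a glance how the nodes are distributed across AZs, you can, for example, list them with the zone label printed as an extra column (the -L flag of kubectl get adds a label column):

➜ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone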
2.2 Anti-affinity configuration
Create a stateless workload (Deployment) with anti-affinity configured; here both rules use the soft constraint ("satisfied if possible").
After it is created, the 7 replicas end up spread across the 4 nodes in the two AZs.
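In YAML terms, what that console configuration roughly corresponds to is the following affinity block in the Pod template (a sketch only: the app label value and the weights are placeholders, and the surrounding Deployment is omitted):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - your-app        # placeholder label value
          topologyKey: failure-domain.beta.kubernetes.io/zone   # prefer spreading across AZs
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - your-app        # placeholder label value
          topologyKey: kubernetes.io/hostname                   # prefer spreading across hosts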
Next, try a hard constraint for spreading across nodes and a soft constraint for spreading across AZs, this time defined in YAML. Create a pod-antiaffinity.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-pod-affinity-test
spec:
  selector:
    matchLabels:
      app: nginx-pod-affinity-test
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx-pod-affinity-test
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - nginx-pod-affinity-test
                topologyKey: failure-domain.beta.kubernetes.io/zone
              weight: 50
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx-pod-affinity-test
              topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx-pod-affinity-test
          image: xxxx.xxx.xxx.com:80/public/nginx:x86-1.20.1
          imagePullPolicy: IfNotPresent
The replica count here is 5, but the cluster has only 4 worker nodes and 2 AZs. Apply the manifest and observe the result.
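Assuming the manifest is saved as pod-antiaffinity.yaml (the file name defined above), the apply step is:

➜ kubectl apply -f pod-antiaffinity.yaml

Then check the Deployment status: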
➜ kubectl get deploy
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
nginx-pod-affinity-test   4/5     5            4           33m
Only 4 of the 5 desired replicas came up. Check the Pods:
➜ kubectl get po -l app=nginx-pod-affinity-test
NAME                                       READY   STATUS    RESTARTS   AGE
nginx-pod-affinity-test-688d8f5fb5-44h4h   1/1     Running   0          35m
nginx-pod-affinity-test-688d8f5fb5-9sddx   0/1     Pending   0          35m
nginx-pod-affinity-test-688d8f5fb5-f2qxn   1/1     Running   0          35m
nginx-pod-affinity-test-688d8f5fb5-nbzft   1/1     Running   0          35m
nginx-pod-affinity-test-688d8f5fb5-zfdqb   1/1     Running   0          35m
➜ kubectl describe po nginx-pod-affinity-test-688d8f5fb5-9sddx
Name:           nginx-pod-affinity-test-688d8f5fb5-9sddx
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=nginx-pod-affinity-test
                pod-template-hash=688d8f5fb5
Annotations:    kubernetes.io/psp: psp-global
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/nginx-pod-affinity-test-688d8f5fb5
Containers:
  nginx-pod-affinity-test:
    Image:        xxxx.xxx.xxx.com:80/public/nginx:x86-1.20.1
    Port:         <none>
    Host Port:    <none>
    Environment:
      VMPORT:  []
      VMIP:     (v1:status.hostIP)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-47g7m (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-47g7m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-47g7m
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   36m                 default-scheduler   0/4 nodes are available: 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling   36m                 default-scheduler   0/4 nodes are available: 4 node(s) didn't match pod affinity/anti-affinity, 4 node(s) didn't match pod anti-affinity rules.
  Normal   NotTriggerScaleUp  2s (x217 over 36m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added):
The 5th Pod cannot be scheduled: the policy above uses a hard constraint that forces Pods onto separate nodes, so 5 Pods would need 5 nodes, and the cluster only has 4, so the fifth one can never be placed. Change the node-level rule to a soft constraint as well:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-pod-affinity-test
spec:
  selector:
    matchLabels:
      app: nginx-pod-affinity-test
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx-pod-affinity-test
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - nginx-pod-affinity-test
                topologyKey: failure-domain.beta.kubernetes.io/zone
              weight: 50
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - nginx-pod-affinity-test
                topologyKey: kubernetes.io/hostname
              weight: 50
      containers:
        - name: nginx-pod-affinity-test
          image: xxxx.xxx.xxx.com:80/public/nginx:x86-1.20.1
          imagePullPolicy: IfNotPresent
Deploy it and check the Pods:
➜ kubectl apply -f pod-affinity/pod-affinity.yaml
deployment.apps/nginx-pod-affinity-test created
➜ kubectl get deploy
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
nginx-pod-affinity-test   5/5     5            5           5s
➜ kubectl get po -l app=nginx-pod-affinity-test -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
nginx-pod-affinity-test-8545c897c6-5ckjt   1/1     Running   0          93s   12.11.0.78    10.x.154.106   <none>           <none>
nginx-pod-affinity-test-8545c897c6-6svzj   1/1     Running   0          93s   12.11.0.122   10.x.154.161   <none>           <none>
nginx-pod-affinity-test-8545c897c6-hz2dp   1/1     Running   0          93s   12.11.0.57    10.x.154.39    <none>           <none>
nginx-pod-affinity-test-8545c897c6-ltc8w   1/1     Running   0          93s   12.11.0.104   10.x.154.147   <none>           <none>
nginx-pod-affinity-test-8545c897c6-q2m9h   1/1     Running   0          93s   12.11.0.58    10.x.154.39    <none>           <none>
The 5 replicas are now spread across the 4 nodes in the two AZs (one node, 10.x.154.39, runs two of them). Now change the replica count to 12 and see what happens.
➜ kubectl edit deploy nginx-pod-affinity-test
deployment.apps/nginx-pod-affinity-test edited
➜ kubectl get po -l app=nginx-pod-affinity-test -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
nginx-pod-affinity-test-8545c897c6-5ckjt   1/1     Running   0          6m26s   12.11.0.78    10.x.154.106   <none>           <none>
nginx-pod-affinity-test-8545c897c6-6svzj   1/1     Running   0          6m26s   12.11.0.122   10.x.154.161   <none>           <none>
nginx-pod-affinity-test-8545c897c6-bgt4c   1/1     Running   0          50s     12.11.0.84    10.x.154.106   <none>           <none>
nginx-pod-affinity-test-8545c897c6-f7hbt   1/1     Running   0          50s     12.11.0.128   10.x.154.147   <none>           <none>
nginx-pod-affinity-test-8545c897c6-hz2dp   1/1     Running   0          6m26s   12.11.0.57    10.x.154.39    <none>           <none>
nginx-pod-affinity-test-8545c897c6-j4tlk   1/1     Running   0          50s     12.11.0.61    10.x.154.39    <none>           <none>
nginx-pod-affinity-test-8545c897c6-l7fvd   1/1     Running   0          50s     12.11.0.125   10.x.154.161   <none>           <none>
nginx-pod-affinity-test-8545c897c6-ltc8w   1/1     Running   0          6m26s   12.11.0.104   10.x.154.147   <none>           <none>
nginx-pod-affinity-test-8545c897c6-t62h7   1/1     Running   0          50s     12.11.0.127   10.x.154.161   <none>           <none>
nginx-pod-affinity-test-8545c897c6-w4rws   1/1     Running   0          50s     12.11.0.126   10.x.154.161   <none>           <none>
nginx-pod-affinity-test-8545c897c6-wbgvc   1/1     Running   0          50s     12.11.0.60    10.x.154.39    <none>           <none>
nginx-pod-affinity-test-8545c897c6-wtvvx   1/1     Running   0          50s     12.11.0.102   10.x.154.147   <none>           <none>
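As an alternative to kubectl edit, the same change could have been made with kubectl scale, for example:

➜ kubectl scale deploy nginx-pod-affinity-test --replicas=12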
The Pod distribution across the four nodes is now:

AZ    Node IP        Pod count
AZ1   10.x.154.147   3
AZ1   10.x.154.106   2
AZ2   10.x.154.161   4
AZ2   10.x.154.39    3
You are welcome to follow the WeChat official account 梅旭红, which records this technical journey and is updated from time to time.