Create Affinity between Cassandra and Kubernetes
In my previous post, we looked at how to deploy Cassandra clusters in Kubernetes using CassKop. CassKop makes it easy to configure different topologies for Cassandra. In this post we continue with the theme of topology configuration. We will explore how we can leverage CassKop to ensure Cassandra pods run on particular Kubernetes nodes as well as in specific availability zones. CassKop makes this possible by using labels and affinity rules.
Labels in Kubernetes are key/value pairs that can be added to objects such as pods, services, and even the nodes on which pods run. Labels help organize, select, and filter objects in a Kubernetes cluster.
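As a quick illustration, labels live under an object's metadata. The names and values below are hypothetical:

```yaml
# Hypothetical example: labels attached to a pod's metadata
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod            # hypothetical name
  labels:
    app: my-app             # arbitrary key/value pairs
    environment: staging
spec:
  containers:
  - name: demo
    image: nginx
```

Label selectors then filter on these pairs, for example `kubectl get pods -l app=my-app`.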
Affinity rules are scheduling rules that help determine on what nodes pods should or must run. Conversely anti-affinity rules are scheduling rules that help determine on what nodes pods should not or must not run. For a more detailed overview, see the official docs.
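Schematically, both kinds of rules live under a pod spec's affinity field. Here is a minimal, hypothetical sketch (the disktype key and my-app label are made-up examples, not something CassKop generates):

```yaml
# Hypothetical pod spec fragment: a hard ("must") node affinity rule
# plus a soft ("should not") pod anti-affinity rule
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype          # only schedule on nodes labeled disktype=ssd
            operator: In
            values:
            - ssd
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app          # avoid nodes already running an app=my-app pod
          topologyKey: kubernetes.io/hostname
```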
For cloud-based deployments of Cassandra it is common to organize racks by availability zone. This is done to improve availability and fault tolerance. For example, let’s say we have a nine-node cluster spread across three racks, with each rack in a different availability zone. Let’s also suppose that the application keyspace(s) use replication_factor = 3. This gives us a full set of replicas per rack. The cluster can withstand the loss of an entire availability zone and still maintain strong consistency.
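That nine-node layout can be sketched as a CassandraCluster topology. This is a hypothetical manifest in the style CassKop expects (cluster name and zone names are assumptions for illustration):

```yaml
# Hypothetical sketch: one DC, three racks, one rack per zone,
# three Cassandra nodes per rack (nine nodes total)
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: multi-az
spec:
  nodesPerRacks: 3
  topology:
    dc:
    - name: dc1
      rack:
      - name: rack1
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-b
      - name: rack2
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-c
      - name: rack3
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-d
```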
With all that in mind, onto some examples!
Soft Anti-Affinity
We will start with an easy example and work progressively through more complex ones. This first example demonstrates a should, or soft, anti-affinity rule.
# soft-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: soft-anti-affinity
  labels:
    cluster: soft-anti-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi
Note: For this and the next example, I am using a three-node Kubernetes cluster that is running in a single availability zone.
Create the CassandraCluster with:
$ kubectl apply -f soft-anti-affinity.yaml
As discussed in my previous article, CassKop creates a StatefulSet which in turn will create and manage the Cassandra pods:
$ kubectl get statefulset -l cassandracluster=soft-anti-affinity
NAME READY AGE
soft-anti-affinity-dc1-rack1 3/3 16m
Let’s take a closer look at the StatefulSet soft-anti-affinity-dc1-rack1:
$ kubectl get statefulset soft-anti-affinity-dc1-rack1 -o yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-08-10T03:04:32Z"
  generation: 1
  labels:
    app: cassandracluster
    cassandracluster: soft-anti-affinity
    cassandraclusters.db.orange.com.dc: dc1
    cassandraclusters.db.orange.com.rack: rack1
    cluster: soft-anti-affinity
    dc-rack: dc1-rack1
  name: soft-anti-affinity-dc1-rack1
  namespace: default
  ownerReferences:
  - apiVersion: db.orange.com/v1alpha1
    controller: true
    kind: CassandraCluster
    name: soft-anti-affinity
    uid: 9373658d-bb1b-11e9-8455-42010a8e0041
  resourceVersion: "14661"
  selfLink: /apis/apps/v1/namespaces/default/statefulsets/soft-anti-affinity-dc1-rack1
  uid: 939d5e1f-bb1b-11e9-8455-42010a8e0041
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: cassandracluster
      cassandracluster: soft-anti-affinity
      cassandraclusters.db.orange.com.dc: dc1
      cassandraclusters.db.orange.com.rack: rack1
      cluster: soft-anti-affinity
      dc-rack: dc1-rack1
  serviceName: soft-anti-affinity
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: soft-anti-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: soft-anti-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: soft-anti-affinity
                  cluster: soft-anti-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - env:
        - name: CASSANDRA_MAX_HEAP
          value: 125M
        - name: CASSANDRA_SEEDS
          value: soft-anti-affinity-dc1-rack1-0.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-1.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-2.soft-anti-affinity.default
        - name: CASSANDRA_CLUSTER_NAME
          value: soft-anti-affinity
        - name: CASSANDRA_AUTO_BOOTSTRAP
          value: "true"
        - name: CASSANDRA_AUTHENTICATOR
          value: PasswordAuthenticator
        - name: CASSANDRA_AUTHORIZER
          value: CassandraAuthorizer
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SERVICE_NAME
          value: soft-anti-affinity-dc1-rack1
        - name: CASSANDRA_GC_STDOUT
          value: "false"
        - name: CASSANDRA_NUM_TOKENS
          value: "256"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CASSANDRA_DC
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.dc']
        - name: CASSANDRA_RACK
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.rack']
        image: orangeopensource/cassandra-image:latest
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - /etc/cassandra/pre_stop.sh
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - nodetool status
          failureThreshold: 3
          initialDelaySeconds: 120
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 20
        name: cassandra
        ports:
        - containerPort: 7000
          name: intra-node
          protocol: TCP
        - containerPort: 7001
          name: intra-node-tls
          protocol: TCP
        - containerPort: 7199
          name: jmx-port
          protocol: TCP
        - containerPort: 9042
          name: cql
          protocol: TCP
        - containerPort: 1234
          name: promjmx
          protocol: TCP
        - containerPort: 8778
          name: jolokia
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: '1'
            memory: 1Gi
          requests:
            cpu: '1'
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
          procMount: Default
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1
        runAsNonRoot: true
        runAsUser: 1000
      terminationGracePeriodSeconds: 1800
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updateRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  updatedReplicas: 3
There is a lot here, but we just want to focus on the spec.template.spec.affinity property:
template:
  metadata:
    creationTimestamp: null
    labels:
      app: cassandracluster
      cassandracluster: soft-anti-affinity
      cassandraclusters.db.orange.com.dc: dc1
      cassandraclusters.db.orange.com.rack: rack1
      cluster: soft-anti-affinity
      dc-rack: dc1-rack1
  spec:
    affinity:
      nodeAffinity: {}
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: cassandracluster
                cassandracluster: soft-anti-affinity
                cluster: soft-anti-affinity
            topologyKey: kubernetes.io/hostname
          weight: 100
For the soft-anti-affinity cluster, CassKop did not specify any node affinity rules. This is evident from the fact that the nodeAffinity property is empty.
CassKop does, however, declare anti-affinity rules by setting the podAntiAffinity property. The preferredDuringSchedulingIgnoredDuringExecution rule says that a pod should not be scheduled on a node if it is already running a pod whose labels match the label selector. In other words, the Kubernetes scheduler should not schedule two Cassandra pods from soft-anti-affinity on the same node, but it will put them on the same node if there are insufficient resources to run them on separate nodes.
topologyKey essentially defines a topology domain. With kubernetes.io/hostname as the key, each individual node is its own domain, so the matching specified by the label selector is applied per node.
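To illustrate, changing the topologyKey changes the granularity at which pods repel each other. This hypothetical variant (not what CassKop generates) would spread Cassandra pods across availability zones rather than individual hosts:

```yaml
# Hypothetical: soft anti-affinity at zone granularity
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchLabels:
          app: cassandracluster
      # each zone, rather than each host, is a topology domain
      topologyKey: failure-domain.beta.kubernetes.io/zone
    weight: 100
```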
This type of scheduling rule can be convenient for development/testing scenarios in which it may be acceptable to run multiple Cassandra instances on the same Kubernetes nodes.
Note that while this anti-affinity rule results in the scheduler making a best effort to place pods from soft-anti-affinity onto separate nodes, it says nothing about pods from other Cassandra clusters. For example, let’s say we create another CassandraCluster, another-cluster. The scheduler will happily place pods from soft-anti-affinity and another-cluster on the same node.
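If cross-cluster repulsion were desired, one could imagine a broader label selector that matches only the shared app label. This is a hypothetical rule for illustration, not one that CassKop generates:

```yaml
# Hypothetical: repel Cassandra pods from ANY CassKop-managed cluster
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchLabels:
          # every CassKop cluster's pods carry this label
          app: cassandracluster
      topologyKey: kubernetes.io/hostname
    weight: 100
```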
Let’s check the cluster status with nodetool:
$ kubectl exec soft-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN soft-anti-affinity-dc1-rack1-2 rack1
UN soft-anti-affinity-dc1-rack1-0 rack1
UN soft-anti-affinity-dc1-rack1-1 rack1
There are a few things to note. First, the Cassandra cluster is healthy. All three Cassandra nodes are in the Up/Normal state.
Second, the command applies a fair amount of formatting. The motivation is to make the output more readable by focusing only on the most relevant parts.
Lastly, we use the -r flag with nodetool status in order to display the domain name, which is also the pod name. This will help with determining on which Kubernetes node each Cassandra pod is running.
Now let’s query for pods:
$ kubectl get pods -l cassandracluster=soft-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME NODE
soft-anti-affinity-dc1-rack1-0 gke-cass-dev-default-pool-daf646ab-j2m3
soft-anti-affinity-dc1-rack1-1 gke-cass-dev-default-pool-daf646ab-5c6n
soft-anti-affinity-dc1-rack1-2 gke-cass-dev-default-pool-daf646ab-sjbh
Each pod is running on a different node. If we add another instance to the soft-anti-affinity cluster, Kubernetes will schedule the pod on one of the nodes, as long as there are sufficient resources.
Hard Anti-Affinity
In this next example, we are going to make use of the hardAntiAffinity property of CassandraCluster for the first time:
# hard-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-anti-affinity
  labels:
    cluster: hard-anti-affinity
spec:
  nodesPerRacks: 3
  hardAntiAffinity: true
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi
hardAntiAffinity defaults to false. When set to true, CassKop will generate a must or hard anti-affinity rule.
CassKop creates a StatefulSet named hard-anti-affinity-dc1-rack1. Let’s take a look at its spec.template.spec.affinity property:
affinity:
  nodeAffinity: {}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cassandracluster
          cassandracluster: hard-anti-affinity
          cluster: hard-anti-affinity
      topologyKey: kubernetes.io/hostname
This time we have a requiredDuringSchedulingIgnoredDuringExecution rule, which is a hard requirement. The scheduler will not run two pods from hard-anti-affinity on the same node.
Let’s check the cluster status with nodetool:
$ kubectl exec hard-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN hard-anti-affinity-dc1-rack1-2 rack1
UN hard-anti-affinity-dc1-rack1-0 rack1
UN hard-anti-affinity-dc1-rack1-1 rack1
The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. Now let’s query for pods:
$ kubectl get pods -l cassandracluster=hard-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME NODE
hard-anti-affinity-dc1-rack1-0 gke-cass-dev-default-pool-daf646ab-j2m3
hard-anti-affinity-dc1-rack1-1 gke-cass-dev-default-pool-daf646ab-5c6n
hard-anti-affinity-dc1-rack1-2 gke-cass-dev-default-pool-daf646ab-sjbh
Each pod is running on a different node.
Let’s see what happens if we try to add a fourth Cassandra node. We can do this by changing the spec.nodesPerRacks property in hard-anti-affinity.yaml to 4 and then running:
$ kubectl apply -f hard-anti-affinity.yaml
Check the status of the CassandraCluster:
$ kubectl get cassandracluster hard-anti-affinity -o yaml
apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
...
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: ScaleUp
        startTime: "2019-08-12T01:57:23Z"
        status: Ongoing
      phase: Pending
      podLastOperation: {}
  lastClusterAction: ScaleUp
  lastClusterActionStatus: Ongoing
  phase: Pending
  seedlist:
  - hard-anti-affinity-dc1-rack1-0.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-1.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-2.hard-anti-affinity.default
The status reports phase: Pending. The new pod has been created but not scheduled. Let’s take a look at the new pod:
$ kubectl get pod hard-anti-affinity-dc1-rack1-3 -o yaml
apiVersion: v1
kind: Pod
...
status:
  conditions:
  - lastProbeTime: "2019-08-12T02:15:45Z"
    lastTransitionTime: "2019-08-12T02:14:41Z"
    message: '0/3 nodes are available: 3 node(s) didn''t match pod affinity/anti-affinity,
      3 node(s) didn''t satisfy existing pods anti-affinity rules.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Guaranteed
The message tells us that no node satisfies the anti-affinity rule; therefore, the hard-anti-affinity-dc1-rack1-3 pod cannot run.
Hard Affinity
We are going to create a CassandraCluster that consists of a single data center with two racks. Each rack will be placed in a separate availability zone.
# hard-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-affinity
  labels:
    cluster: hard-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi
  topology:
    dc:
    - name: dc1
      rack:
      - name: rack1
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-b
      - name: rack2
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-c
This is the first time I have used the labels property of a rack. Any label can be specified; however, the label used here is not arbitrary.
Nodes in Kubernetes come pre-populated with a handful of labels, one of which is failure-domain.beta.kubernetes.io/zone. We are using the rack labels to tell CassKop to place rack1 in zone us-east1-b and rack2 in zone us-east1-c.
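On a multi-zone GKE cluster, a node's metadata carries labels along these lines (an abbreviated, illustrative sketch; the exact label set varies by Kubernetes version and provider):

```yaml
# Abbreviated, illustrative node object
apiVersion: v1
kind: Node
metadata:
  name: gke-cass-dev-default-pool-daf646ab-j2m3
  labels:
    kubernetes.io/hostname: gke-cass-dev-default-pool-daf646ab-j2m3
    failure-domain.beta.kubernetes.io/zone: us-east1-b
    failure-domain.beta.kubernetes.io/region: us-east1
```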
CassKop will create two StatefulSets for hard-affinity, one for each rack. Let’s take a look at the spec.template.spec.affinity property for one of the StatefulSets, hard-affinity-dc1-rack1:
template:
  metadata:
    creationTimestamp: null
    labels:
      app: cassandracluster
      cassandracluster: hard-affinity
      cassandraclusters.db.orange.com.dc: dc1
      cassandraclusters.db.orange.com.rack: rack1
      cluster: hard-affinity
      dc-rack: dc1-rack1
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
              - us-east1-b
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                app: cassandracluster
                cassandracluster: hard-affinity
                cluster: hard-affinity
            topologyKey: kubernetes.io/hostname
          weight: 100
CassKop has declared a requiredDuringSchedulingIgnoredDuringExecution node affinity rule which says that pods must be scheduled on nodes in zone us-east1-b. We also have the anti-affinity rule that tells the scheduler to make a best effort to schedule pods on separate nodes.
CassKop creates a similar affinity rule for hard-affinity-dc1-rack2, but the value in the match expression is set to us-east1-c.
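Based on the rack1 rule, the rack2 StatefulSet's node affinity presumably looks like this (reconstructed for illustration, not captured from the cluster):

```yaml
# Reconstructed: node affinity generated for hard-affinity-dc1-rack2
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        operator: In
        values:
        - us-east1-c
```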
Let’s check the cluster status with nodetool:
$ kubectl exec hard-affinity-dc1-rack2-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 25)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Join
-- Address ID
UN hard-affinity-dc1-rack2-1 rack2
UN hard-affinity-dc1-rack1-2 rack1
UN hard-affinity-dc1-rack2-0 rack2
UN hard-affinity-dc1-rack1-1 rack1
UN hard-affinity-dc1-rack2-2 rack2
UN hard-affinity-dc1-rack1-0 rack1
The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. We have six Cassandra nodes spread across two racks.
The nodetool output does not, however, provide enough information to confirm that pods are running in the desired zones. Let’s inspect a bit more to verify that the pods are running where we expect them to be.
The Kubernetes cluster used for this example is made up of the following nodes:
Node                                              | Zone
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-018f2d3d-0lbl | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-6lnk | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-7760 | us-east1-d
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-112f0d40-564v | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-db52 | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-fdmg | us-east1-c
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-8060f7a0-j6sl | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-qgl2 | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-xm4t | us-east1-b
Although this cluster has 3 nodes in us-east1-d, no Cassandra pods should be running in that zone.
Now let’s query for pods:
$ kubectl get pods -l cassandracluster=hard-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME NODE
hard-affinity-dc1-rack1-0 gke-cass-dev-multizone-default-pool-8060f7a0-j6sl
hard-affinity-dc1-rack1-1 gke-cass-dev-multizone-default-pool-8060f7a0-xm4t
hard-affinity-dc1-rack1-2 gke-cass-dev-multizone-default-pool-8060f7a0-qgl2
hard-affinity-dc1-rack2-0 gke-cass-dev-multizone-default-pool-112f0d40-db52
hard-affinity-dc1-rack2-1 gke-cass-dev-multizone-default-pool-112f0d40-fdmg
hard-affinity-dc1-rack2-2 gke-cass-dev-multizone-default-pool-112f0d40-564v
The pods in rack1 are all running on nodes in zone us-east1-b, and the pods in rack2 are running on nodes in zone us-east1-c. All pods are running where we expect them to be running!
Conclusion
Different types of applications have different performance profiles and different resource requirements. Cassandra boasts high performance, but that largely depends on the CPU, memory, and I/O resources available to it. CassKop provides a flexible and intuitive way to make sure Cassandra runs on specific Kubernetes nodes. These scheduling capabilities are essential for running Cassandra effectively in Kubernetes for production workloads.