Create Affinity between Cassandra and Kubernetes

John Sanda
11 min read · Aug 12, 2019

In my previous post, we looked at how to deploy Cassandra clusters in Kubernetes using CassKop. CassKop makes it easy to configure different topologies for Cassandra. In this post we continue with the theme of topology configuration. We will explore how to leverage CassKop to make sure Cassandra pods run on particular Kubernetes nodes and in specific availability zones. CassKop makes this possible by using labels and affinity rules.

Labels in Kubernetes are key/value pairs that can be added to objects such as pods, services, and even the nodes on which pods run. Labels help organize, select, and filter objects in a Kubernetes cluster.
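For example, we could attach a custom label to a node and then select only the nodes that carry it (the disktype=ssd label and the <node-name> placeholder are purely illustrative):

$ kubectl label node <node-name> disktype=ssd

$ kubectl get nodes -l disktype=ssd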

Affinity rules are scheduling rules that help determine on what nodes pods should or must run. Conversely anti-affinity rules are scheduling rules that help determine on what nodes pods should not or must not run. For a more detailed overview, see the official docs.

For cloud-based deployments of Cassandra it is common to organize racks by availability zone. This is done to improve availability and fault tolerance. For example, let’s say we have a nine node cluster spread across three racks, with each rack in a different availability zone. Let’s also suppose that the application keyspace(s) use replication_factor = 3. This gives us a full copy of the data in each rack. The cluster can withstand the loss of an entire availability zone and still maintain strong consistency.
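As a sketch, such a keyspace could be created like this (the keyspace name app is a placeholder; dc1 matches the data center name used in the examples below):

CREATE KEYSPACE app
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};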

With all that in mind, onto some examples!

Soft Anti-affinity

We will start with an easy example and work progressively through more complex ones. This first example demonstrates a should or soft anti-affinity rule.

# soft-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: soft-anti-affinity
  labels:
    cluster: soft-anti-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi

Note: For this and the next example, I am using a three-node Kubernetes cluster that is running in a single availability zone.

Create the CassandraCluster with:

$ kubectl apply -f soft-anti-affinity.yaml

As discussed in my previous article, CassKop creates a StatefulSet which in turn will create and manage the Cassandra pods:

$ kubectl get statefulset -l cassandracluster=soft-anti-affinity
NAME                           READY   AGE
soft-anti-affinity-dc1-rack1   3/3     16m

Let’s take a closer look at the StatefulSet soft-anti-affinity-dc1-rack1:

$ kubectl get statefulset soft-anti-affinity-dc1-rack1 -o yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-08-10T03:04:32Z"
  generation: 1
  labels:
    app: cassandracluster
    cassandracluster: soft-anti-affinity
    cassandraclusters.db.orange.com.dc: dc1
    cassandraclusters.db.orange.com.rack: rack1
    cluster: soft-anti-affinity
    dc-rack: dc1-rack1
  name: soft-anti-affinity-dc1-rack1
  namespace: default
  ownerReferences:
  - apiVersion: db.orange.com/v1alpha1
    controller: true
    kind: CassandraCluster
    name: soft-anti-affinity
    uid: 9373658d-bb1b-11e9-8455-42010a8e0041
  resourceVersion: "14661"
  selfLink: /apis/apps/v1/namespaces/default/statefulsets/soft-anti-affinity-dc1-rack1
  uid: 939d5e1f-bb1b-11e9-8455-42010a8e0041
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: cassandracluster
      cassandracluster: soft-anti-affinity
      cassandraclusters.db.orange.com.dc: dc1
      cassandraclusters.db.orange.com.rack: rack1
      cluster: soft-anti-affinity
      dc-rack: dc1-rack1
  serviceName: soft-anti-affinity
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: soft-anti-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: soft-anti-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: soft-anti-affinity
                  cluster: soft-anti-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - env:
        - name: CASSANDRA_MAX_HEAP
          value: 125M
        - name: CASSANDRA_SEEDS
          value: soft-anti-affinity-dc1-rack1-0.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-1.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-2.soft-anti-affinity.default
        - name: CASSANDRA_CLUSTER_NAME
          value: soft-anti-affinity
        - name: CASSANDRA_AUTO_BOOTSTRAP
          value: "true"
        - name: CASSANDRA_AUTHENTICATOR
          value: PasswordAuthenticator
        - name: CASSANDRA_AUTHORIZER
          value: CassandraAuthorizer
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SERVICE_NAME
          value: soft-anti-affinity-dc1-rack1
        - name: CASSANDRA_GC_STDOUT
          value: "false"
        - name: CASSANDRA_NUM_TOKENS
          value: "256"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CASSANDRA_DC
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.dc']
        - name: CASSANDRA_RACK
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.rack']
        image: orangeopensource/cassandra-image:latest
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - /etc/cassandra/pre_stop.sh
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - nodetool status
          failureThreshold: 3
          initialDelaySeconds: 120
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 20
        name: cassandra
        ports:
        - containerPort: 7000
          name: intra-node
          protocol: TCP
        - containerPort: 7001
          name: intra-node-tls
          protocol: TCP
        - containerPort: 7199
          name: jmx-port
          protocol: TCP
        - containerPort: 9042
          name: cql
          protocol: TCP
        - containerPort: 1234
          name: promjmx
          protocol: TCP
        - containerPort: 8778
          name: jolokia
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: '1'
            memory: 1Gi
          requests:
            cpu: '1'
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
          procMount: Default
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1
        runAsNonRoot: true
        runAsUser: 1000
      terminationGracePeriodSeconds: 1800
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updateRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  updatedReplicas: 3

There is a lot here, but we just want to focus on one part, namely the spec.template.spec.affinity property:

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: soft-anti-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: soft-anti-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: soft-anti-affinity
                  cluster: soft-anti-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100

For the soft-anti-affinity cluster, CassKop did not specify any node affinity rules. This is evident from the fact that the nodeAffinity property is empty.

CassKop does however declare anti-affinity rules by setting the podAntiAffinity property. The preferredDuringSchedulingIgnoredDuringExecution rule says that a pod should not be scheduled on a node if it is already running a pod that has labels that match the label selector. In other words, the Kubernetes scheduler should not schedule two Cassandra pods from soft-anti-affinity on the same node, but it will put them on the same node if there are insufficient resources to run them on separate nodes.

topologyKey defines the topology domain over which the rule is evaluated. Because the key here is kubernetes.io/hostname, each node is its own domain, so the label matching is applied per node and the scheduler tries to spread matching pods across nodes.
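As an aside, swapping the topology key changes what "spread out" means. The sketch below, which is not what CassKop generates for this cluster, shows the same soft rule applied at the zone level instead of the node level:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app: cassandracluster
          cassandracluster: soft-anti-affinity
      topologyKey: failure-domain.beta.kubernetes.io/zone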

A soft anti-affinity rule like this can be convenient for development/testing scenarios in which it may be acceptable to run multiple Cassandra instances on the same Kubernetes node.

Note that while this anti-affinity rule results in the scheduler making a best effort to place pods from soft-anti-affinity onto separate nodes, it says nothing about pods from other Cassandra clusters. For example, let’s say we create another CassandraCluster, another-cluster. The scheduler will happily place pods from soft-anti-affinity and another-cluster on the same node.
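One way to see this for yourself (assuming another-cluster has actually been created) is to list every CassKop-managed pod along with the node it landed on; they all carry the app=cassandracluster label:

$ kubectl get pods -l app=cassandracluster -o wide | awk {'print $1" "$7'} | column -t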

Let’s check the cluster status with nodetool:

$ kubectl exec soft-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN soft-anti-affinity-dc1-rack1-2 rack1
UN soft-anti-affinity-dc1-rack1-0 rack1
UN soft-anti-affinity-dc1-rack1-1 rack1

There are a few things to note. First, the Cassandra cluster is healthy. All three Cassandra nodes are in the Up/Normal state.

Secondly, there is a fair amount of formatting applied to the output. The motivation is to make the output more readable by focusing only on the most relevant parts.

Lastly, we use the -r flag with nodetool status in order to display the domain name, which includes the pod name. This will help with determining on which Kubernetes node each Cassandra pod is running.

Now let’s query for pods:

$ kubectl get pods -l cassandracluster=soft-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                            NODE
soft-anti-affinity-dc1-rack1-0  gke-cass-dev-default-pool-daf646ab-j2m3
soft-anti-affinity-dc1-rack1-1  gke-cass-dev-default-pool-daf646ab-5c6n
soft-anti-affinity-dc1-rack1-2  gke-cass-dev-default-pool-daf646ab-sjbh

Each pod is running on a different node. If we add another instance to the soft-anti-affinity cluster, Kubernetes will schedule the pod on one of the nodes, as long as there are sufficient resources.
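If you want to try it yourself, a minimal sketch (mirroring what we do with the hard anti-affinity cluster in the next section) is to bump nodesPerRacks from 3 to 4 in soft-anti-affinity.yaml and re-apply:

$ kubectl apply -f soft-anti-affinity.yaml

Because the rule is only a preference, the fourth pod will land on a node that already hosts a Cassandra pod, provided that node has enough free CPU and memory.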

Hard Anti-Affinity

In this next example, we are going to make use of the hardAntiAffinity property of CassandraCluster for the first time:

# hard-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-anti-affinity
  labels:
    cluster: hard-anti-affinity
spec:
  nodesPerRacks: 3
  hardAntiAffinity: true
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi

hardAntiAffinity defaults to false. When set to true CassKop will generate a must or hard anti-affinity rule.

CassKop creates a StatefulSet named hard-anti-affinity-dc1-rack1. Let’s take a look at its spec.template.spec.affinity property:

affinity:
  nodeAffinity: {}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cassandracluster
          cassandracluster: hard-anti-affinity
          cluster: hard-anti-affinity
      topologyKey: kubernetes.io/hostname

This time we have a requiredDuringSchedulingIgnoredDuringExecution rule, which is a hard requirement. The scheduler will not run two pods from hard-anti-affinity on the same node.

Let’s check the cluster status with nodetool:

$ kubectl exec hard-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN hard-anti-affinity-dc1-rack1-2 rack1
UN hard-anti-affinity-dc1-rack1-0 rack1
UN hard-anti-affinity-dc1-rack1-1 rack1

The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. Now let’s query for pods:

$ kubectl get pods -l cassandracluster=hard-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                            NODE
hard-anti-affinity-dc1-rack1-0  gke-cass-dev-default-pool-daf646ab-j2m3
hard-anti-affinity-dc1-rack1-1  gke-cass-dev-default-pool-daf646ab-5c6n
hard-anti-affinity-dc1-rack1-2  gke-cass-dev-default-pool-daf646ab-sjbh

Each pod is running on a different node.

Let’s see what happens if we try to add a fourth Cassandra node. We can do this by changing the spec.nodesPerRacks property in hard-anti-affinity.yaml to 4 and then running:

$ kubectl apply -f hard-anti-affinity.yaml

Check the status of the CassandraCluster:

$ kubectl get cassandracluster hard-anti-affinity -o yaml
apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
...
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: ScaleUp
        startTime: "2019-08-12T01:57:23Z"
        status: Ongoing
      phase: Pending
      podLastOperation: {}
  lastClusterAction: ScaleUp
  lastClusterActionStatus: Ongoing
  phase: Pending
  seedlist:
  - hard-anti-affinity-dc1-rack1-0.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-1.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-2.hard-anti-affinity.default

The status reports phase: Pending. The new pod has been created but not scheduled. Let’s take a look at the new pod:

$ kubectl get pod hard-anti-affinity-dc1-rack1-3 -o yaml
apiVersion: v1
kind: Pod
...
status:
  conditions:
  - lastProbeTime: "2019-08-12T02:15:45Z"
    lastTransitionTime: "2019-08-12T02:14:41Z"
    message: '0/3 nodes are available: 3 node(s) didn''t match pod affinity/anti-affinity,
      3 node(s) didn''t satisfy existing pods anti-affinity rules.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Guaranteed

The message tells us that no node satisfies the anti-affinity rule; therefore, the hard-anti-affinity-dc1-rack1-3 pod cannot run.
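The same explanation shows up in the pod's events, which is usually the quickest place to look when a pod is stuck in Pending. To get back to a schedulable state, you could add a fourth Kubernetes node or simply revert nodesPerRacks to 3 and re-apply the manifest:

# The scheduler's message also appears under Events
$ kubectl describe pod hard-anti-affinity-dc1-rack1-3

# One way to recover: set nodesPerRacks back to 3 in hard-anti-affinity.yaml, then
$ kubectl apply -f hard-anti-affinity.yaml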

Hard Affinity

We are going to create a CassandraCluster that consists of a single data center with two racks. Each rack will be placed in a separate availability zone.

# hard-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-affinity
  labels:
    cluster: hard-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi
  topology:
    dc:
    - name: dc1
      rack:
      - name: rack1
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-b
      - name: rack2
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-c

This is the first time I have used the labels property of a rack. Any label can be specified; however, the label used here is not arbitrary.

Nodes in Kubernetes come pre-populated with a handful of labels, one of which is failure-domain.beta.kubernetes.io/zone. We are using the rack labels to tell CassKop to place rack1 in zone us-east1-b and rack2 in zone us-east1-c.
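You can confirm the zone assignments yourself by asking kubectl to print that label as an extra column:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone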

CassKop will create two StatefulSets for hard-affinity, one for each rack. Let’s take a look at the spec.template.spec.affinity property for one of the StatefulSets, hard-affinity-dc1-rack1:

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: hard-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: hard-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - us-east1-b
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: hard-affinity
                  cluster: hard-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100

CassKop has declared a requiredDuringSchedulingIgnoredDuringExecution rule which says that pods must be scheduled on nodes in zone us-east1-b. We also have the anti-affinity rule that tells the scheduler to make a best effort to schedule pods on separate nodes.

CassKop creates a similar affinity rule for hard-affinity-dc1-rack2 but the value in the match expression is set to us-east1-c.
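Abbreviated, the node affinity in the hard-affinity-dc1-rack2 StatefulSet should therefore look like this:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        operator: In
        values:
        - us-east1-c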

Let’s check the cluster status with nodetool:

$ kubectl exec hard-affinity-dc1-rack2-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 25)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Join
-- Address ID
UN hard-affinity-dc1-rack2-1 rack2
UN hard-affinity-dc1-rack1-2 rack1
UN hard-affinity-dc1-rack2-0 rack2
UN hard-affinity-dc1-rack1-1 rack1
UN hard-affinity-dc1-rack2-2 rack2
UN hard-affinity-dc1-rack1-0 rack1

The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. We have six Cassandra nodes spread across two racks.

The nodetool output, however, does not provide enough information to confirm that the pods are running in the desired zones. Let's dig a little deeper to verify that the pods are running where we expect them to be.

The Kubernetes cluster used for this example is made up of the following nodes:

Node                                              | Zone
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-018f2d3d-0lbl | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-6lnk | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-7760 | us-east1-d
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-112f0d40-564v | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-db52 | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-fdmg | us-east1-c
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-8060f7a0-j6sl | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-qgl2 | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-xm4t | us-east1-b

Although this cluster has 3 nodes in us-east1-d, no Cassandra pods should be running in that zone.

Now let’s query for pods:

$ kubectl get pods -l cassandracluster=hard-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                       NODE
hard-affinity-dc1-rack1-0  gke-cass-dev-multizone-default-pool-8060f7a0-j6sl
hard-affinity-dc1-rack1-1  gke-cass-dev-multizone-default-pool-8060f7a0-xm4t
hard-affinity-dc1-rack1-2  gke-cass-dev-multizone-default-pool-8060f7a0-qgl2
hard-affinity-dc1-rack2-0  gke-cass-dev-multizone-default-pool-112f0d40-db52
hard-affinity-dc1-rack2-1  gke-cass-dev-multizone-default-pool-112f0d40-fdmg
hard-affinity-dc1-rack2-2  gke-cass-dev-multizone-default-pool-112f0d40-564v

The pods in rack1 are all running on nodes in zone us-east1-b, and the pods in rack2 are running on nodes in zone us-east1-c. All pods are running where we expect them to be running!

Conclusion

Different types of applications have different performance profiles and different resource requirements. Cassandra boasts high performance, but that performance depends heavily on the available CPU, memory, and I/O. CassKop provides a flexible and intuitive way to make sure Cassandra runs on specific Kubernetes nodes. These scheduling capabilities are essential for effectively running Cassandra in Kubernetes for production workloads.
