Create Affinity between Cassandra and Kubernetes

John Sanda
11 min read · Aug 12, 2019

In my previous post, we looked at how to deploy Cassandra clusters in Kubernetes using CassKop. CassKop makes it easy to configure different topologies for Cassandra. In this post we continue with the theme of topology configuration. We will explore how to leverage CassKop to make sure Cassandra pods run on particular Kubernetes nodes and in specific availability zones. CassKop makes this possible by using labels and affinity rules.

Labels in Kubernetes are key/value pairs that can be added to objects such as pods, services, and even the nodes on which pods run. Labels help organize, select, and filter objects in a Kubernetes cluster.
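For example, we could attach a custom label to a node and then select only the nodes that carry it (the disktype=ssd label and the <node-name> placeholder are purely illustrative):

$ kubectl label node <node-name> disktype=ssd

$ kubectl get nodes -l disktype=ssd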

Affinity rules are scheduling rules that help determine on what nodes pods should or must run. Conversely anti-affinity rules are scheduling rules that help determine on what nodes pods should not or must not run. For a more detailed overview, see the official docs.

For cloud-based deployments of Cassandra it is common to organize racks by availability zone. This is done to improve availability and fault tolerance. For example, let’s say we have a nine node cluster spread across three racks, with each rack in a different availability zone. Let’s also suppose that the application keyspace(s) use replication_factor = 3. This gives us a full copy of the data in each rack. The cluster can withstand the loss of an entire availability zone and still maintain strong consistency.
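As a sketch, such a keyspace could be created like this (the keyspace name app is a placeholder; dc1 matches the data center name used in the examples below):

CREATE KEYSPACE app
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};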

With all that in mind, onto some examples!

Soft Anti-affinity

We will start with an easy example and work progressively through more complex ones. This first example demonstrates a should or soft anti-affinity rule.

# soft-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: soft-anti-affinity
  labels:
    cluster: soft-anti-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi

Note: For this and the next example, I am using a three-node Kubernetes cluster that is running in a single availability zone.

Create the CassandraCluster with:

$ kubectl apply -f soft-anti-affinity.yaml

As discussed in my previous article, CassKop creates a StatefulSet which in turn will create and manage the Cassandra pods:

$ kubectl get statefulset -l cassandracluster=soft-anti-affinity
NAME                           READY   AGE
soft-anti-affinity-dc1-rack1   3/3     16m

Let’s take a closer look at the StatefulSet soft-anti-affinity-dc1-rack1:

$ kubectl get statefulset soft-anti-affinity-dc1-rack1 -o yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-08-10T03:04:32Z"
  generation: 1
  labels:
    app: cassandracluster
    cassandracluster: soft-anti-affinity
    cassandraclusters.db.orange.com.dc: dc1
    cassandraclusters.db.orange.com.rack: rack1
    cluster: soft-anti-affinity
    dc-rack: dc1-rack1
  name: soft-anti-affinity-dc1-rack1
  namespace: default
  ownerReferences:
  - apiVersion: db.orange.com/v1alpha1
    controller: true
    kind: CassandraCluster
    name: soft-anti-affinity
    uid: 9373658d-bb1b-11e9-8455-42010a8e0041
  resourceVersion: "14661"
  selfLink: /apis/apps/v1/namespaces/default/statefulsets/soft-anti-affinity-dc1-rack1
  uid: 939d5e1f-bb1b-11e9-8455-42010a8e0041
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: cassandracluster
      cassandracluster: soft-anti-affinity
      cassandraclusters.db.orange.com.dc: dc1
      cassandraclusters.db.orange.com.rack: rack1
      cluster: soft-anti-affinity
      dc-rack: dc1-rack1
  serviceName: soft-anti-affinity
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: soft-anti-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: soft-anti-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: soft-anti-affinity
                  cluster: soft-anti-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - env:
        - name: CASSANDRA_MAX_HEAP
          value: 125M
        - name: CASSANDRA_SEEDS
          value: soft-anti-affinity-dc1-rack1-0.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-1.soft-anti-affinity.default,soft-anti-affinity-dc1-rack1-2.soft-anti-affinity.default
        - name: CASSANDRA_CLUSTER_NAME
          value: soft-anti-affinity
        - name: CASSANDRA_AUTO_BOOTSTRAP
          value: "true"
        - name: CASSANDRA_AUTHENTICATOR
          value: PasswordAuthenticator
        - name: CASSANDRA_AUTHORIZER
          value: CassandraAuthorizer
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SERVICE_NAME
          value: soft-anti-affinity-dc1-rack1
        - name: CASSANDRA_GC_STDOUT
          value: "false"
        - name: CASSANDRA_NUM_TOKENS
          value: "256"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CASSANDRA_DC
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.dc']
        - name: CASSANDRA_RACK
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.labels['cassandraclusters.db.orange.com.rack']
        image: orangeopensource/cassandra-image:latest
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - /etc/cassandra/pre_stop.sh
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - nodetool status
          failureThreshold: 3
          initialDelaySeconds: 120
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 20
        name: cassandra
        ports:
        - containerPort: 7000
          name: intra-node
          protocol: TCP
        - containerPort: 7001
          name: intra-node-tls
          protocol: TCP
        - containerPort: 7199
          name: jmx-port
          protocol: TCP
        - containerPort: 9042
          name: cql
          protocol: TCP
        - containerPort: 1234
          name: promjmx
          protocol: TCP
        - containerPort: 8778
          name: jolokia
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: '1'
            memory: 1Gi
          requests:
            cpu: '1'
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
          procMount: Default
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1
        runAsNonRoot: true
        runAsUser: 1000
      terminationGracePeriodSeconds: 1800
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updateRevision: soft-anti-affinity-dc1-rack1-78d85d9586
  updatedReplicas: 3

There is a lot here, but we just want to focus on one part, namely the spec.template.spec.affinity property:

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: soft-anti-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: soft-anti-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: soft-anti-affinity
                  cluster: soft-anti-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100

For the soft-anti-affinity cluster, CassKop did not specify any node affinity rules. This is evident from the fact that the nodeAffinity property is empty.

CassKop does however declare anti-affinity rules by setting the podAntiAffinity property. The preferredDuringSchedulingIgnoredDuringExecution rule says that a pod should not be scheduled on a node if it is already running a pod that has labels that match the label selector. In other words, the Kubernetes scheduler should not schedule two Cassandra pods from soft-anti-affinity on the same node, but it will put them on the same node if there are insufficient resources to run them on separate nodes.

topologyKey defines the topology domain over which the rule is evaluated. Because the key here is kubernetes.io/hostname, each node is its own domain, so the label matching is applied per node and the scheduler tries to spread matching pods across nodes.
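As an aside, swapping the topology key changes what "spread out" means. The sketch below, which is not what CassKop generates for this cluster, shows the same soft rule applied at the zone level instead of the node level:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchLabels:
          app: cassandracluster
          cassandracluster: soft-anti-affinity
      topologyKey: failure-domain.beta.kubernetes.io/zone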

A soft anti-affinity rule like this can be convenient for development/testing scenarios in which it may be acceptable to run multiple Cassandra instances on the same Kubernetes node.

Note that while this anti-affinity rule results in the scheduler making a best effort to place pods from soft-anti-affinity onto separate nodes, it says nothing about pods from other Cassandra clusters. For example, let’s say we create another CassandraCluster, another-cluster. The scheduler will happily place pods from soft-anti-affinity and another-cluster on the same node.
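One way to see this for yourself (assuming another-cluster has actually been created) is to list every CassKop-managed pod along with the node it landed on; they all carry the app=cassandracluster label:

$ kubectl get pods -l app=cassandracluster -o wide | awk {'print $1" "$7'} | column -t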

Let’s check the cluster status with nodetool:

$ kubectl exec soft-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN soft-anti-affinity-dc1-rack1-2 rack1
UN soft-anti-affinity-dc1-rack1-0 rack1
UN soft-anti-affinity-dc1-rack1-1 rack1

There are a few things to note. First, the Cassandra cluster is healthy. All three Cassandra nodes are in the Up/Normal state.

Secondly, there is a fair amount of formatting applied to the output. The motivation is to make the output more readable by focusing only on the most relevant parts.

Lastly, we use the -r flag with nodetool status in order to display the domain name, which includes the pod name. This will help with determining on which Kubernetes node each Cassandra pod is running.

Now let’s query for pods:

$ kubectl get pods -l cassandracluster=soft-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                            NODE
soft-anti-affinity-dc1-rack1-0  gke-cass-dev-default-pool-daf646ab-j2m3
soft-anti-affinity-dc1-rack1-1  gke-cass-dev-default-pool-daf646ab-5c6n
soft-anti-affinity-dc1-rack1-2  gke-cass-dev-default-pool-daf646ab-sjbh

Each pod is running on a different node. If we add another instance to the soft-anti-affinity cluster, Kubernetes will schedule the pod on one of the nodes, as long as there are sufficient resources.
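If you want to try it yourself, a minimal sketch (mirroring what we do with the hard anti-affinity cluster in the next section) is to bump nodesPerRacks from 3 to 4 in soft-anti-affinity.yaml and re-apply:

$ kubectl apply -f soft-anti-affinity.yaml

Because the rule is only a preference, the fourth pod will land on a node that already hosts a Cassandra pod, provided that node has enough free CPU and memory.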

Hard Anti-Affinity

In this next example, we are going to make use of the hardAntiAffinity property of CassandraCluster for the first time:

# hard-anti-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-anti-affinity
  labels:
    cluster: hard-anti-affinity
spec:
  nodesPerRacks: 3
  hardAntiAffinity: true
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi

hardAntiAffinity defaults to false. When set to true CassKop will generate a must or hard anti-affinity rule.

CassKop creates a StatefulSet named hard-anti-affinity-dc1-rack1. Let’s take a look at its spec.template.spec.affinity property:

affinity:
  nodeAffinity: {}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: cassandracluster
          cassandracluster: hard-anti-affinity
          cluster: hard-anti-affinity
      topologyKey: kubernetes.io/hostname

This time we have a requiredDuringSchedulingIgnoredDuringExecution rule, which is a hard requirement. The scheduler will not run two pods from hard-anti-affinity on the same node.

Let’s check the cluster status with nodetool:

$ kubectl exec hard-anti-affinity-dc1-rack1-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 30)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/M
-- Address ID
UN hard-anti-affinity-dc1-rack1-2 rack1
UN hard-anti-affinity-dc1-rack1-0 rack1
UN hard-anti-affinity-dc1-rack1-1 rack1

The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. Now let’s query for pods:

$ kubectl get pods -l cassandracluster=hard-anti-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                            NODE
hard-anti-affinity-dc1-rack1-0  gke-cass-dev-default-pool-daf646ab-j2m3
hard-anti-affinity-dc1-rack1-1  gke-cass-dev-default-pool-daf646ab-5c6n
hard-anti-affinity-dc1-rack1-2  gke-cass-dev-default-pool-daf646ab-sjbh

Each pod is running on a different node.

Let’s see what happens if we try to add a fourth Cassandra node. We can do this by changing the spec.nodesPerRacks property in hard-anti-affinity.yaml to 4 and then running:

$ kubectl apply -f hard-anti-affinity.yaml

Check the status of the CassandraCluster:

$ kubectl get cassandracluster hard-anti-affinity -o yaml
apiVersion: db.orange.com/v1alpha1
kind: CassandraCluster
...
status:
  cassandraRackStatus:
    dc1-rack1:
      cassandraLastAction:
        Name: ScaleUp
        startTime: "2019-08-12T01:57:23Z"
        status: Ongoing
      phase: Pending
      podLastOperation: {}
  lastClusterAction: ScaleUp
  lastClusterActionStatus: Ongoing
  phase: Pending
  seedlist:
  - hard-anti-affinity-dc1-rack1-0.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-1.hard-anti-affinity.default
  - hard-anti-affinity-dc1-rack1-2.hard-anti-affinity.default

The status reports phase: Pending. The new pod has been created but not scheduled. Let’s take a look at the new pod:

$ kubectl get pod hard-anti-affinity-dc1-rack1-3 -o yaml
apiVersion: v1
kind: Pod
...
status:
  conditions:
  - lastProbeTime: "2019-08-12T02:15:45Z"
    lastTransitionTime: "2019-08-12T02:14:41Z"
    message: '0/3 nodes are available: 3 node(s) didn''t match pod affinity/anti-affinity,
      3 node(s) didn''t satisfy existing pods anti-affinity rules.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Guaranteed

The message tells us that no node satisfies the anti-affinity rule; therefore, the hard-anti-affinity-dc1-rack1-3 pod cannot run.
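The same explanation shows up in the pod's events, which is usually the quickest place to look when a pod is stuck in Pending. To get back to a schedulable state, you could add a fourth Kubernetes node or simply revert nodesPerRacks to 3 and re-apply the manifest:

# The scheduler's message also appears under Events
$ kubectl describe pod hard-anti-affinity-dc1-rack1-3

# One way to recover: set nodesPerRacks back to 3 in hard-anti-affinity.yaml, then
$ kubectl apply -f hard-anti-affinity.yaml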

Hard Affinity

We are going to create a CassandraCluster that consists of a single data center with two racks. Each rack will be placed in a separate availability zone.

# hard-affinity.yaml
apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: hard-affinity
  labels:
    cluster: hard-affinity
spec:
  nodesPerRacks: 3
  resources:
    requests:
      cpu: '1'
      memory: 1Gi
    limits:
      cpu: '1'
      memory: 1Gi
  topology:
    dc:
    - name: dc1
      rack:
      - name: rack1
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-b
      - name: rack2
        labels:
          failure-domain.beta.kubernetes.io/zone: us-east1-c

This is the first time I have used the labels property of a rack. Any label can be specified; however, the label used here is not arbitrary.

Nodes in Kubernetes come pre-populated with a handful of labels, one of which is failure-domain.beta.kubernetes.io/zone. We are using the rack labels to tell CassKop to place rack1 in zone us-east1-b and rack2 in zone us-east1-c.
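You can confirm the zone assignments yourself by asking kubectl to print that label as an extra column:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone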

CassKop will create two StatefulSets for hard-affinity, one for each rack. Let’s take a look at the spec.template.spec.affinity property for one of the StatefulSets, hard-affinity-dc1-rack1:

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cassandracluster
        cassandracluster: hard-affinity
        cassandraclusters.db.orange.com.dc: dc1
        cassandraclusters.db.orange.com.rack: rack1
        cluster: hard-affinity
        dc-rack: dc1-rack1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - us-east1-b
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cassandracluster
                  cassandracluster: hard-affinity
                  cluster: hard-affinity
              topologyKey: kubernetes.io/hostname
            weight: 100

CassKop has declared a requiredDuringSchedulingIgnoredDuringExecution rule which says that pods must be scheduled on nodes in zone us-east1-b. We also have the anti-affinity rule that tells the scheduler to make a best effort to schedule pods on separate nodes.

CassKop creates a similar affinity rule for hard-affinity-dc1-rack2 but the value in the match expression is set to us-east1-c.
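Abbreviated, the node affinity in the hard-affinity-dc1-rack2 StatefulSet should therefore look like this:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        operator: In
        values:
        - us-east1-c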

Let’s check the cluster status with nodetool:

$ kubectl exec hard-affinity-dc1-rack2-0 -- nodetool status -r | awk {'print $1" "substr($2, 0, 25)" "$8'} | column -t
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Join
-- Address ID
UN hard-affinity-dc1-rack2-1 rack2
UN hard-affinity-dc1-rack1-2 rack1
UN hard-affinity-dc1-rack2-0 rack2
UN hard-affinity-dc1-rack1-1 rack1
UN hard-affinity-dc1-rack2-2 rack2
UN hard-affinity-dc1-rack1-0 rack1

The Cassandra cluster looks healthy. All nodes are in the Up/Normal state. We have six Cassandra nodes spread across two racks.

The nodetool output, however, does not provide enough information to confirm that the pods are running in the desired zones. Let's dig a little deeper to verify that the pods are running where we expect them to be.

The Kubernetes cluster used for this example is made up of the following nodes:

Node                                              | Zone
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-018f2d3d-0lbl | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-6lnk | us-east1-d
gke-cass-dev-multizone-default-pool-018f2d3d-7760 | us-east1-d
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-112f0d40-564v | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-db52 | us-east1-c
gke-cass-dev-multizone-default-pool-112f0d40-fdmg | us-east1-c
--------------------------------------------------|-----------
gke-cass-dev-multizone-default-pool-8060f7a0-j6sl | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-qgl2 | us-east1-b
gke-cass-dev-multizone-default-pool-8060f7a0-xm4t | us-east1-b

Although this cluster has 3 nodes in us-east1-d, no Cassandra pods should be running in that zone.

Now let’s query for pods:

$ kubectl get pods -l cassandracluster=hard-affinity -o wide | awk {'print $1" "$7'} | column -t
NAME                       NODE
hard-affinity-dc1-rack1-0  gke-cass-dev-multizone-default-pool-8060f7a0-j6sl
hard-affinity-dc1-rack1-1  gke-cass-dev-multizone-default-pool-8060f7a0-xm4t
hard-affinity-dc1-rack1-2  gke-cass-dev-multizone-default-pool-8060f7a0-qgl2
hard-affinity-dc1-rack2-0  gke-cass-dev-multizone-default-pool-112f0d40-db52
hard-affinity-dc1-rack2-1  gke-cass-dev-multizone-default-pool-112f0d40-fdmg
hard-affinity-dc1-rack2-2  gke-cass-dev-multizone-default-pool-112f0d40-564v

The pods in rack1 are all running on nodes in zone us-east1-b, and the pods in rack2 are running on nodes in zone us-east1-c. All pods are running where we expect them to be running!

Conclusion

Different types of applications have different performance profiles and different resource requirements. Cassandra boasts high performance, but that performance depends heavily on the available CPU, memory, and I/O. CassKop provides a flexible and intuitive way to make sure Cassandra runs on specific Kubernetes nodes. These scheduling capabilities are essential for effectively running Cassandra in Kubernetes for production workloads.
