Advanced Rollout Strategies: Custom Methods for Stateful Apps in Kubernetes


In a previous blog post—A Simple Kubernetes Admission Webhook—I discussed the process of creating a Kubernetes webhook without relying on Kubebuilder. At Slack, we use this webhook for various tasks, like helping us support long-lived Pods (see Supporting Long-Lived Pods), and today, I delve once more into the topic of long-lived Pods, focusing on our approach to deploying stateful applications via custom resources managed by Kubebuilder.

Lack of control

Many of our teams at Slack use StatefulSets to run their applications with stateful storage, as StatefulSets are a natural fit for distributed caches, databases, and other stateful services that rely on unique Pod identity and persistent external volumes.

Natively in Kubernetes, there are two ways of rolling out StatefulSets, two update strategies, set via the .spec.updateStrategy field:

  • When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet's .spec.template.

  • The RollingUpdate update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update strategy.

RollingUpdate comes packed with features like Partitions (percent-based rollouts) and .spec.minReadySeconds to slow down the pace of rollouts. Unfortunately, the maxUnavailable field for StatefulSet is still alpha and gated by the MaxUnavailableStatefulSet api-server feature flag, making it unavailable for use in AWS EKS at the time of this writing.

This means that using RollingUpdate only lets us roll out one Pod at a time, which can be excruciatingly slow for deploying applications with hundreds of Pods.

OnDelete, however, lets the user control the rollout by deleting the Pods themselves, but doesn't come with RollingUpdate's bells and whistles like percent-based rollouts.
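For reference, here is how each strategy is selected in a StatefulSet manifest (minimal fragments, not complete specs):

```yaml
# OnDelete: the controller waits for Pods to be deleted manually
spec:
  updateStrategy:
    type: OnDelete
---
# RollingUpdate (the default): automated, one-Pod-at-a-time updates;
# partition keeps Pods with an ordinal below N on the old revision
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 3
```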

Our internal teams at Slack have been asking us for more controlled rollouts: they wanted faster percent-based rollouts, faster rollbacks, the ability to pause rollouts, an integration with our internal service discovery (Consul), and of course, an integration with Slack to update teams on rollout status.

Bedrock Rollout Operator

So we built the Bedrock Rollout Operator: a Kubernetes operator that manages StatefulSet rollouts. Bedrock is our internal platform; it offers Slack engineers opinionated configuration for Kubernetes deployments via a simple config interface and powerful, easy-to-use integrations with the rest of Slack, such as:

…and it has nothing to do with AWS' new generative AI service of the same name!

We built this operator with Kubebuilder, and it manages a custom resource named StatefulsetRollout. The StatefulsetRollout resource contains the StatefulSet spec as well as extra parameters to provide various additional features, like pause and Slack notifications. We'll look at an example in a later section of this post.


At Slack, engineers deploy their applications to Kubernetes by using our internal Bedrock tooling. As of this writing, Slack has over 200 Kubernetes clusters, over 50 stateless services (Deployments), and nearly 100 stateful services (StatefulSets). The operator is deployed to each cluster, which lets us control who can deploy where. The diagram below is a simplification showing how the pieces fit together:

Rollout flow

Following the diagram above, here's an end-to-end example of a StatefulSet rollout.

1. bedrock.yaml

First, Slack engineers declare their intentions in a `bedrock.yaml` config file stored in their app repository on our internal GitHub. Here's an example:

```yaml
dockerfile: Dockerfile
notifications:          # parent key names in this excerpt are assumed;
  level: "debug"        # only the leaf fields survive in the original
  channel: "#devel-rollout-operator-notifications"
kind: StatefulSet
disruption_policy:
  max_unavailable: 50%
containers:
  - image: bedrock-tester
rollout:
  strategy: OnDelete
  min_pod_eviction_interval_seconds: 10
  percents:
    - 1
    - 50
    - 100
clusters:
  - playground
replicas: 2
```

2. Release UI

Then, they go to our internal deploy UI to effect a deployment:

3. Bedrock API

The Release platform then calls the Bedrock API, which parses the user's bedrock.yaml and generates a StatefulsetRollout resource:

```yaml
kind: StatefulsetRollout
metadata:
  annotations:
    # branch: master; origin: git@github.com:slack/bedrock-tester.git
    # (annotation keys abridged)
  labels:
    app: bedrock-tester-sts-dev
    # plus a version label: v1.custom-1709074522
  name: bedrock-tester-sts-dev
  namespace: default
spec:
  bapi:
    bapiUrl: http://bedrock-api.internal.url
    stageId: 2dD2a0GTleDCxkfFXD3n0q9msql
  channel: '#devel-rollout-operator-notifications'
  minPodEvictionIntervalSeconds: 10
  pauseRequested: false
  percent: 25
  rolloutIdentity: GbTdWjQYgiiToKdoWDLN
  serviceDiscovery:
    dc: cloud1
    services:           # list key name assumed
      - bedrock-tester-sts
  sts:
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      annotations:
        # origin: git@github.com:slack/bedrock-tester.git
      labels:
        app: bedrock-tester-sts-dev
      name: bedrock-tester-sts-dev
      namespace: default
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: bedrock-tester-sts-dev
      template:
        metadata:
          annotations:
            # origin: git@github.com:slack/bedrock-tester.git
          labels:
            app: bedrock-tester-sts-dev
        spec:
          containers:
            - name: bedrock-tester
      updateStrategy:
        type: OnDelete
```

Let's look at the fields at the top level of the StatefulsetRollout spec, which provide the extra functionality:

  • bapi: This section contains the details needed to call back to the Bedrock API once a rollout is complete or has failed
  • channel: The Slack channel to send notifications to
  • minPodEvictionIntervalSeconds: Optional; the time to wait between each Pod rotation
  • pauseRequested: Optional; pauses an ongoing rollout if set to true
  • percent: Set to 100 to roll out all Pods, or less for a percent-based deploy
  • rolloutIdentity: We pass a randomly generated string to each rollout as a way to enable retries when a rollout has failed but the issue was transient
  • serviceDiscovery: This section contains the details related to the service's Consul registration; this is needed to query Consul for the health of the service as part of the rollout

Note that the disruption_policy.max_unavailable that was present in the bedrock.yaml doesn't show up in the custom resource. Instead, it's used to create a Pod disruption budget. At run-time, the operator reads the Pod disruption budget of the managed service to determine how many Pods it can roll out in parallel.
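The parallelism calculation can be sketched roughly like this (an illustrative Go sketch under stated assumptions, not the operator's actual code; the function name and the percent-parsing rules are ours):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxParallelEvictions returns how many Pods may be rolled in parallel,
// given the replica count and a PDB-style maxUnavailable value that is
// either an absolute number ("2") or a percentage ("50%").
func maxParallelEvictions(replicas int, maxUnavailable string) (int, error) {
	if strings.HasSuffix(maxUnavailable, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnavailable, "%"))
		if err != nil {
			return 0, fmt.Errorf("bad percentage %q: %w", maxUnavailable, err)
		}
		n := replicas * pct / 100
		if n < 1 {
			n = 1 // always allow some progress
		}
		return n, nil
	}
	n, err := strconv.Atoi(maxUnavailable)
	if err != nil {
		return 0, fmt.Errorf("bad value %q: %w", maxUnavailable, err)
	}
	return n, nil
}

func main() {
	n, _ := maxParallelEvictions(4, "50%")
	fmt.Println(n) // 2
}
```

With the example above (4 replicas, max_unavailable: 50%), the operator could rotate two Pods at a time instead of StatefulSet's native one-by-one pace.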

4. Bedrock Rollout Operator

Then, the Bedrock Rollout Operator takes over and converges the current state of the cluster to the desired state defined in the StatefulsetRollout. See "The reconcile loop" section below for more details.

5. Slack notifications

We used Block Kit Builder to design rich Slack notifications that inform users in real time of the status of the ongoing rollout, providing details like the version number and the list of Pods being rolled out:

6. Callbacks

While Slack notifications are great for end users, our systems also need to know the state of the rollout. Once finished converging a StatefulsetRollout resource, the operator calls back to the Bedrock API to inform it of the success or failure of the rollout. Bedrock API then sends a callback to Release so the status of the rollout can be reflected in the UI.

The reconcile loop

The Bedrock Rollout Operator watches the StatefulsetRollout resource representing the desired state of the world and reconciles it against the real world. This means, for example, creating a new StatefulSet if there isn't one, or triggering a new rollout. A typical rollout is done by applying a new StatefulSet spec and then terminating a desired number of Pods (half of them in our percent: 50 example).

The core functionality of the operator lies within the reconcile loop, in which it:

  1. Looks at the expected state: the spec of the custom resource
  2. Looks at the state of the world: the spec of the StatefulSet and of its Pods
  3. Takes actions to move the world closer to the expected state, for example by:
    • Updating the StatefulSet with the latest spec provided by the user; or by
    • Evicting Pods to get them replaced by Pods running the newer version of the application being rolled out
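The steps above boil down to a single decision per pass. Here's a hedged sketch of that decision (the type, names, and parameters are ours for illustration, not the operator's actual code):

```go
package main

import "fmt"

// Action is the single step one reconcile pass takes.
type Action string

const (
	UpdateStatefulSet Action = "UpdateStatefulSet" // apply the user's latest spec
	EvictPods         Action = "EvictPods"         // replace Pods running the old version
	Done              Action = "Done"              // expected state reached
)

// nextAction compares expected state against observed state and picks
// exactly one action, mirroring steps 1-3 above. oldVersionPods is the
// observed count of Pods on the previous version; target is how many
// old-version Pods may remain (non-zero for a percent-based rollout).
func nextAction(specCurrent bool, oldVersionPods, target int) Action {
	if !specCurrent {
		return UpdateStatefulSet
	}
	if oldVersionPods > target {
		return EvictPods
	}
	return Done
}

func main() {
	fmt.Println(nextAction(false, 4, 2)) // spec is stale: update it first
	fmt.Println(nextAction(true, 4, 2))  // spec applied, old Pods remain: evict
	fmt.Println(nextAction(true, 2, 2))  // rollout target met
}
```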

The reconciliation loop starts when the custom resource is updated. Typically, Kubernetes controllers also watch the resources they look after and work in an event-driven fashion. Here, this would mean watching the StatefulSet and its Pods: each time one of them got updated, the reconcile loop would run.

But instead of working in this event-driven fashion, we decided to enqueue the next reconcile loop ourselves: as long as we're expecting change, we re-enqueue a request for some point in the future. Once we reach a final state like RolloutDone or RolloutFailed, we simply exit without re-enqueueing. Working this way has a few advantages and leads to far fewer reconciliations. It also enforces that reconciliations run sequentially for a given custom resource, which dodges the race conditions brought by mutating a given custom resource in reconcile loops running in parallel.

Here's a non-exhaustive flow chart illustrating how it works for our StatefulsetRollout (Sroll for short) custom resource:

As you can see, we try to do as little as we can in each reconciliation loop: we take one action and re-enqueue a request a few seconds in the future. This works well because it keeps each loop fast and as simple as possible, which makes the operator resilient to disruptions. We achieve this by saving the last decision the loop took, namely the `Phase` information, in the status of the custom resource. Here's what the StatefulsetRollout status struct looks like:

```go
// StatefulsetRolloutStatus defines the observed state of StatefulsetRollout
type StatefulsetRolloutStatus struct {
	// Phase is a high-level summary of where the StatefulsetRollout is in its lifecycle.
	Phase RolloutPhase `json:"phase,omitempty"`
	// PercentRequested should match Spec.Percent at the end of a rollout
	PercentRequested int `json:"percentDeployed,omitempty"`
	// A human-readable message indicating details about why the StatefulsetRollout is in this phase.
	Reason string `json:"reason,omitempty"`
	// The number of Pods currently showing ready in kube
	ReadyReplicas int `json:"readyReplicas"`
	// The number of Pods currently showing ready in service discovery
	ReadyReplicasServiceDiscovey int `json:"readyReplicasRotor,omitempty"`
	// Paused indicates that the rollout has been paused
	Paused bool `json:"paused,omitempty"`
	// Deleted indicates that the statefulset under management has been deleted
	Deleted bool `json:"deleted,omitempty"`
	// The list of Pods owned by the managed sts
	Pods []Pod `json:"Pods,omitempty"`
	// ReconcileAfter indicates if the controller should enqueue a reconcile for a future time
	ReconcileAfter *metav1.Time `json:"reconcileAfter,omitempty"`
	// LastUpdated is the time at which the status was last updated
	LastUpdated *metav1.Time `json:"lastUpdated"`
	// LastCallbackStageId is the BAPI stage ID of the last callback sent
	LastCallbackStageId string `json:"lastCallbackStageId,omitempty"`
	// BuildMetadata like branch and commit sha
	BuildMetadata BuildMetadata `json:"buildMetadata,omitempty"`
	// SlackMessage is used to update an existing message in Slack
	SlackMessage *SlackMessage `json:"slackMessage,omitempty"`
	// ConsulServices tracks whether the consul services specified in spec.ServiceDiscovery exist;
	// will be nil if no services exist in service discovery
	ConsulServices []string `json:"consulServices,omitempty"`
	// StatefulsetName tracks the name of the statefulset under management.
	// If no statefulset exists that matches the expected metadata, this field is left blank
	StatefulsetName string `json:"statefulsetName,omitempty"`
	// True if the spec of the statefulset under management matches the sts spec in StatefulsetRolloutSpec.sts.spec
	StatefulsetSpecCurrent bool `json:"statefulsetSpecCurrent,omitempty"`
	// RolloutIdentity is the identity of the rollout requested by the user
	RolloutIdentity string `json:"rolloutIdentity,omitempty"`
}
```
This status struct is how we keep track of everything, so we save a lot of metadata here — everything from the Slack message ID to a list of managed Pods that includes which version each one is currently running.

Limitations and learnings

Supporting large apps

Slack manages a significant amount of traffic, which we back with robust services running on our Bedrock platform built on Kubernetes:


This gives an example of the scale we're dealing with. Yet, we were surprised when we found that some of our StatefulSets spin up to 1,000 Pods, which caused our Pod-by-Pod notifications to get rate limited, as we were sending one Slack message per Pod and rotating up to 100 Pods in parallel! This forced us to rewrite the notifications stack in the operator: we introduced pagination and moved to sending messages containing up to 50 Pods.
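The pagination itself is straightforward; a minimal sketch of the chunking step (the function name and shape are ours, not the operator's actual code):

```go
package main

import "fmt"

// paginate splits a list of Pod names into pages of at most pageSize,
// so one Slack message can cover up to 50 Pods instead of one per Pod.
func paginate(pods []string, pageSize int) [][]string {
	var pages [][]string
	for start := 0; start < len(pods); start += pageSize {
		end := start + pageSize
		if end > len(pods) {
			end = len(pods)
		}
		pages = append(pages, pods[start:end])
	}
	return pages
}

func main() {
	pods := make([]string, 120)
	for i := range pods {
		pods[i] = fmt.Sprintf("pod-%d", i)
	}
	pages := paginate(pods, 50)
	fmt.Println(len(pages)) // 3 pages: 50 + 50 + 20
}
```

Rotating 1,000 Pods then costs 20 Slack messages rather than 1,000, which stays comfortably under the rate limits.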

Version leak

Some of you might have picked up on a not-so-subtle detail related to the (ab)use of the OnDelete strategy for StatefulSets: what we internally call the version leak issue. When a user decides to do a percent-based rollout, or pauses an existing rollout, the StatefulSet is left with some Pods running the new version and some Pods running the previous version. But if a Pod running the previous version gets terminated for any reason other than being rolled out by the operator, it'll get replaced by a Pod running the new version. Since we routinely terminate nodes for numerous reasons such as scaling clusters, rotating nodes for compliance, and chaos engineering, a stopped rollout will, over time, tend to converge towards being fully rolled out. Fortunately, this is a well-understood limitation, and Slack engineering teams deploy their services out to 100% in a timely manner before the version leak problem would arise.

What's next?

We've found the Kubernetes operator model to be effective, so we have chosen to manage all Kubernetes deployments using this approach. This doesn't necessarily involve extending our StatefulSet operator. Instead, for managing Deployment resources, we're exploring existing CNCF projects such as Argo Rollouts and OpenKruise.


Implementing custom rollout logic in a Kubernetes operator is not simple work, and upcoming Kubernetes features like the maxUnavailable field for StatefulSet might, in the future, let us pull out some of our custom code. Managing rollouts in an operator is a model that we're happy with, since the operator allows us to easily send Slack notifications about the state of rollouts as well as integrate with some of our other internal systems like Consul. Since this pattern has worked well for us, we aim to expand the use of the operator in the future.

Love Kube and deploy systems? Come join us! Apply now