How Pinterest Leverages Honeycomb to Improve CI Observability and Enhance CI Build Stability | by Pinterest Engineering | Pinterest Engineering Blog | Dec, 2024
Oliver Koo | Staff Software Engineer
At Pinterest, our mobile infrastructure is core to delivering a high-quality experience for our users. In this blog, I’ll showcase how the Pinterest Mobile Builds team has been leveraging Honeycomb(™) since 2021 to enhance observability and performance in our mobile builds and continuous integration (CI) workflows.
Our mobile builds team relies on Honeycomb(™) as a powerful data engine to visualize build metrics, analyze trends, and make data-driven decisions. From tracking build times to categorizing errors, Honeycomb(™) gives us critical insights into our CI workflows, enabling us to proactively address issues and optimize performance.
We’ve built dashboards that establish baseline metrics, tracking key CI indicators like build times, pipeline success rates, and cluster utilization for both iOS and Android builds. While many data platforms or CI providers offer these capabilities out of the box, the real magic happens when we need to go deeper: when trends look abnormal, or when nuanced analysis is required to uncover hidden issues.
This is where Honeycomb(™) truly excels. Its intuitive query builder makes slicing and dicing data seamless, enabling us to drill into granular details with ease. Features like derived columns let us create dynamic metrics on the fly, and its performance means that even with 1 million events sent daily[1] just for our CI build dataset, most queries complete in under a second.
This visibility transforms how we understand and improve our CI processes. We can pinpoint bottlenecks, diagnose issues in near real time, and implement improvements faster than ever before, all with the confidence that we’re making informed, data-driven decisions. Honeycomb(™) doesn’t just give us data, it gives us clarity.
Here are a couple of examples of how I used Honeycomb(™) to uncover interesting patterns in our CI builds and pinpoint bottlenecks.
Spotting Bottleneck Jobs in Builds
When querying our build counts along with p95 and p50 build times in a CI pipeline, I noticed two distinct scenarios:
- On the left, there’s a spike in build count, but the p95 and p50 metrics remain unchanged. Since build times aren’t impacted, there’s no need to investigate further, allowing me to save time and focus elsewhere.
- On the right, the build volume remains consistent, but there’s a noticeable spike in the p95 build time. This deviation is worth investigating further.
By clicking into the specific build causing the spike, I can view the build trace. In Honeycomb(™) terms, a “trace” represents a complete unit of work for one or more services in an environment. In our case, the trace corresponds to a CI build, with child spans representing individual jobs within that build. These spans can in turn include child spans for job steps, such as script execution or other tasks within a job.
The trace view revealed that one job, “super secretive tests”, was taking significantly longer to complete, becoming the bottleneck and causing the spike in p95 build time. Since one slow build isn’t enough to move the p95 metric, I hypothesized that similar slowdowns were occurring across other builds. To investigate further, I searched for the Buildkite URLs using the web_url attribute in Honeycomb(™) to analyze more builds directly in Buildkite.
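The reasoning that a single slow build cannot move p95 can be checked with a small sketch. This is illustrative only: the build durations are made-up numbers, and `percentile` is a hypothetical nearest-rank helper, not how Honeycomb(™) computes its percentiles.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical build durations in seconds (not real Pinterest data).
baseline = [600] * 100                  # 100 builds of 10 minutes each
one_slow = [600] * 99 + [3600]          # a single 1-hour outlier
many_slow = [600] * 90 + [3600] * 10    # 10% of builds hit the slow job

p95_baseline = percentile(baseline, 95)
p95_one_slow = percentile(one_slow, 95)
p95_many_slow = percentile(many_slow, 95)
```

With 100 builds, the nearest-rank p95 is the 95th smallest duration, so one outlier leaves it untouched, while a slowdown affecting more than 5% of builds shifts it. That is why a visible p95 spike suggests a recurring problem rather than one unlucky build.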
You might notice this trace view is similar to the “Waterfall View” that Buildkite introduced in 2023. However, we continue to use Honeycomb’s (™) trace view for several reasons:
- Seamless Integration with Honeycomb(™): The trace view integrates directly with Honeycomb(™), allowing us to seamlessly transition from analyzing build trends to zooming into specific builds for a deeper dive.
- Flexibility and Customization: Honeycomb’s trace view lets us break Buildkite builds down not just into jobs, but into specific segments such as agent wait time and script execution. It allows us to log and analyze the parts of the build and job that are most relevant to our workflows, such as the execution of various build hooks or environment setups. We can even go deeper and instrument build scripts to log the build time of specific segments within the script. To illustrate my points, I created a demo image below using dummy data. This image demonstrates an example build where each Buildkite job is broken down into a series of executions within the job. Additionally, inside a Bazel build script, we instrumented the process to log the execution time of specific Bazel targets. If desired, you could even log the build time for each target individually. The possibilities are endless!
Later we can aggregate these segments to answer questions like, “What’s the average repo cloning time across different pipelines?” or, “What are the p50 and p95 times for Bazel build and test phases in my PR pipeline?” These are valuable observability metrics that can help your team prioritize optimizations, reduce build times, and improve overall developer productivity.
- Established Habits: We’ve been using Honeycomb’s trace view since 2021, long before the Waterfall View was introduced. By now, it’s become a familiar and trusted part of our process.
These advantages make Honeycomb’s trace view an invaluable tool for understanding our CI processes, diagnosing issues, and improving efficiency.
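As a rough illustration of instrumenting a build script by segment and aggregating the results later, here is a minimal sketch. Everything in it is an assumption for illustration: `SegmentRecorder`, `average_by`, and the field names (`name`, `duration_s`, `pipeline`) are hypothetical, and the sketch only collects events in memory rather than sending them to Honeycomb(™).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SegmentRecorder:
    """Collects one event per timed segment of a build script (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the recorder can be tested deterministically
        self.events = []

    @contextmanager
    def segment(self, name, **fields):
        """Time the enclosed block and record it, even if the block raises."""
        start = self._clock()
        try:
            yield
        finally:
            duration = self._clock() - start
            self.events.append({"name": name, "duration_s": duration, **fields})


def average_by(events, segment_name, key):
    """Average duration of one segment type, grouped by a field such as the pipeline."""
    groups = defaultdict(list)
    for event in events:
        if event["name"] == segment_name and key in event:
            groups[event[key]].append(event["duration_s"])
    return {group: sum(durations) / len(durations) for group, durations in groups.items()}
```

In a build script, each phase would be wrapped in something like `with recorder.segment("repo_clone", pipeline="ios-pr"):`, and a call such as `average_by(recorder.events, "repo_clone", "pipeline")` then answers the "average repo cloning time across pipelines" style of question. In practice the aggregation would happen in the observability backend rather than in the script itself.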
Using Correlation to Identify Potential Root Causes
Honeycomb’s correlation feature is another game changer. It allows us to overlay query results with other dashboards, creating a breadcrumb trail to identify abnormalities or outliers.
For instance, I observed a spike in p95 build times for iOS CI jobs. Using correlation, I compared the p95 data to CI cluster utilization graphs and noticed a simultaneous spike in job wait times. Honeycomb’s synchronized dotted line across graphs confirmed the alignment, leading to a strong hypothesis: long CI agent wait times were causing the build time spike.
From there, I clicked into the build trace to confirm my hypothesis. Sure enough, the trace revealed that the build experienced unusually long wait times for CI agents. By sampling more builds from the same time period, I could confirm the root cause and focus on solutions.
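The visual alignment described above can also be expressed numerically. The sketch below is not Honeycomb’s correlation feature; it is a plain Pearson correlation over two time-aligned series, using invented numbers, to show the kind of relationship the overlaid graphs reveal.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, time-aligned series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical per-hour values in seconds (not real Pinterest data).
p95_build_time = [900, 910, 905, 1500, 1480, 915]
p95_agent_wait = [30, 35, 32, 600, 580, 33]

r = pearson(p95_build_time, p95_agent_wait)
```

A coefficient close to 1 only says the two series move together. As in the investigation above, the hypothesis still needs to be confirmed by opening individual build traces.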
Without Honeycomb(™), conducting this type of investigation would be incredibly tedious, requiring a manual, build-by-build analysis. Honeycomb(™) provides a holistic view that lets us quickly pinpoint root causes, saving time and effort while improving our CI process efficiency.
One of our newest projects with Honeycomb(™) is error categorization for mobile builds. While still in its early stages, the results have been promising. Our main goals are:
- Deeper Insight into Build Failures: CI build failures can stem from various causes, such as compilation errors, flaky tests, or network issues. By analyzing logs and extracting specific errors, we’ve identified the top contributors to CI instability. This insight allows us to prioritize resources and address critical issues more effectively.
- Streamlining On-Call and Reducing Noise: Historically, our team was notified of every CI issue, regardless of the root cause. With error categorization, we can now classify failure types in real time and automate alerts, routing them to the appropriate team’s on-call channel. This streamlines on-call duties and minimizes interruptions. For instance, test failures now automatically notify the responsible team without requiring our intervention.
While the system is still being refined, it has already proven to be a valuable tool for improving CI management efficiency. The diagram below illustrates the architecture of our error categorization system, showcasing how we integrate Buildkite logs with Honeycomb(™) by leveraging AWS EventBridge and the Buildkite Jobs API.
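To make the classify-then-route idea concrete, here is a minimal sketch. It is an assumption-laden illustration, not the system described above: the regex patterns, category names, and channel names are all hypothetical, and the sketch does not model AWS EventBridge or the Buildkite Jobs API.

```python
import re

# Ordered rules: the first pattern that matches decides the category.
# Patterns and category names are illustrative assumptions.
RULES = [
    ("network", re.compile(r"connection (reset|timed out)|could not resolve host", re.IGNORECASE)),
    ("compilation", re.compile(r"\berror: |compilation failed", re.IGNORECASE)),
    ("test_failure", re.compile(r"\btests? failed\b", re.IGNORECASE)),
]

# Hypothetical alert destinations per category.
CHANNELS = {
    "network": "#ci-infra-oncall",
    "compilation": "#pr-author",
    "test_failure": "#owning-team-oncall",
}
DEFAULT_CHANNEL = "#mobile-builds-oncall"


def categorize(log_text):
    """Return the first matching failure category, or 'unknown' if none match."""
    for category, pattern in RULES:
        if pattern.search(log_text):
            return category
    return "unknown"


def route(category):
    """Pick the alert channel for a category, falling back to the builds team."""
    return CHANNELS.get(category, DEFAULT_CHANNEL)
```

Rule order matters here: a log line such as "error: connection reset" is classified as a network issue because that rule is checked first. Uncategorized failures fall back to the builds team, so nothing is silently dropped while the rule set is still being refined.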
While Honeycomb(™) is essential for CI observability, its applications extend beyond build metrics. Teams across Pinterest use it to gain real-time insights into performance data and tailor observability to their needs.
For instance, we track iOS local build metrics alongside machine details in Honeycomb(™), which helps us prioritize laptop upgrades for developers. Another use case involves analyzing Android Develocity build data (read more about this in another Pinterest Engineering blog post).
At Pinterest, we’re continuously improving our build processes, and Honeycomb(™) has been an essential partner in this journey. We’re excited to explore new use cases and expand our data-driven observability practices, enabling our teams to focus on delivering exceptional user experiences.
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.