Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest | by Pinterest Engineering | Pinterest Engineering Blog | Oct, 2024
Yongjun Zhang, Staff Software Engineer | William Tom, Staff Software Engineer | Sandeep Kumar, Software Engineer | Hengzhe Guo, Software Engineer
Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. At Monarch’s inception in 2016, the dominant batch processing technology on which to build the platform was Apache Hadoop YARN. Now, eight years later, we have made the decision to move off of Apache Hadoop and onto our next generation Kubernetes (K8s) based platform. These are some of the key issues we aim to address:
- Application isolation with containerization: In Apache Hadoop 2.10, YARN applications share the same common environment without container isolation. This often leads to hard-to-debug dependency conflicts between applications.
- GPU support: Node labeling support was added to Apache Hadoop YARN’s Capacity Scheduler (YARN-2496) but not to the Fair Scheduler (YARN-2497), and at Pinterest we are heavily invested in the Fair Scheduler. Upgrading to a newer Apache Hadoop version with node labeling support in the Fair Scheduler, or migrating to the Capacity Scheduler, would require tremendous engineering effort.
- Hadoop upgrade effort: In 2020, we upgraded from Apache Hadoop 2.7 to 2.10. This minor version upgrade took roughly one year. A major version upgrade to 3.x would take us significantly more time.
- Hadoop community support: The industry as a whole has been moving to K8s, and Apache Hadoop is largely in maintenance mode.
Over the past few years, we’ve observed widespread industry adoption of K8s to address these challenges. After a lengthy internal proof of concept and evaluation, we made the decision to build out our next generation K8s-based platform: Moka (Monarch on K8s).
In this blog, we’ll cover the challenges and corresponding solutions that enabled us to migrate thousands of Spark applications from Monarch to Moka while maintaining resource efficiency and a high quality of service. You can check out our previous blog, which describes how Monarch resources are managed efficiently to achieve cost savings while ensuring job stability. At the time of writing this blog, we have migrated half of the Spark workload running on Monarch to Moka. The remaining MapReduce jobs running on Monarch are actively being migrated to Spark as part of a separate initiative to deprecate MapReduce within Pinterest.
Our goal with Moka resource management was to retain the positive properties of Monarch’s resource management and improve upon its shortcomings.
First, let’s cover the things that worked well on Monarch that we wanted to bring over to Moka:
- Associating each workflow with its owner’s organization and project
- Classifying all workflows into three tiers: tier 1 (highest priority), tier 2, and tier 3, and defining the runtime SLO for each workflow
- Using hierarchical org-based-queues (OBQs) in the format of root.<org>.<project>.<tier> for resource allocation and workload scheduling
- Tracking per-application runtime data, including the application’s start and end time, memory and vCore usage, etc.
- Efficient tier-based resource allocation algorithms that automatically adjust OBQs based on the historical resource usage of the workflows assigned to the queues
- Services that can route workloads from one cluster to another, which we call cross-cluster-routing (CCR), and to their corresponding OBQs
- Onboarding queues with reserved resources to onboard new workloads
- Periodic resource allocation processes that adjust OBQs based on the resource usage data collected during the onboarding queue runs
- Dashboards to monitor resource usage and workflow runtime SLO performance
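As a small illustration of the OBQ convention above, here is a Python sketch that builds a queue path from workflow metadata. The exact spelling of the tier segment (e.g. `tier1` vs. `1`) is an assumption for illustration:

```python
def obq_path(org: str, project: str, tier: int) -> str:
    """Build a hierarchical org-based-queue (OBQ) path of the form
    root.<org>.<project>.<tier>, used for resource allocation and
    workload scheduling. The 'tierN' segment spelling is assumed."""
    if tier not in (1, 2, 3):
        raise ValueError(f"unknown tier: {tier}")
    return f"root.{org}.{project}.tier{tier}"

# Hypothetical org/project names for illustration.
print(obq_path("ads", "reporting", 1))
```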
The first step to programmatically migrating workloads from Monarch to Moka at scale is resource allocation. Most of the items listed above can be re-used as-is or easily extended to support Moka. However, there are a few additional items we would need to support resource allocation on Moka:
- A scheduler that is application-aware, allows managing cluster resources as hierarchical queues, and supports preemption
- A pipeline to export resource usage information for all applications run on Moka, and ingestion pipelines to generate insight tables
- A resource allocation job that uses the insight tables to generate the OBQ resource allocations
- An orchestration layer that is able to route workloads to their target clusters and OBQs
Now let’s go into detail on how we solved these four missing pieces.
The default K8s scheduler is a “jack of all trades, master of none” solution that isn’t particularly adept at scheduling batch processing workloads. For our resource scheduling needs, we needed a scheduler that supports hierarchical queues and is able to schedule on a per-application and per-user basis instead of a per-pod basis.
Apache YuniKorn is designed by a group of engineers with deep experience working on Apache Hadoop YARN and K8s. Apache YuniKorn not only recognizes users, applications, and queues, but also considers many other factors, such as ordering, priorities, and preemption, when making scheduling decisions.
Given that Apache YuniKorn has most of the attributes we need, we decided to use it in Moka.
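To make the per-application scheduling concrete, the sketch below shows (as a plain Python dict) how a Spark pod can be tagged so that YuniKorn treats all of an application’s pods as one unit in a specific queue. `applicationId` and `queue` are YuniKorn’s well-known pod labels; the application ID and OBQ name are illustrative:

```python
# Minimal sketch of pod metadata for YuniKorn application-aware scheduling.
# All pods sharing the same "applicationId" label are scheduled as one
# application; "queue" places the application into a hierarchical queue.
pod_metadata = {
    "labels": {
        "applicationId": "spark-app-20241001-0001",  # groups driver/executor pods
        "queue": "root.ads.reporting.tier1",         # target org-based-queue (hypothetical)
    },
}
print(pod_metadata["labels"]["queue"])
```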
As mentioned earlier, application resource usage history is critical for how we do resource allocation. At a high level, we use the historical usage of a queue as a baseline to estimate how much should be allocated going forward in each iteration of the allocation process. However, when we first made the decision to adopt Apache YuniKorn, it was missing this vital feature. Apache YuniKorn was entirely stateless and only tracked instantaneous resource usage across the cluster.
We needed a solution that could reliably track the resource consumption of all pods belonging to an application and was fault tolerant. For this, we worked closely with the Apache YuniKorn community to add support for logging resource usage information for finished applications (see YUNIKORN-1385 for more information). This feature aggregates pod resource usage per application and reports a final resource usage summary upon application completion.
This summary is logged to Apache YuniKorn’s stdout, where we use Fluent Bit to filter out the app summary logs and write them to S3.
By design, the Apache YuniKorn application summary contains the same information as the YARN application summary produced by the YARN ResourceManager (see more details in this document), so that it can fit seamlessly into existing consumers of YARN application summaries.
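To illustrate the filtering step, here is a hypothetical Python sketch of pulling application-summary records out of scheduler stdout. The `YK_APP_SUMMARY:` marker and the JSON payload fields are assumptions for illustration only; in the real pipeline, Fluent Bit performs this filtering before the summaries land in S3:

```python
import json

def parse_app_summaries(log_lines):
    """Filter application-summary records out of scheduler stdout logs.
    The marker string and payload shape are illustrative assumptions."""
    marker = "YK_APP_SUMMARY:"
    summaries = []
    for line in log_lines:
        if marker in line:
            payload = line.split(marker, 1)[1].strip()
            summaries.append(json.loads(payload))
    return summaries

logs = [
    "I1001 12:00:00 scheduler: pod bound",
    'I1001 12:00:05 YK_APP_SUMMARY: {"appId": "app-1", "memorySeconds": 1234, "vcoreSeconds": 56}',
]
print(parse_app_summaries(logs)[0]["appId"])
```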
In addition to application resource usage information, the following mapping information is automatically ingested into the insight tables:
- Workflow to project, tier, and SLO
- Project to owner
- Owner to company organization
- Workflow job to applications
This information is used to associate workloads with their target queues and to estimate the queues’ future resource requirements.
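A minimal Python sketch of joining these mapping tables to enrich one application record (table shapes, names, and values are illustrative assumptions, not our actual schema):

```python
def enrich(app, wf_to_project_tier, project_to_owner, owner_to_org):
    """Walk the mapping tables (workflow -> project/tier, project -> owner,
    owner -> org) to attach queue-routing context to an application record."""
    project, tier = wf_to_project_tier[app["workflow"]]
    owner = project_to_owner[project]
    return {**app, "project": project, "tier": tier,
            "owner": owner, "org": owner_to_org[owner]}

# Hypothetical sample data for illustration.
record = enrich(
    {"app_id": "spark-0001", "workflow": "daily_report"},
    {"daily_report": ("reporting", 1)},
    {"reporting": "alice"},
    {"alice": "ads"},
)
print(record["org"], record["tier"])
```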
Figure 1 shows the information ingestion and resource allocation flow.
To learn more about the resource allocation algorithm, see our previous blog post: Efficient Resource Management at Pinterest’s Batch Processing Platform.
Our algorithm prioritizes resource allocation to tier 1 and tier 2 queues, up to a specified percentile threshold of each queue’s required resources.
One downside to this approach is that it often requires multiple iterations of resource allocation to converge to a stable one, where each iteration requires some manual tuning of parameters.
As part of the Moka migration, we designed and implemented a brand new algorithm that leverages Constraint Programming using CP-SAT from the OR-Tools open source suite for optimization. This tool generates a model by constructing a set of constraints based on the usage/capacity ratio gap between high tier (tier 1 and 2) and low tier (tier 3) resource requests. This new resource allocation algorithm runs faster and more reliably without human intervention.
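To make the tier-prioritized idea concrete, here is a deliberately simplified Python sketch. It is not our CP-SAT implementation: queue names and numbers are illustrative, demand is assumed to already be the chosen percentile (e.g. p90) of historical usage, and the proportional tier-3 split is a simplification:

```python
def allocate(capacity, demands):
    """Greedy sketch of tier-prioritized allocation. `demands` maps
    queue -> (tier, demand). Tier 1/2 queues are satisfied first (in tier
    order); tier 3 queues split the remaining capacity proportionally."""
    alloc, remaining = {}, capacity
    # Pass 1: high tiers, in ascending tier order, capped by what is left.
    for q, (tier, demand) in sorted(demands.items(), key=lambda kv: kv[1][0]):
        if tier in (1, 2):
            grant = min(demand, remaining)
            alloc[q], remaining = grant, remaining - grant
    # Pass 2: tier 3 shares the remainder proportionally to demand.
    low = {q: d for q, (t, d) in demands.items() if t == 3}
    total = sum(low.values())
    for q, d in low.items():
        alloc[q] = remaining * d / total if total else 0.0
    return alloc

# Hypothetical cluster of 100 vCores and three queues.
print(allocate(100, {"q1": (1, 60), "q2": (3, 30), "q3": (3, 10)}))
```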
Our job submission layer, Archer, is responsible for handling all job submissions to Moka. Archer provides flexibility at the platform level to analyze, modify, and route jobs at submission time. This includes routing jobs to specific Moka clusters and queues.
Figure 2 shows how we do resource allocation with CCR for selected jobs, and the deployment process. A resource allocation change made in the git repo is automatically submitted to Archer; Archer talks to K8s to deploy the modified resource allocation ConfigMap, and then routes jobs at runtime based on the CCR rules set up in the Archer routing DB.
We plan to cover Archer and Moka in future blog posts.
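A hypothetical sketch of the rule matching behind cross-cluster-routing: at submission time, a job is matched against ordered routing rules to pick a target cluster and queue. The rule fields, cluster names, and fallback are illustrative assumptions; in practice Archer reads real rules from its routing DB:

```python
def route(job, rules, default=("legacy-cluster", None)):
    """Return (cluster, queue) for the first rule whose `match` fields all
    equal the job's attributes; fall back to `default` otherwise."""
    for rule in rules:
        if all(job.get(k) == v for k, v in rule["match"].items()):
            return rule["cluster"], rule["queue"]
    return default

# Illustrative rule: send tier-1 reporting jobs to a Moka cluster/OBQ.
rules = [
    {"match": {"project": "reporting", "tier": 1},
     "cluster": "moka-prod-1", "queue": "root.ads.reporting.tier1"},
]
print(route({"project": "reporting", "tier": 1}, rules))
```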
In addition to YUNIKORN-1385, here are some other features and fixes we contributed back to the Apache YuniKorn community:
- YUNIKORN-790: Adds support for maxApplications to limit the number of applications that can run concurrently in any queue
- YUNIKORN-2030: Fixes a bug in headroom checking that caused Apache YuniKorn to stall
- YUNIKORN-970: Adds queue metrics with queue names as tags to make queue metrics easier to track and compare
- YUNIKORN-1948: Introduces a command to validate the contents of a given queue config file
Apache YuniKorn is still a relatively young open source project, and we are continuing to work together with the community to enrich its functionality and improve its reliability and efficiency.
Once we started to onboard real production workloads onto Moka, we extended our existing Workflow SLO Performance Dashboards for Monarch to include the daily runtime results of apps running on Moka. Our goal is to ensure at least 90% of tier 1 workflows meet their SLO at least 90% of the time.
Despite having made great progress building out the Moka platform and migrating jobs from our legacy platform, there are still many improvements we have planned. To give you a teaser of what’s to come:
We are in the process of designing a stateful service that is able to leverage YUNIKORN-1628 and YUNIKORN-2115, which introduce event streaming support.
In addition, we are working on a full-featured resource management console to manage the platform’s resources. This console will enable platform administrators to monitor cluster and queue resource usage in real time, and will allow custom load balancing between clusters.
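The stated target can be checked with a few lines of Python (input shape is an illustrative assumption: each tier-1 workflow maps to a list of per-run booleans, True if that run met its SLO):

```python
def slo_goal_met(tier1_results, wf_goal=0.9, fleet_goal=0.9):
    """True if at least `fleet_goal` of tier-1 workflows each meet their
    runtime SLO in at least `wf_goal` of their runs."""
    meeting = [
        wf for wf, runs in tier1_results.items()
        if runs and sum(runs) / len(runs) >= wf_goal
    ]
    return len(meeting) / max(len(tier1_results), 1) >= fleet_goal

# Hypothetical data: one workflow meets its SLO in 9 of 10 runs.
print(slo_goal_met({"wf1": [True] * 9 + [False]}))
```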
First of all, thanks to the Apache YuniKorn community for their help and collaboration when we were evaluating Apache YuniKorn, and for working with us to submit patches we made internally back to the open source project.
Next, thanks to our great teammates, Rainie Li, Hengzhe Guo, Soam Acharya, Bogdan Pisica, and Aria Wang from the Batch Processing Platform and Data Processing Infra teams for their work on Moka. Thanks to Ang Zhang, Jooseong Kim, Chen Qin, Soam Acharya, Chunyan Wang, et al. for their support and insights while we were working on the project.
Last but not least, thanks to the Workflow Platform team and all the client teams of our platform for their feedback.
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.