Deep Multi-task Learning and Real-time Personalization for Closeup Recommendations | by Pinterest Engineering | Pinterest Engineering Blog | Jun, 2023

Haomiao Li | Software Engineer, Closeup Ranking & Blending; Travis Ebesu | Software Engineer, Closeup Ranking & Blending; Fan Jiang | Software Engineer, Closeup Candidates; Jay Adams | Software Engineer, Pinner Growth & Signals; Olafur Gudmundsson | Software Engineer, Pinner Discovery; Yan Sun | Engineering Manager, Closeup Ranking & Blending; Huizhong Duan | Engineering Manager, Closeup Relevance

At Pinterest, Closeup recommendations (aka Related Pins) is typically a feed of recommended content (primarily Pins) that we serve on any Pin closeup. Closeup recommendations generate the largest number of impressions among all recommendation surfaces at Pinterest and are uniquely critical for our users’ inspiration-to-realization journey. It’s important that we surface high-quality, relevant, and context-and-user-aware recommendations for people on Pinterest.

To achieve our goals of user engagement and satisfaction, the Closeup relevance team has been innovating and applying state-of-the-art machine learning techniques. Specifically, we have designed deep neural network (DNN) models that deeply embed multi-task predictions for user outcomes. We’ve introduced sequential features that capture a user’s most recent actions, and we employ a personalized, context-aware blending model that combines all predictions into a final score in real time. In this blog post, we will touch on:

  • How we got started on multi-task prediction
  • How we further improved multi-task prediction in our DNN architecture using Multi-gate Mixture of Experts (MMoE)
  • How we introduced teacher-student regularization to stabilize ranking model predictions
  • How we incorporated general user signals as well as real-time user sequence signals to capture users’ long-term and short-term interests
  • How we leveraged utility blending to further model users’ real-time, query-specific preferences

The Closeup “ranking” model is somewhat of a misnomer today. When it was first introduced, it was meant to be the only model that determines the ranking of Closeup recommendations. Since then, the model itself, as well as its usage, has evolved a lot. Some noteworthy changes include the use of an XGBoost model, the transition to DNN, and the adoption of AutoML¹, but most notably, the switch from single output to multi-task prediction. In this new paradigm, the “ranking” model no longer directly determines the final order of the recommendations; rather, it outputs the likelihood of different actions a user may take, including closeup, repin, click, etc. This has led to significant flexibility in optimization as well as significant improvement in prediction quality. However, we needed to “deepen” the multi-task modeling further into our DNN architecture via MMoE, so that we unleashed the potential of multi-task modeling, where each expert/task shares learnings to the maximum extent. Figure 1 is a quick view of our overall DNN architecture.

Figure 1 shows the ranking model architecture, from input features through the middle-layer structures to the final prediction output.
Figure 1: Model Architecture for Closeup Feed Ranking Model

The Closeup ranking model consists of several major components, as shown in Figure 1, including:

  • Representation layers: pre-process different types of features (embedding table lookup for categorical features; log transformation and normalization for continuous features; etc.)
    – One highlight is that we employed a transformer encoder (shown in Figure 2) to preprocess user sequence signals, context features, and candidate Pin features:
      – User’s most recent 100 engagement actions (repin, closeup, hide, etc.)
      – User’s most recent 100 engaged Pins’ PinSage embeddings
      – Context signals such as query Pin embeddings and Pinner embeddings
      – Candidate Pin embeddings
Figure 2 shows how we preprocess user sequence signals, context features, and candidate Pin features via the transformer encoder.
Figure 2: Transformer Encoder for User Sequence Signals Preprocessing
  • Summarization layer: groups similar features together (e.g. user annotations from different sources such as search queries, boards, etc.) into a single feature by passing them through an MLP, representing each feature group in a lower-dimensional latent space
  • Transformer mixer: performs self-attention over groups of features
  • MMoE: combines the results of independent “experts” to produce predictions for each task
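To make the representation layer concrete, here is a minimal sketch of the two kinds of preprocessing named above: an embedding table lookup for a categorical feature and a log transformation followed by normalization for a continuous one. The feature names (`country`, `impressions`), the table contents, and the mean/std statistics are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical embedding table for one categorical feature (values are made up).
EMBEDDING_TABLE = {
    "US": [0.1, -0.3],
    "FR": [0.4, 0.2],
}
UNKNOWN_EMBEDDING = [0.0, 0.0]  # fallback for out-of-vocabulary categories


def embed_categorical(value):
    """Embedding table lookup for a categorical feature."""
    return EMBEDDING_TABLE.get(value, UNKNOWN_EMBEDDING)


def transform_continuous(value, mean, std):
    """Log-transform a non-negative continuous feature, then standardize it.

    `mean` and `std` are assumed to be statistics of the log-transformed
    feature, computed offline on training data.
    """
    logged = math.log1p(value)
    return (logged - mean) / std


def represent(example):
    """Concatenate the preprocessed features into one dense vector."""
    dense = list(embed_categorical(example["country"]))
    dense.append(transform_continuous(example["impressions"], mean=2.0, std=1.5))
    return dense


features = represent({"country": "FR", "impressions": 99})
```

In a real model the embedding table would be learned jointly with the rest of the network rather than fixed.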

Below we highlight some of the components in more detail.

Multi-task Predictions

The tasks that the model is trying to predict are repin, closeup, click, and long-click. The model learns the probability of each task through a binary cross-entropy loss, and the loss is averaged per batch during each training step. Currently the loss weight for each task is equal, but during the data preparation stage, we apply various weight adjustments so that each training example is properly represented in the loss function. The loss function is captured below, where b = (1, … B) indexes the B examples in the batch, and h = (1, … H) indexes the H tasks.
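Written out from this description, the loss takes roughly the following form. This is a sketch rather than the exact production formula: the symbols $y_{b,h}$ (label of example b for task h), $\hat{y}_{b,h}$ (predicted probability), and $w_b$ (per-example weight from data preparation) are assumed names, and the equal task weights are folded into the plain sum over h.

```latex
\mathcal{L} \;=\; -\frac{1}{B} \sum_{b=1}^{B} \sum_{h=1}^{H}
  w_b \Big[\, y_{b,h} \log \hat{y}_{b,h} \;+\; (1 - y_{b,h}) \log\big(1 - \hat{y}_{b,h}\big) \Big]
```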

Score Regularization

In the past, we encountered model instability, where predictions from two models with the same configuration varied significantly, leading to an inconsistent user experience from unnecessary permutations in ranking order. Therefore we introduced score regularization⁴ (the formula is shown in Figure 3) to distill knowledge from the teacher model (the previous production model) and stabilize the distribution of model predictions. Inference for the teacher model is run during student model training; we add the regularization term to the total loss and tune the coefficient 𝜆 to control the weight of this term.

Figure 3 shows how to distill knowledge from the teacher model and stabilize the prediction distribution of the student model.
Figure 3: Formula of Score Regularization
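The idea of adding a teacher-student term to the total loss can be sketched as follows. The exact regularization term is the one in Figure 3; the squared difference between student and teacher predictions used here is an assumption made purely for illustration, as are the function names.

```python
import math


def binary_cross_entropy(label, prob, eps=1e-7):
    """Binary cross-entropy for one (label, predicted probability) pair."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def regularized_loss(labels, student_probs, teacher_probs, lam):
    """Task loss plus lam times a penalty on student/teacher disagreement.

    The squared-difference penalty is an illustrative assumption,
    not the published formula.
    """
    n = len(labels)
    task = sum(binary_cross_entropy(y, p) for y, p in zip(labels, student_probs)) / n
    penalty = sum((p - t) ** 2 for p, t in zip(student_probs, teacher_probs)) / n
    return task + lam * penalty


labels = [1, 0]
student = [0.8, 0.3]
teacher = [0.7, 0.3]  # predictions from the previous production model
loss_without = regularized_loss(labels, student, teacher, lam=0.0)
loss_with = regularized_loss(labels, student, teacher, lam=1.0)
```

With 𝜆 = 0 the student is trained on labels alone; larger values of 𝜆 pull its predictions toward the teacher's.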

Multi-gate Mixture of Experts (MMoE)

MMoE was originally proposed in this paper² and demonstrated the ability to explicitly learn task relationships from data, as opposed to the traditional shared-bottom model structure. The intuition is that in a shared-bottom structure, model parameters are tightly shared among tasks, so inherent conflicts among the tasks can harm the predictions for multiple tasks.

An MMoE module consists of multiple MLP experts and multiple corresponding softmax gates. Each expert in this module is an MLP that learns a specialized task representation, and each gate learns the weights to apply to the experts’ outputs for its task. The final output for a task is then the gate-weighted sum of the expert outputs, passed through a linear transformation. The placement of the MMoE module is shown in Figure 4 below:

Figure 4 shows how multiple MLP experts are used in MMoE with multiple corresponding softmax gates for different tasks.
Figure 4: Mixture of Experts Architecture
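The forward pass described above can be sketched in plain Python. This is a toy illustration under several assumptions: each expert is a single dense layer with ReLU (the real experts are deeper MLPs), weights are random, a sigmoid is applied after the final linear transformation to produce a probability, and all names and dimensions are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mmoe_forward(x, experts, gates, heads):
    """Return one probability per task plus the gate weights used.

    experts: list of (W, b) one-layer experts
    gates:   one (W, b) per task, producing one logit per expert
    heads:   one (W, b) per task, the final linear transformation
    """
    expert_outputs = [relu(linear(W, b, x)) for W, b in experts]
    probs, gate_weights = [], []
    for (gW, gb), (hW, hb) in zip(gates, heads):
        g = softmax(linear(gW, gb, x))  # task-specific mixture weights
        mixed = [sum(g[e] * out[d] for e, out in enumerate(expert_outputs))
                 for d in range(len(expert_outputs[0]))]
        logit = linear(hW, hb, mixed)[0]
        probs.append(1.0 / (1.0 + math.exp(-logit)))
        gate_weights.append(g)
    return probs, gate_weights


rng = random.Random(0)
IN_DIM, EXPERT_DIM, NUM_EXPERTS = 4, 3, 2
TASKS = ["repin", "closeup", "click", "long_click"]
experts = [init_layer(rng, EXPERT_DIM, IN_DIM) for _ in range(NUM_EXPERTS)]
gates = [init_layer(rng, NUM_EXPERTS, IN_DIM) for _ in TASKS]
heads = [init_layer(rng, 1, EXPERT_DIM) for _ in TASKS]

probs, gate_weights = mmoe_forward([0.2, -1.0, 0.5, 0.3], experts, gates, heads)
```

Because each task has its own gate, two tasks can weight the same experts very differently, which is what lets the module learn task relationships instead of forcing all tasks through one shared bottom.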

Some implementation details include:

  • Concatenating transformer mixer output to expert output: this idea is similar to ResNet, where we not only pass the output of the transformer mixer as the input to the experts and gates, but also concatenate it to the output of the experts. This helps preserve the full information from the transformer mixer and further boosts model performance.
  • Applying 20% dropout in expert layers helps to avoid model overfitting
  • Extensive parameter tuning to find the optimal set of hyperparameters: we performed a grid search on three hyperparameters [num_experts, expert_hidden_sizes, tfmr_output_dim]. From the tuning, we learned that:
    – Within a reasonable range, the more experts we use, the better the model performs offline. But in order to make sure the experts are not under-utilized, we produced Figure 5 below to visualize how each expert is specialized at modeling tasks.
Figure 5 shows how each expert is specialized at modeling tasks (repin, closeup, click, and long-click).
Figure 5: Plot of Average Weights From Gates Output

    – Simpler expert modules perform better than wider or deeper ones, e.g., [256, 256] gives better performance than [512, 512] or [256, 256, 256]. This could be because we already have a relatively large number of experts, so the experts don’t need to be complex.
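The utilization check behind Figure 5 amounts to averaging each gate's softmax weights over a set of examples. Below is a minimal sketch of that computation; the gate outputs are made-up numbers, and the `threshold` used to flag an expert as under-utilized is a hypothetical choice, not a value from the tuning described above.

```python
def average_gate_weights(gate_outputs):
    """Average gate weights over examples.

    gate_outputs[i][t][e] is the softmax weight that example i's gate for
    task t assigns to expert e. Returns avg[t][e].
    """
    n = len(gate_outputs)
    num_tasks = len(gate_outputs[0])
    num_experts = len(gate_outputs[0][0])
    return [
        [sum(ex[t][e] for ex in gate_outputs) / n for e in range(num_experts)]
        for t in range(num_tasks)
    ]


def under_utilized_experts(avg, threshold):
    """Experts whose average weight stays below threshold for every task."""
    num_experts = len(avg[0])
    return [e for e in range(num_experts) if all(row[e] < threshold for row in avg)]


# Made-up gate outputs: 2 examples, 2 tasks, 3 experts.
gate_outputs = [
    [[0.70, 0.25, 0.05], [0.20, 0.75, 0.05]],
    [[0.60, 0.35, 0.05], [0.30, 0.65, 0.05]],
]
avg = average_gate_weights(gate_outputs)
idle = under_utilized_experts(avg, threshold=0.10)
```

In this toy data the first expert dominates the first task, the second expert dominates the second task, and the third expert receives little weight from either gate.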

Here we show some offline and online results from applying MMoE to the ranking model:

  • Offline Evaluation: as shown in Table 6, for the closeup surface, we aim to improve HIT@3 and AUC for the four actions: repin (the most important one), closeup, click, and long-click, as mentioned in Figure 1
        Hits@3                                     ROC_AUC
        click    closeup  long_click  repin       click    closeup  long_click  repin
MMoE    +2.61%   +1.58%   +3.09%      +1.11%      +0.59%   +1.31%   +0.77%      +0.26%
Table 6: Offline Evaluation Metrics (relative change to baseline)
  • Online Experiment Results: as shown in Table 7, in the online A/B experiment, we observed that for overall users and for users in P5 countries (US, UK, CA, FR, and DE), repin volume increased by about 4% (3~4% in P5 countries) and closeup volume increased by 1%, aligning with the offline evaluation.
                 Total Repin Volume              Total Closeup Volume
Closeup surface  All Countries  P5 Countries     All Countries  P5 Countries
MMoE             +4%            +3~4%            +1%            +1%
Table 7: Online Experiment Metrics

After the ranking layer makes its predictions, we employ a blending layer where the order of Pin recommendations is determined. Here, we introduced another ML model, which builds upon the multi-objective optimization framework and leverages user and query Pin features to make real-time decisions on what to prioritize and how much to optimize for each objective, in order to best serve users’ needs as well as to accommodate various business requirements. Currently, the layer provides a good balance between organic content, which optimizes for organic engagement, and shopping content, which optimizes for shopping conversion.

The organic content objective is currently represented as a weighted sum of each task’s prediction from the ranking model, with hand-tuned coefficients, chosen for its Pareto optimality. Historically, the team has used Bayesian optimization techniques to tune the blending weights via online experiments. But this generic approach lacks robustness: we need to re-tune the weights every time the ranking model’s score distribution shifts, and the feedback loop is long. Therefore, we introduced a model-based approach to learn personalized weights, which we call Learned Utility.
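The weighted-sum objective can be sketched in a few lines. The coefficients, Pin ids, and prediction values below are invented for illustration and do not reflect production settings.

```python
def utility(predictions, weights):
    """Weighted sum of per-task predictions for one candidate Pin."""
    return sum(weights[task] * predictions[task] for task in weights)


def blend(candidates, weights):
    """Order candidate Pins by descending utility."""
    return sorted(candidates, key=lambda c: utility(c["predictions"], weights), reverse=True)


# Illustrative coefficients and predictions; none of these values come from production.
weights = {"repin": 2.0, "closeup": 1.0, "click": 0.5, "long_click": 1.0}
candidates = [
    {"pin_id": "a", "predictions": {"repin": 0.10, "closeup": 0.30, "click": 0.20, "long_click": 0.05}},
    {"pin_id": "b", "predictions": {"repin": 0.25, "closeup": 0.10, "click": 0.10, "long_click": 0.02}},
]
ranked = [c["pin_id"] for c in blend(candidates, weights)]
```

Because the final order depends on the products of weights and predicted probabilities, a shift in the ranking model's score distribution changes the effective trade-off even when the coefficients stay fixed, which is why hand-tuned weights need re-tuning.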

Learned Utility Model

We formulate learning these optimal blender parameters (the policy) as an offline supervised learning problem. For a slice of users, we randomly vary their blender parameters and log the corresponding outcome. Next, we define a reward function that assigns a value to the engagement we observed (e.g. closeup reward = 1 and hide reward = -2). Then we learn a model that predicts the expected reward for a given request. We use a model that can be factored, permitting access to the learned optimal blender parameters, as shown in Figure 8. At serving time, we use only the part of the model that predicts the optimal blender parameters, as shown in Figure 9.

Figure 8 shows how logged user features and blender parameters are used for model training.
Figure 8: Training process of the Learned Utility Model
Figure 9 shows how serving inference is done using the online user features.
Figure 9: Serving the Learned Utility model

More formally, Learned Utility attempts to find a set of blending parameters w₁, … , wₙ that optimizes a given reward, R. We can formulate this as a binary classification task with a reward-weighted cross-entropy loss, denoted R * l(g(x, r), y). Each training instance consists of (R, x, r, y), where x denotes the user, context, and query level features; r denotes the randomized blender parameters that led to the user’s engagement behavior y, resulting in the reward R; and g(x, r) is our model. Our model is parameterized via a multi-layer perceptron f(x) = w₁, … , wₙ. To calculate the reward of the predicted blender parameters, we compute their inner product with the randomized blender parameters, i.e. g(x, r) = σ(rᵀf(x) + b), where b is a learnable global bias and σ is the logistic sigmoid function. This formulation allows us to factorize the model g(.) and obtain our desired blender parameters f(.).
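The factorization of g into f can be sketched as follows. As a simplifying assumption, a single linear layer stands in for the multi-layer perceptron f, and all numeric values are toy inputs chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def f(x, W):
    """Predict blender parameters w_1..w_n from features x.

    A single linear layer stands in for the multi-layer perceptron.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def g(x, r, W, b):
    """g(x, r) = sigmoid(r^T f(x) + b)."""
    return sigmoid(sum(ri * wi for ri, wi in zip(r, f(x, W))) + b)


def reward_weighted_loss(R, x, r, y, W, b, eps=1e-7):
    """R * l(g(x, r), y), with l the binary cross-entropy."""
    p = min(max(g(x, r, W, b), eps), 1.0 - eps)
    return R * -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Toy shapes: 3 input features, n = 2 blender parameters.
W = [[0.5, 0.0, -0.5], [0.1, 0.2, 0.3]]
b = 0.0
x = [1.0, 2.0, 3.0]
r = [0.2, 1.0]  # randomized blender parameters logged for this request

loss = reward_weighted_loss(R=1.0, x=x, r=r, y=1, W=W, b=b)
serving_weights = f(x, W)  # at serving time only f is evaluated
```

Training needs the logged r to evaluate g, but serving does not: once f is trained, its output is used directly as the personalized blender parameters.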

Noise introduced during the collection of data under the randomized logging policy makes it difficult for the model to properly learn a good set of parameters. Therefore we place informative Gaussian priors on our blender parameters, wᵢ ~ N(sᵢ, σᵢ²), where sᵢ denotes the iᵗʰ known production parameter and σᵢ² is a hyperparameter that controls the variance. Performing MAP estimation gives us an equivalent L2 regularizer, leading to our final objective
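Written out with the definitions above, the objective has the following form. This is a reconstruction from the surrounding text rather than a verbatim formula: the sum over logged training instances and the placement of the regularizer inside that sum are assumptions, with $w = f(x)$.

```latex
\min_{f,\, b} \;\; \sum_{(R,\, x,\, r,\, y)}
  \Big[\, R \cdot l\big(g(x, r),\, y\big) \;+\; \sum_{i=1}^{n} \lambda_i \,\big(w_i - s_i\big)^2 \Big],
  \qquad w = f(x)
```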

where we simplify 𝜆ᵢ = 1/(2σᵢ²), and in experiments we use a global 𝜆 = 2.

Online Experiment Results

The results shown below come from our online A/B experiment for the closeup stream surface ranking and blending stage. This is the stream experience triggered when a user closes up on a natively published video Pin³. The key metrics for this surface are 10s full screen view (FSV), duration, and time spent, and as Table 10 shows, we have seen significant improvements in these metrics.

10s FSV   Total Duration   Reactions   Engaged Stream Sessions
+6.97%    +2~4%            +4~9%       +1~2%
Table 10: Online Experiment Results for Learned Utility

Our work of adopting and innovating upon multi-task learning with advanced features and state-of-the-art model architecture in the Closeup recommendation system has effectively improved the quality of content and led to significant gains in Pinners’ engagement.

As for next steps, we’re working cross-team on:

  • Adopting a richer and longer real-time user sequence signal
  • Improving GPU model serving performance
  • Model architecture iterations
  • Adoption of Learned Utility in other surfaces such as Homefeed

This work is the result of collaboration across multiple teams at Pinterest.

Many thanks to the following people who contributed to this work:

Closeup team: Minzhen Yi, Bo Fu, Chen Chen

ATG team: Yi-Ping Hsu, Paul Baltecsu, Pong Eksombatchai, Jiajing Xu

ML Platform team: Nazanin Farahpour, Se Won Jang, Zhiyuan Zhang

User Sequence Support team: Zefan Fu, Shun-ping Chiu, Jisong Liu, Yitong Zhou, Jiacheng Hong

Homefeed team: Yaron Greif, Ruimin Zhu

Core Serving Infra team: Kent Jiang, Zheng Liu

Search team: Cosmin Negruseri

¹E. Wang, “How we use AutoML, Multi-task learning and Multi-tower models for Pinterest Ads”

²J. Ma, et al. “Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts”, KDD 2018, August 19–23, 2018

³“Pinterest introduces Idea Pins globally and launches new creator discovery features”

https://newsroom.pinterest.com/en/post/pinterest-introduces-idea-pins-globally-and-launches-new-creator-discovery-features

⁴R. Li, et al. “Stabilizing Neural Search Ranking Models”, WWW 2020

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore life at Pinterest, visit our Careers page.