Optimizing RTC bandwidth estimation with machine learning

  • Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps.
  • We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.
  • We’re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.

Our existing bandwidth estimation (BWE) module at Meta is based on WebRTC’s Google Congestion Controller (GCC). We have made several improvements through parameter tuning, but this has resulted in a more complex system, as shown in Figure 1.

Figure 1: BWE module’s system diagram for congestion control in RTC.

One challenge with the tuned congestion control (CC)/BWE algorithm was that it had multiple parameters and actions that were dependent on network conditions. For example, there was a trade-off between quality and reliability; improving quality for high-bandwidth users often led to reliability regressions for low-bandwidth users, and vice versa, making it difficult to optimize the user experience across different network conditions.

Additionally, we noticed some inefficiencies in improving and maintaining the complex BWE module:

  1. Due to the absence of realistic network conditions during our experimentation process, fine-tuning the parameters for user clients required multiple attempts.
  2. Even after the rollout, it wasn’t clear whether the optimized parameters were still applicable to the targeted network types.
  3. This resulted in complex code logic and branches for engineers to maintain.

To solve these inefficiencies, we developed a machine learning (ML)-based, network-targeting approach that offers a cleaner alternative to hand-tuned rules. This approach also allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.

Network characterization

An ML model-based approach leverages time series data to improve bandwidth estimation by using offline parameter tuning for characterized network types.

For an RTC call to be established, the endpoints must be connected to each other through network devices. The optimal configs that have been tuned offline are stored on the server and can be updated in real time. During the call connection setup, these optimal configs are delivered to the client. During the call, media is transferred directly between the endpoints or through a relay server. Depending on the network signals collected during the call, an ML-based approach characterizes the network into different types and applies the optimal configs for the detected type.
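The per-type config delivery can be sketched as a simple lookup. Everything below (the `BweConfig` fields, the type names, and the numeric values) is an illustrative assumption, not Meta’s actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of server-side config delivery. Field names and
# values are assumptions for illustration only.
@dataclass
class BweConfig:
    loss_tolerance_pct: float   # tolerated random loss before backing off
    rampup_factor: float        # multiplicative ramp-up speed
    fec_ratio: float            # forward-error-correction overhead

# Optimal configs tuned offline, keyed by characterized network type.
CONFIGS = {
    "random_loss": BweConfig(loss_tolerance_pct=10.0, rampup_factor=1.5, fec_ratio=0.2),
    "bursty_loss": BweConfig(loss_tolerance_pct=2.0, rampup_factor=1.1, fec_ratio=0.4),
    "default":     BweConfig(loss_tolerance_pct=5.0, rampup_factor=1.2, fec_ratio=0.1),
}

def config_for(network_type: str) -> BweConfig:
    """Return the tuned config for the detected type, falling back to default."""
    return CONFIGS.get(network_type, CONFIGS["default"])
```

During a call, the client would re-run `config_for` whenever the characterization model reports a new network type.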

Figure 2 illustrates an example of an RTC call that’s optimized using the ML-based approach.

Figure 2: An example RTC call configuration with optimized parameters delivered from the server based on the current network type.

Model learning and offline parameter tuning

At a high level, network characterization consists of two main components, as shown in Figure 3. The first component uses offline ML model learning to categorize the network type (random packet loss versus bursty loss). The second component uses offline simulations to tune parameters optimally for the categorized network type.

Figure 3: Offline ML-model learning and parameter tuning.

For model learning, we leverage time series data (network signals and non-personally identifiable information; see Figure 6, below) from production calls and simulations. Compared to the aggregate metrics logged after the call, time series data captures the time-varying nature of the network and its dynamics. We use FBLearner, our internal AI stack, for the training pipeline and deliver the PyTorch model files on demand to the clients at the start of the call.

For offline tuning, we use simulations to run network profiles for the detected types and choose the optimal parameters for the modules based on improvements in technical metrics (such as quality, freeze, etc.).
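A minimal sketch of this offline tuning loop, assuming a grid search over candidate parameters: `simulate` stands in for the network simulator and its metric output, and all names and values here are hypothetical placeholders rather than the production tuning system:

```python
import itertools

def simulate(profile: dict, params: dict) -> dict:
    # Placeholder simulator: in practice this would replay a network
    # profile and report technical metrics (quality, freeze, etc.).
    # Here we fake a quality score that peaks at a profile-specific optimum.
    target = profile["best_rampup"]
    return {"quality": 1.0 - abs(params["rampup_factor"] - target)}

def tune(profile: dict, grid: dict) -> dict:
    """Pick the parameter combination with the best simulated quality metric."""
    best_params, best_quality = None, float("-inf")
    for rampup, tolerance in itertools.product(
        grid["rampup_factor"], grid["loss_tolerance"]
    ):
        params = {"rampup_factor": rampup, "loss_tolerance": tolerance}
        quality = simulate(profile, params)["quality"]
        if quality > best_quality:
            best_params, best_quality = params, quality
    return best_params
```

The winning `params` for each characterized network type would then be stored server-side as the optimal config for that type.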

Model architecture

From our experience, we’ve found that it’s critical to combine time series features with non-time series features (i.e., metrics derived from the time window) for highly accurate modeling.

To handle both time series and non-time series data, we’ve designed a model architecture that can process input from both sources.

The time series data passes through a long short-term memory (LSTM) layer that converts the time series input into a one-dimensional vector representation, such as 16×1. The non-time series, or dense, data passes through a dense layer (i.e., a fully connected layer). The two vectors are then concatenated, to fully represent the past network condition, and passed through another fully connected layer. The final output of the neural network model is the predicted output of the target task, as shown in Figure 4.

Figure 4: Combined-model architecture with LSTM and dense layers.
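The combined architecture can be sketched in PyTorch. The feature counts, hidden size, and single-output head below are assumptions for illustration; only the overall shape (LSTM summary vector concatenated with a dense embedding, then a fully connected head) follows the description above:

```python
import torch
import torch.nn as nn

class CombinedModel(nn.Module):
    """Sketch of the combined architecture: an LSTM summarizes the time
    series into a 16-dimensional vector, a dense layer embeds the derived
    features, and the concatenation feeds a fully connected head.
    All dimensions are assumed values, not the production ones."""

    def __init__(self, ts_features: int = 5, dense_features: int = 8, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=ts_features, hidden_size=hidden, batch_first=True)
        self.dense = nn.Sequential(nn.Linear(dense_features, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # e.g., P(random packet loss)

    def forward(self, ts: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
        # ts: (batch, seq_len, ts_features); dense: (batch, dense_features)
        _, (h_n, _) = self.lstm(ts)      # h_n: (1, batch, hidden)
        ts_vec = h_n.squeeze(0)          # (batch, hidden) summary of the window
        dense_vec = self.dense(dense)    # (batch, hidden) embedding of derived metrics
        combined = torch.cat([ts_vec, dense_vec], dim=1)
        return torch.sigmoid(self.head(combined))
```

For a classification task such as random-loss detection, the sigmoid output can be thresholded; a multi-class head would swap the final layer for a softmax over network types.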

Use case: Random packet loss classification

Let’s consider the use case of categorizing packet loss as either random or congestion-induced. The former is due to network components, and the latter to limits in queue length (which are delay dependent). Here is the ML task definition:

Given the network conditions in the past N seconds (10), and that the network is currently incurring packet loss, the goal is to characterize the packet loss at the current timestamp as RANDOM or not.

Figure 5 illustrates how we leverage the architecture to achieve that goal:

Figure 5: Model architecture for a random packet loss classification task.

Time series features

We leverage the following time series features gathered from logs:

Figure 6: Time series features used for model training.

BWE optimization

When the ML model detects random packet loss, we perform local optimization on the BWE module by:

  • Increasing the tolerance to random packet loss in the loss-based BWE (holding the bitrate).
  • Increasing the ramp-up speed, depending on the link capacity, on high bandwidths.
  • Increasing network resiliency by sending additional forward-error-correction packets to recover from packet loss.
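The three adjustments above can be sketched as a single state update. The thresholds, field names, and multipliers here are illustrative assumptions, not the production values:

```python
# Hypothetical sketch of the local BWE adjustments applied when the model
# classifies ongoing loss as random rather than congestion-induced.
def adjust_bwe(state: dict, is_random_loss: bool) -> dict:
    """Return adjusted BWE parameters given the current state."""
    adjusted = dict(state)
    if is_random_loss:
        # 1. Tolerate more loss before the loss-based estimator cuts bitrate.
        adjusted["loss_tolerance_pct"] = max(state["loss_tolerance_pct"], 10.0)
        # 2. Ramp up faster on high-bandwidth links (capacity threshold assumed).
        if state["link_capacity_kbps"] > 2000:
            adjusted["rampup_factor"] = state["rampup_factor"] * 1.5
        # 3. Add FEC overhead to recover from the random loss, with a cap.
        adjusted["fec_ratio"] = min(state["fec_ratio"] + 0.1, 0.5)
    return adjusted
```

When the classifier reports congestion-induced loss instead, the state passes through unchanged and the standard GCC back-off applies.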

Network prediction

The network characterization problem discussed in the previous sections focuses on classifying network types based on past information using time series data. For those simple classification tasks, we can get by with hand-tuned rules, albeit with some limitations. The real power of leveraging ML for networking, however, comes from using it to predict future network conditions.

We have applied ML to congestion-prediction problems in order to optimize low-bandwidth users’ experience.

Congestion prediction

From our analysis of production data, we found that low-bandwidth users often incur congestion due to the behavior of the GCC module. By predicting this congestion, we can improve the reliability of such users’ experience. Toward this, we addressed the following problem statement using round-trip time (RTT) and packet loss:

Given the historical time-series data from production/simulation (“N” seconds), the goal is to predict packet loss due to congestion, or the congestion itself, in the next “N” seconds; that is, a spike in RTT followed by a packet loss or a further growth in RTT.

Figure 7 shows an example from a simulation where the bandwidth alternates between 500 Kbps and 100 Kbps every 30 seconds. As we lower the bandwidth, the network incurs congestion and the ML model’s predictions fire (the green spikes) even before the delay spikes and packet loss occur. This early prediction of congestion enables faster reactions, improving the user experience by preventing video freezes and connection drops.

Figure 7: Simulated network scenario with alternating bandwidth for congestion prediction.

Generating training samples

The main challenge in modeling is generating training samples for a variety of congestion situations. With simulations, it’s harder to capture the different kinds of congestion that real user clients would encounter in production networks. As a result, we used actual production logs for labeling congestion samples, following the RTT-spike criteria in the past and future windows, according to the following assumptions:

  • Absent past RTT spikes, packet losses in the past and future are independent.
  • Absent past RTT spikes, we cannot predict future RTT spikes or fractional losses (i.e., flosses).

We split the time window into past (four seconds) and future (four seconds) for labeling.

Figure 8: Labeling criteria for congestion prediction.
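Under the assumptions above, the labeling rule can be sketched as follows. The RTT spike threshold and the exact predicate are hypothetical; only the past/future window split and the "no past spike, no positive label" rule come from the text:

```python
# Hypothetical labeling sketch: a sample is a positive congestion example
# when the 4-second past window contains an RTT spike AND the 4-second
# future window contains a further RTT spike or packet loss. The spike
# threshold below is an assumed value.
RTT_SPIKE_MS = 300

def label_sample(past_rtt_ms, future_rtt_ms, future_loss) -> int:
    """Return 1 (congestion) or 0 from the past/future windows."""
    past_spike = max(past_rtt_ms) >= RTT_SPIKE_MS
    future_spike = max(future_rtt_ms) >= RTT_SPIKE_MS
    future_has_loss = any(future_loss)
    # Per the stated assumptions, without a past RTT spike we treat future
    # spikes and losses as unpredictable, so the sample is labeled negative.
    return int(past_spike and (future_spike or future_has_loss))
```

Running this rule over sliding windows of production logs would yield the positive and negative training samples.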

Model performance

Unlike network characterization, where ground truth is unavailable, here we can obtain ground truth by examining the future time window after it has passed and then comparing it with the prediction made four seconds earlier. With this logging information gathered from real production clients, we compared the performance of offline training against online data from user clients:

Figure 9: Offline versus online model performance comparison.
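A minimal sketch of this delayed ground-truth evaluation: each prediction made at time t is scored against the label observed once the four-second future window has passed. The pairing of events and the metric choice are assumptions for illustration:

```python
# Hypothetical evaluation sketch: score predictions against ground truth
# that arrives four seconds later, once the future window has elapsed.
def precision_recall(events):
    """events: list of (predicted_congestion, observed_congestion) pairs."""
    tp = sum(1 for pred, obs in events if pred and obs)
    fp = sum(1 for pred, obs in events if pred and not obs)
    fn = sum(1 for pred, obs in events if not pred and obs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing these metrics on both offline training data and online client logs gives the offline-versus-online comparison shown in Figure 9.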

Experiment results

Here are some highlights from our deployment of various ML models to improve bandwidth estimation:

Reliability wins for congestion prediction

✅ connection_drop_rate -0.326371 +/- 0.216084
✅ last_minute_quality_regression_v1 -0.421602 +/- 0.206063
✅ last_minute_quality_regression_v2 -0.371398 +/- 0.196064
✅ bad_experience_percentage -0.230152 +/- 0.148308
✅ transport_not_ready_pct -0.437294 +/- 0.400812

✅ peer_video_freeze_percentage -0.749419 +/- 0.180661
✅ peer_video_freeze_percentage_above_500ms -0.438967 +/- 0.212394

Quality and user engagement wins for random packet loss characterization in high bandwidth

✅ peer_video_freeze_percentage -0.379246 +/- 0.124718
✅ peer_video_freeze_percentage_above_500ms -0.541780 +/- 0.141212
✅ peer_neteq_plc_cng_perc -0.242295 +/- 0.137200

✅ total_talk_time 0.154204 +/- 0.148788

Reliability and quality wins for cellular low-bandwidth classification

✅ connection_drop_rate -0.195908 +/- 0.127956
✅ last_minute_quality_regression_v1 -0.198618 +/- 0.124958
✅ last_minute_quality_regression_v2 -0.188115 +/- 0.138033

✅ peer_neteq_plc_cng_perc -0.359957 +/- 0.191557
✅ peer_video_freeze_percentage -0.653212 +/- 0.142822

Reliability and quality wins for cellular high-bandwidth classification

✅ avg_sender_video_encode_fps 0.152003 +/- 0.046807
✅ avg_sender_video_qp -0.228167 +/- 0.041793
✅ avg_video_quality_score 0.296694 +/- 0.043079
✅ avg_video_sent_bitrate 0.430266 +/- 0.092045

Future plans for applying ML to RTC

From our project execution and experimentation on production clients, we noticed that an ML-based approach is more efficient in targeting, end-to-end monitoring, and updating than traditional hand-tuned rules for networking. However, the efficiency of ML solutions largely depends on data quality and labeling (using simulations or production logs). By applying ML-based solutions to network prediction problems, congestion in particular, we fully leveraged the power of ML.

In the future, we will be consolidating all the network characterization models into a single model using a multi-task approach, to fix the inefficiency caused by redundancy in model download, inference, and so on. We will be building a shared representation model for the time series to solve different tasks (e.g., bandwidth classification, packet loss classification, etc.) in network characterization. We will focus on building realistic production network scenarios for model training and validation. This will enable us to use ML to identify optimal network actions given the network conditions. We will continue refining our learning-based methods to enhance network performance.