Higher video for cell RTC with AV1 and HD

Higher video for cell RTC with AV1 and HD
Higher video for cell RTC with AV1 and HD
  • At Meta, we help real-time communication (RTC) for billions of individuals by our apps, together with Messenger, Instagram, and WhatsApp.
  • We’ve seen vital advantages by adopting the AV1 codec for RTC.
  • Right here’s how we’re enhancing the RTC video high quality for our apps with instruments just like the AV1 codec, the challenges we face, and the way we mitigate these challenges.

The previous few a long time have seen great enhancements in cell phone digital camera high quality in addition to video high quality for streaming video companies. But when we have a look at real-time communication (RTC) purposes, whereas the video high quality additionally has improved over time, it has at all times lagged behind that of digital camera high quality. 

After we checked out methods to enhance video high quality for RTC throughout our household of apps, AV1 stood out as the best choice. Meta has more and more adopted the AV1 codec over time as a result of it affords excessive video high quality at bitrates a lot decrease than older codecs. However, as we’ve implemented AV1 for mobile RTC, we’ve additionally needed to tackle plenty of challenges together with scaling,  enhancing video high quality for low-bandwidth customers in addition to high-end networks, CPU and battery utilization, and sustaining high quality stability.

Bettering video high quality for low-bandwidth networks

This submit goes to deal with peer-to-peer (P2P, or 1:1) calls, which contain two contributors. 

Individuals who use our services and products expertise a variety of community circumstances – some have actually nice networks, whereas others are utilizing throttled or low-bandwidth networks.

This chart illustrates what the distribution of bandwidth appears like for a few of these calls on Messenger:

Determine 1: Bandwidth distribution of P2P calls on Messenger.

As seen in Determine 1, some calls function in very low-bandwidth circumstances. 

We think about something lower than 300 Kbps to be a low-end community, however we additionally see a number of video calls working at simply 50 Kbps, and even beneath 25 Kbps.

Notice that this bandwidth is the share for the video encoder. Whole bandwidth is shared with audio, RTP overhead, signaling overhead, RTX (re-transmissions of packets to deal with misplaced packets)/FEC (ahead error correction)/duplication (packet duplication), and so forth. The massive assumption right here is that the bandwidth estimator is working appropriately and estimating true bitrates. 

There aren’t any common definitions for low, mid, and excessive networks, however for the aim of this weblog submit, lower than 300 Kbps can be thought of as low, 300-800 Kbps as mid, and above 800 Kbps as a excessive, HD-capable, or high-end community.

After we appeared into enhancing the video high quality for low-bandwidth customers, there have been few key choices. Migrating to a more recent codec comparable to AV1 offered the best alternative. Different choices comparable to higher video scalers and region-of-interest encoding supplied incremental enhancements. 

Video scalers

We use WebRTC in most of our apps, however the video scalers shipped with WebRTC don’t have the very best quality video scaling. We’ve been in a position to enhance the video scaling high quality considerably by leveraging in-house scalers. 

At low bitrates, we regularly find yourself downscaling the video to encode at ¼ decision (assuming the digital camera seize is 640×480 or 1280×720). With our customized scaler implementations, we now have seen vital enhancements in video high quality. From public assessments we noticed positive aspects in peak signal-to-noise ratio (PSNR) by 0.75 db on common.

Here’s a snapshot exhibiting outcomes with the default libyuv scaler (a field filter):

Determine 2.a: Video picture outcomes utilizing WebRTC/libyuv video scaler.

And the outcomes after downscaling with our video scaler:

Determine 2.b: Video picture outcomes utilizing Meta’s video scaler.

Area-of-interest encoding

Figuring out the area of curiosity (ROI) allowed us to optimize by spending extra encoder bitrate within the space that’s most essential to a viewer (the speaker’s face in a speaking head video, for instance). Most cell units have APIs to find the face area with out using any CPU overhead. As soon as we now have discovered the face area we will configure the encoder to spend extra bits on this essential area and fewer on the remainder. The simplest means to do that was to have some APIs on encoders to configure the quantization parameters (QP) for ROI versus the remainder of the picture. These adjustments offered incremental enhancements within the video high quality metrics like PSNR. 

Adopting the AV1 video codec

The video encoder is a key ingredient in relation to video high quality for RTC. H.264 has been the preferred codec during the last decade, with {hardware} help and most purposes supporting it. However it’s a 20-year-old codec. Again in 2018, the Alliance for Open Media (AOMedia) standardized the AV1 video codec. Since then, a number of firms together with Meta, YouTube, and Netflix have deployed it at a large scale for video streaming

At Meta, transferring from H.264 to AV1 led us to our best enhancements in video high quality at low bitrates.

Determine 3: Enhancements over time, transferring from H.262 to AV1 and H.266

Why AV1?

We selected to make use of AV1 partly as a result of it’s royalty-free. Codec licensing (and concurrent charges) was an essential facet in our decision-making course of. Sometimes, if an software makes use of a tool’s {hardware} codec, no extra codec licensing prices can be incurred. But when an software is delivery a software program model of the codec,  there’ll most definitely be licensing prices to cowl.

However why do we have to use software program codecs though most telephones have hardware-supported codecs?

Most cell units have devoted {hardware} for video encoding and decoding. And as of late most cell units help H.264 and even H.265s. However these encoders are designed for widespread use instances comparable to digital camera seize, which makes use of a lot larger resolutions, body charges, and bitrates. Most cell gadget {hardware} is presently able to encoding 4K 60 FPS in actual time with very low battery utilization, however the outcomes of encoding a 7 FPS, 320×180, 200 Kbps video are sometimes worse than these of software program encoders operating on the identical cell gadget. 

The rationale for that’s prioritization of the RTC use case. Most unbiased {hardware} distributors (IHVs) are usually not conscious of the community circumstances the place RTC calls function; therefore, these {hardware} codecs are usually not optimized for RTC eventualities, particularly for low bitrates, resolutions, and body charges. So, we leverage software program encoders when working in these low bitrates to supply high-quality video.

And since we will’t ship software program codecs with no license, AV1 is an excellent possibility for RTC.

AV1 for RTC

The most important purpose to maneuver to a extra superior video codec is straightforward: The identical high quality expertise could be delivered with a a lot decrease bitrate, and we will ship a a lot higher-quality real-time calling expertise for our customers who’re on bandwidth-constrained networks.

Measuring video high quality is a fancy subject, however a comparatively easy means to have a look at it’s to make use of the Bjontegaard Delta-Bit Rate (BD-BR) metric. BD-BR compares how a lot bitrate varied codecs want to provide a sure high quality degree. By producing a number of samples at completely different bitrates, measuring the standard of the produced video offers a rate-distortion (RD) curve, and from the RD curve you may derive the BD-BR (as proven under).

As could be seen in Determine 4, AV1 offered larger high quality for all bitrate ranges in our native assessments.

Determine 4: Bitrate distortion comparability chart.

Display-encoding instruments

AV1 additionally has just a few key instruments which can be helpful for RTC. Display content material high quality is changing into an more and more essential issue for Meta, with related use instances, together with display screen sharing, recreation streaming, and VR distant desktop, requiring high-quality encoding. In these areas, AV1 actually shines. 

Historically, video encoders aren’t properly suited to advanced content material comparable to textual content with a number of high-frequency content material, and people are delicate to studying blurry textual content. AV1 has a set of coding instruments—palette mode and intra-block copy—that drastically enhance efficiency for display screen content material. Palette mode is designed in accordance with the commentary that the pixel values in a screen-content body often consider the restricted variety of coloration values. Palette mode can symbolize the display screen content material effectively by signaling the colour clusters as an alternative of the quantized transform-domain coefficients. As well as, for typical display screen content material, repetitive patterns can often be discovered throughout the similar image. Intra-block copy facilitates block prediction throughout the similar body, in order that the compression effectivity could be improved considerably. That AV1 offers these two instruments on the baseline profile is a large plus.

Reference image resampling: Fewer key frames

One other helpful characteristic is reference image resampling (RPR), which permits decision adjustments with out producing a key body. In video compression, a key body is one which’s encoded independently, like a nonetheless picture. It’s the one kind of body that may be decoded with out having one other body as reference. 

For RTC purposes, for the reason that bandwidth retains on altering typically, there are frequent decision adjustments wanted to adapt to those community adjustments. With older codecs like H.264, every of those decision adjustments requires a key body that’s a lot bigger in dimension and thus inefficient for RTC apps. Such giant key frames enhance the quantity of knowledge needing to be despatched over the community and lead to larger end-to-end latencies and congestion. 

Through the use of RPR, we will keep away from producing any key frames.

Challenges round enhancing video high quality for low-bandwidth customers

CPU/battery utilization

AV1 is nice for coding effectivity, however codecs obtain this at the price of larger CPU and battery utilization. A variety of trendy codecs pose these challenges when operating real-time purposes on cell platforms.

Primarily based on native lab testing, we anticipated a roughly 4 p.c enhance in battery utilization, and we noticed related leads to public assessments. We used an influence meter to do that native battery measurement.

Although the AV1 encoder itself elevated CPU utilization three-fold when in comparison with H.264 implementation, the general contribution of CPU utilization from the encoder was a small a part of the battery utilization. The cellphone show display screen, networking/radio, and different processes utilizing the CPU contribute considerably to battery utilization, therefore the rise in battery utilization was 5-6 p.c (a big enhance in battery utilization). 

A variety of calls run out of gadget battery, or individuals grasp up as soon as their working system signifies a low battery, so rising battery utilization isn’t worthwhile for customers until it offers elevated worth comparable to video high quality enchancment. Even then it’s a trade-off between video high quality versus battery use.

We use WebRTC and Session Description Protocol (SDP) for codec negotiation, which permits us to barter a number of codecs (e.g., AV1 and H.264) up entrance after which swap the codecs with none want for signaling or a handshake in the course of the name. This implies the codec swap is seamless, with out customers noticing any glitches or pauses in video.

We created a customized encoder that encapsulates each H.264 and the AV1 encoders. We name it a hybrid encoder. This allowed us to modify the codec in the course of the name primarily based on triggers comparable to CPU utilization, battery degree, or encoding time — and to modify to the extra battery-efficient H.264 encoder when wanted. 

Elevated crashes and out of reminiscence errors

Even with out new leaks added, AV1 used extra reminiscence than H.264. Any time extra reminiscence is used, apps usually tend to hit out of reminiscence (OOM) crashes or hit OOM sooner due to different leaks or reminiscence calls for on the system from different apps. To mitigate this, we needed to disable AV1 on units with low reminiscence. That is one space for enchancment and for additional optimizing the encoder’s reminiscence utilization.

In-product high quality measurement

To match the standard between H.264 and AV1 utilizing public assessments, we wanted a low-complexity metric. Metrics comparable to encoded bitrates and body charges gained’t present any positive aspects as the entire bandwidth accessible to ship video continues to be the identical, as these are restricted by the community capability, which implies the bitrates and body charges for video won’t change a lot with the change within the codec. We had been utilizing composite metrics that mix the quantization parameter (QP is commonly used as a proxy for video high quality, as this introduces pixel knowledge loss in the course of the encoding course of), resolutions, and body price, and freezes it to generate video composite metrics, however QP just isn’t comparable between AV1 and H.264 codecs, and therefore can’t be used.

PSNR is a regular metric, nevertheless it’s reference-based and therefore doesn’t work for RTC. Non-reference, video-quality metrics are fairly CPU-intensive (e.g., BRISQUE: Blind/Referenceless Picture Spatial High quality Evaluator), although we’re exploring these as properly.

Determine 6: Excessive-level structure for PSNR computation in RTC.

We’ve provide you with a framework for PSNR computation. We first modified the encoder to report distortions brought on by compression (most software program encoders have already got help for this metric). Then we designed a light-weight, scaling-distortion algorithm that estimates the distortion launched by video scaling. This algorithm can mix these scaling distortions with the encoder distortions to provide output PSNR. We developed and verified this algorithm regionally and can be sharing the findings in publications and at tutorial conferences over the subsequent yr. With this light-weight PSNR metric, we noticed 2 db enhancements with AV1 in comparison with H.264.

Challenges round enhancing video high quality for high-end networks

As a fast overview: For our functions, excessive bandwidth covers customers for whom bandwidth is bigger than 800 kbps. 

Over time, there have been enormous enhancements in digital camera seize high quality. Because of this, individuals’s expectations have gone up, and so they need to see RTC video high quality on par with native digital camera seize high quality. 

Primarily based on native testing, we settled on settings leading to video high quality that appears just like that of digital camera recordings. We name this HD mode. We discovered that with a video codec like H.264 encoding at 3.5 Mbps and 30 frames per second, 720p decision appeared similar to native digital camera recordings. We additionally in contrast 720p to 1080p in subjective high quality assessments and located that the distinction just isn’t noticeable on most units aside from these with a bigger display screen once we carried out subjective high quality assessments.

Bandwidth estimator enhancements

Bettering the video high quality for customers who’ve high-end telephones with good CPUs, good batteries, {hardware} codecs, and good community speeds appears trivial. It might look like all it’s a must to do is enhance the utmost bitrate, seize decision, and seize body charges, and customers will ship high-quality video. However, in actuality, it’s not that straightforward. 

If you happen to enhance the bitrate, you expose your bandwidth estimation and congestion detection algorithm to hit congestion extra typically, and your algorithm can be examined many extra instances than if you weren’t utilizing these larger bitrates. 

Determine 7: Instance exhibiting how utilizing larger bandwidth will increase the cases for congestion.

If you happen to have a look at the community pipeline in Determine 7, the upper the bitrates you’re utilizing, the extra your algorithm/code can be examined for robustness over the time of the RTC name. Determine 7 exhibits how utilizing 1 Mbps hits extra congestion than utilizing 500 Kbps and utilizing 3 Mbps hits extra congestion than 1 Mbps, and so forth. In case you are utilizing bandwidths decrease than the minimal throughput of the community, nonetheless, you gained’t hit congestion in any respect. For instance, see the 500-Kbps name in Determine 7. 

To mitigate these points, we improved congestion detection. For instance, we added customized ISP throttling detection, one thing that was not being caught by the standard delay-based estimator of WebRTC. 

Bandwidth estimator and community resilience comprise a fancy space on their very own, and that is the place RTC merchandise stand out. They’ve their very own customized algorithms that work finest for his or her merchandise and prospects.

Steady high quality

Individuals don’t like oscillations in video high quality. These can occur once we ship high-quality video for just a few seconds after which drop again to low-quality due to congestion. Studying from previous historical past, we added help in  bandwidth estimation to forestall these oscillations.

Audio is extra essential than video for RTC

When community congestion happens, all media packets may very well be misplaced. This causes video freezes and damaged audio, (aka, robotic audio). For RTC, each are unhealthy, however audio high quality is extra essential than video. 

Damaged audio typically fully prevents conversations from taking place, typically inflicting individuals to hold up or redial the decision. Damaged video, then again, typically leads to much less pleasant conversations, however, relying on the state of affairs, it is also a block for some customers.

At excessive bitrates like 2.5 Mbps and better, you may afford to have three to 5 instances extra audio packets or duplication with none noticeable degradation to video. When working in these larger bitrates with cellular phone connections, we noticed extra of those congestion, packet loss, and ISP throttling points, so we needed to make adjustments to our community resiliency algorithms. And since individuals are extremely delicate to knowledge utilization on their cell telephones, we disabled excessive bitrates on mobile connections.

When to allow HD?

We used ML-based concentrating on to guess which name ought to be HD-capable. We relied on the community stats from the customers’ earlier calls to foretell if HD ought to be enabled or not.

Battery regressions

We’ve numerous metrics, together with efficiency, networking, and media high quality, to trace the standard of RTC calls. After we ran assessments for HD, we observed regressions in battery metrics. What we discovered was that almost all battery regressions don’t come from larger bitrates or decision however from the seize body charges.

To mitigate the regressions, we constructed a mechanism for detecting each caller and callee gadget capabilities, together with gadget mannequin, battery ranges, Wi-Fi or cell utilization, and so forth. To allow high-quality modes, we examine each side of the decision to make sure that they fulfill the necessities and solely then can we allow these high-quality, resource-intensive configurations.

Determine 8: Signaling server setup for turning HD on or off.

What the longer term holds for RTC

{Hardware} producers are acknowledging the numerous advantages of utilizing AV1 for RTC. The brand new Apple iPhone 15 Professional helps AV1’s {hardware} decoder, and the Google Pixel 8 helps AV1 encoding and decoding. {Hardware} codecs are an absolute necessity for high-end community and HD resolutions. Video calling is changing into as ubiquitous as conventional audio calling and we hope that as {hardware} producers acknowledge this shift, there can be extra alternatives for collaboration between RTC app creators and {hardware} producers to optimize encoders for these eventualities. 

On the software program facet, we are going to proceed to work on optimizing AV1 software program encoders and growing new encoder implementations. We attempt to present the very best expertise for our customers, however on the similar time we need to let individuals have full management over their RTC expertise. We’ll present controls to the customers in order that they will select whether or not they need larger high quality at the price of battery and knowledge utilization, or vice versa.

We additionally plan to work with IHVs to collaborate on {hardware} codec growth to make these codecs usable for RTC eventualities together with low-bandwidth use instances. 

We additionally will examine forward-looking options comparable to video processing to extend the decision and body charges on the receiver’s rendering stack and leveraging AI/ML to enhance bandwidth estimation (BWE) and community resiliency.

Additional, we’re investigating Pixel Codec Avatar applied sciences that may permit us to transmit the mannequin/share as soon as after which ship the geometry/vectors for receiver facet rendering. This allows video rendering with a lot smaller bandwidth utilization than conventional video codecs for RTC eventualities.