Noisy Neighbor Detection with eBPF | by Netflix Technology Blog | Sep, 2024

The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from ‘sleeping’ to ‘runnable.’ They let us identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map, using the process ID as the key.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u64));
} runq_enqueued SEC(".maps");

SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx)
{
    struct task_struct *task = (void *)ctx[0];
    u32 pid = task->pid;
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
    return 0;
}

Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently using the CPU and the process about to take over. We use the upcoming task’s process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process was enqueued, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    struct task_struct *prev = (struct task_struct *)ctx[1];
    struct task_struct *next = (struct task_struct *)ctx[2];
    u32 prev_pid = prev->pid;
    u32 next_pid = next->pid;

    // fetch timestamp of when the next task was enqueued
    u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
    if (tsp == NULL) {
        return 0; // missed enqueue
    }

    // calculate runq latency before deleting the stored timestamp
    u64 now = bpf_ktime_get_ns();
    u64 runq_lat = now - *tsp;

    // delete pid from enqueued map
    bpf_map_delete_elem(&runq_enqueued, &next_pid);
    ....

One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. For our specific use case, we required the process’s cgroup ID to associate it with a container. However, the cgroup information in the process struct is protected by an RCU (Read Copy Update) lock.

To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 get_task_cgroup_id(struct task_struct *task)
{
    struct css_set *cgroups;
    u64 cgroup_id;
    bpf_rcu_read_lock();
    cgroups = task->cgroups;
    cgroup_id = cgroups->dfl_cgrp->kn->id;
    bpf_rcu_read_unlock();
    return cgroup_id;
}

Once the data is ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without requiring extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data.

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, RINGBUF_SIZE_BYTES);
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, MAX_TASK_ENTRIES);
    __uint(key_size, sizeof(u64));
    __uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");

struct runq_event {
    u64 prev_cgroup_id;
    u64 cgroup_id;
    u64 runq_lat;
    u64 ts;
};

SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx)
{
    // ....
    // The previous code
    // ....

    u64 prev_cgroup_id = get_task_cgroup_id(prev);
    u64 cgroup_id = get_task_cgroup_id(next);

    // per-cgroup-id-per-CPU rate-limiting
    // to balance observability with performance overhead
    u64 *last_ts =
        bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
    u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;

    // check the rate limit for the cgroup_id in consideration
    // before doing more work
    if (now - last_ts_val < RATE_LIMIT_NS) {
        // Rate limit exceeded, drop the event
        return 0;
    }

    struct runq_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);

    if (event) {
        event->prev_cgroup_id = prev_cgroup_id;
        event->cgroup_id = cgroup_id;
        event->runq_lat = runq_lat;
        event->ts = now;
        bpf_ringbuf_submit(event, 0);
        // Update the last event timestamp for the current cgroup_id
        bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id,
                            &now, BPF_ANY);
    }

    return 0;
}

Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. We categorize it as a system service if no such association is found. When a cgroup ID is associated with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container’s processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it is due to a process within the same container (or cgroup), a process in another container, or a system service.
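
As an illustration only, the sketch below shows what a Go consumer of these events could look like, assuming the github.com/cilium/ebpf ring buffer reader. The lookupContainer helper and the preemptionCause tagging are hypothetical stand-ins for the cgroup-to-container mapping and the Atlas metric tagging, not Netflix’s actual implementation.

// Minimal sketch only: assumes github.com/cilium/ebpf and a hypothetical
// lookupContainer helper; not the actual Netflix implementation.
package consumer

import (
    "bytes"
    "encoding/binary"
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/ringbuf"
)

// runqEvent mirrors the runq_event struct emitted by the eBPF program.
type runqEvent struct {
    PrevCgroupID uint64
    CgroupID     uint64
    RunqLat      uint64
    Ts           uint64
}

// preemptionCause tags what preempted the previous task: a process in the
// same container, a process in another container, or a system service.
func preemptionCause(preempted, incoming string) string {
    switch {
    case incoming == "":
        return "system_service"
    case incoming == preempted:
        return "same_container"
    default:
        return "other_container"
    }
}

// consumeEvents reads runq_event records from the ring buffer and logs them;
// a real consumer would emit runq.latency and sched.switch.out metrics instead.
// lookupContainer returns the container name for a cgroup ID, or "" for a
// system service.
func consumeEvents(events *ebpf.Map, lookupContainer func(cgroupID uint64) string) error {
    rd, err := ringbuf.NewReader(events)
    if err != nil {
        return err
    }
    defer rd.Close()

    for {
        record, err := rd.Read()
        if err != nil {
            return err
        }

        var e runqEvent
        if err := binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &e); err != nil {
            continue // skip malformed samples
        }

        next := lookupContainer(e.CgroupID)     // task that just got the CPU
        prev := lookupContainer(e.PrevCgroupID) // task that was switched out
        cause := preemptionCause(prev, next)

        log.Printf("container=%q runq_lat_ns=%d preempted=%q cause=%s",
            next, e.RunqLat, prev, cause)
    }
}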

It is important to highlight that both the runq.latency metric and the sched.switch.out metric are needed to determine whether a container is affected by noisy neighbors, which is the goal we aim to achieve; relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we considered only this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it is actually because the container is hitting its CPU quota. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.
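
To make that reasoning concrete, here is a toy sketch of the decision logic; the baseline comparison and the multipliers are arbitrary placeholders, not values or code used by Netflix.

// Toy illustration only: flag a noisy-neighbor suspicion when both run queue
// latency and externally caused preemptions are well above their baselines.
// A latency spike alone may simply mean the container hit its own CPU quota.
func noisyNeighborSuspected(runqLatP99, baselineLatP99, extPreemptRate, baselinePreemptRate float64) bool {
    latencyElevated := runqLatP99 > 3*baselineLatP99
    preemptionsElevated := extPreemptRate > 3*baselinePreemptRate
    return latencyElevated && preemptionsElevated
}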

Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.