Sidecar-less Kubernetes: Zero-Overhead gRPC Observability using eBPF on ACK


When architecting backend services for an international POS system or any globally distributed transaction engine, latency directly impacts revenue. You are pushing 100,000+ requests per second (RPS) of multiplexed gRPC traffic through your clusters. At this scale, the traditional service mesh architecture—specifically the Envoy or Istio sidecar model—transitions from an operational convenience into a critical bottleneck.

For Principal SREs and Platform Engineers running on Alibaba Cloud Container Service for Kubernetes (ACK), the sidecar paradigm introduces an unacceptable tax on compute, memory, and latency. The solution lies deeper in the stack: bypassing user-space proxies entirely and moving observability into the Linux kernel socket layer using extended Berkeley Packet Filter (eBPF) on Aliyun Linux 3.

This tutorial explores how to replace CPU-heavy sidecars with kernel-level eBPF probes, enabling zero-overhead gRPC tracing and routing the resulting telemetry into Alibaba Cloud Log Service (SLS) and Application Real-Time Monitoring Service (ARMS).


1. The Sidecar Tax: Envoy Overhead at 100,000+ RPS


To understand why we must abandon the sidecar, we need to dissect exactly how the model fails under extreme load.

When a pod running a gRPC microservice initiates an outbound call in a standard Istio environment, the traffic does not simply egress the node. Instead, the Linux iptables rules (specifically the PREROUTING and OUTPUT chains) intercept the packets, forcing a network address translation (NAT) redirect via conntrack into the Envoy sidecar container.


At 100,000+ RPS, this architecture breaks down in three specific ways:

  1. The Context Switch Penalty: A single gRPC request traverses the user-kernel boundary multiple times. The application writes to a socket (User → Kernel), iptables routes it to Envoy, Envoy reads it (Kernel → User), parses the HTTP/2 headers, writes it back to a new socket (User → Kernel), and finally sends it to the network interface. This context-switching overhead destroys tail latency.
  2. Memory Footprint: Envoy is notoriously memory-hungry when dealing with massive numbers of concurrent HTTP/2 streams and connection pools. In a 500-node ACK cluster running 10,000 pods, deploying 10,000 Envoy sidecars easily consumes hundreds of gigabytes of aggregate RAM just to maintain routing tables and telemetry state.
  3. TCP Buffering and Head-of-Line Blocking: gRPC utilizes HTTP/2 multiplexing, meaning multiple concurrent requests share a single TCP connection. Envoy terminates that connection and proxies the streams over a second upstream connection, so the kernel’s congestion control now governs two decoupled flows with a user-space buffer between them. Under load, that buffering produces head-of-line blocking and micro-burst packet drops at the node’s qdisc layer.

The “Sidecar Tax” is the realization that you are paying Alibaba Cloud for CPU cores that are doing nothing but shuffling bytes between loopback interfaces.


2. The eBPF Paradigm: Kernel-Level Socket Hooking


eBPF fundamentally changes how we extract observability data. Instead of routing traffic to an observer (the sidecar), eBPF injects the observer into the traffic path (the kernel).

eBPF allows us to run sandboxed, event-driven programs (typically written in restricted C and compiled to BPF bytecode) directly within the Linux kernel. When running ACK on Aliyun Linux 3 (AL3), we benefit from a highly optimized 5.10+ kernel that natively supports advanced eBPF features like bounded loops, CO-RE (Compile Once – Run Everywhere), and ring buffers.

For gRPC observability, we don’t need to intercept packets at the network interface layer (XDP or TC). By the time a packet hits the NIC, the gRPC/HTTP2 headers are encrypted via TLS. Instead, we hook higher up the stack, directly at the socket layer, specifically targeting kernel functions like tcp_sendmsg and tcp_recvmsg.

At this layer, the payload is often still in plaintext (if TLS terminates elsewhere, such as at an ingress gateway, or if we instead attach uprobes to userspace encryption libraries like OpenSSL’s SSL_write/SSL_read before the data is encrypted), allowing us to parse the HTTP/2 frames, extract the grpc-status, grpc-message, and trace IDs, and push this telemetry to userspace asynchronously via eBPF ring buffers. The application thread is never blocked, and no traffic is ever redirected.
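To make the framing concrete, here is the 9-byte HTTP/2 frame-header decode that such a probe performs, sketched as ordinary userspace C. The layout comes from RFC 7540 §4.1; the struct and function names are ours, not part of any BCC API.

```c
#include <stdint.h>
#include <stddef.h>

// HTTP/2 frame header layout (RFC 7540, §4.1): 9 bytes on the wire.
struct h2_frame_header {
    uint32_t length;    // 24-bit payload length, big-endian on the wire
    uint8_t  type;      // 0x0 = DATA, 0x1 = HEADERS, ...
    uint8_t  flags;
    uint32_t stream_id; // 31-bit stream identifier (top bit is reserved)
};

// Decode one frame header from a raw buffer. Returns 0 on success,
// -1 if fewer than 9 bytes remain.
int h2_parse_frame_header(const uint8_t *buf, size_t len,
                          struct h2_frame_header *out) {
    if (len < 9)
        return -1;
    out->length    = ((uint32_t)buf[0] << 16) | ((uint32_t)buf[1] << 8) | buf[2];
    out->type      = buf[3];
    out->flags     = buf[4];
    out->stream_id = (((uint32_t)buf[5] << 24) | ((uint32_t)buf[6] << 16) |
                      ((uint32_t)buf[7] << 8)  |  (uint32_t)buf[8]) & 0x7FFFFFFF;
    return 0;
}
```

Inside the kprobe, the same arithmetic runs on bytes copied out of the msghdr with bpf_probe_read_user(); only the copy mechanism differs.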


3. Implementation Details: eBPF on Aliyun Linux 3


To implement this on ACK, we first need to configure our worker nodes. Aliyun Linux 3 is the recommended OS for this architecture, as its kernel is specifically patched for high-performance cloud-native workloads.


Node Preparation

Deploy a DaemonSet to your ACK cluster to install the necessary BPF Compiler Collection (BCC) tools and kernel headers on the underlying AL3 nodes.

Bash

# Executed via privileged DaemonSet init-container
yum update -y
yum install -y bcc-tools bcc-devel kernel-devel-$(uname -r) elfutils-libelf-devel

The Kernel Probe: Hooking tcp_sendmsg

To trace gRPC traffic, we need to inspect the data being written to TCP sockets. gRPC uses HTTP/2, which is a binary framing protocol. We need an eBPF program that attaches to the kernel’s tcp_sendmsg function, reads the user-space memory containing the buffer, and looks for HTTP/2 magic bytes and HEADERS frames.
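The "magic bytes" here are the fixed 24-byte HTTP/2 client connection preface. As a minimal sketch in plain C (the helper name is ours), the check is a fixed-length compare; the in-kernel version performs the same compare on the first bytes copied out of the msghdr to decide whether a socket carries HTTP/2:

```c
#include <string.h>
#include <stddef.h>

// The HTTP/2 client connection preface (RFC 7540, §3.5): every client
// connection begins with exactly these 24 bytes before any frames.
static const char H2_PREFACE[] = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n";

// Returns 1 if the buffer begins with the HTTP/2 connection preface.
int h2_has_preface(const char *buf, size_t len) {
    if (len < sizeof(H2_PREFACE) - 1)
        return 0;
    return memcmp(buf, H2_PREFACE, sizeof(H2_PREFACE) - 1) == 0;
}
```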

Below is a simplified eBPF C program utilizing the BCC framework. BCC compiles this code on the node at load time via its embedded LLVM toolchain; the kernel then verifies the bytecode and JIT-compiles it to native instructions.

C

#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>

// Define a ring buffer to send events to userspace
BPF_RINGBUF_OUTPUT(grpc_events, 256);

// Struct to hold our extracted telemetry
struct grpc_event_t {
    u32 pid;
    u32 daddr;
    u16 dport;
    char method[64];
    u64 latency_ns;
};

// Hook into the kernel's tcp_sendmsg function. The BCC loader attaches
// this via attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg").
int trace_tcp_sendmsg(struct pt_regs *ctx, struct sock *sk, struct msghdr *msg, size_t size) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    
    // Filter out irrelevant traffic (e.g., node components)
    // Only trace our application namespace PIDs
    if (pid < 1000) return 0;

    struct grpc_event_t event = {};
    event.pid = pid;
    
    // Extract destination IP and port from the socket struct.
    // Note: skc_daddr and skc_dport are stored in network byte order;
    // the user-space consumer converts them (e.g. via ntohs) before logging.
    bpf_probe_read_kernel(&event.daddr, sizeof(event.daddr), &sk->__sk_common.skc_daddr);
    bpf_probe_read_kernel(&event.dport, sizeof(event.dport), &sk->__sk_common.skc_dport);
    
    // --- HTTP/2 Framing Parsing Logic ---
    // In a production scenario, we must read the iov_iter from the msghdr,
    // locate the HTTP/2 HEADERS frame, and extract the :path pseudo-header 
    // which contains the gRPC method (e.g., /CheckoutService/ProcessTransaction).
    // 
    // Note: Reading user memory from a kprobe requires bpf_probe_read_user()
    // to prevent page faults in the kernel.
    
    // Submit the event asynchronously to userspace. (BCC's ringbuf_submit()
    // pairs with ringbuf_reserve(); for a stack-allocated event,
    // ringbuf_output() copies it into the ring buffer directly.)
    grpc_events.ringbuf_output(&event, sizeof(event), 0);

    return 0;
}

Once this eBPF program is running, a lightweight user-space Go daemon reads the grpc_events ring buffer. Because the heavy lifting (parsing and filtering) is done in the kernel, the user-space CPU footprint is minimal. This Go daemon then formats the traces and ships them directly into Alibaba Cloud SLS for durable storage and querying, and pushes aggregated metrics to ARMS for dashboarding.
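For illustration, the per-event decode such a daemon performs can be sketched in C (the production consumer described above is Go; the struct mirrors grpc_event_t from the probe). The byte-order conversions are the important detail: the kernel socket struct stores skc_daddr and skc_dport in network byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

// Userspace mirror of the struct emitted by the kernel probe.
struct grpc_event_t {
    uint32_t pid;
    uint32_t daddr;      // IPv4 address, network byte order (skc_daddr)
    uint16_t dport;      // TCP port, network byte order (skc_dport)
    char     method[64];
    uint64_t latency_ns;
};

// Format one ring-buffer event as a log line ready to ship to SLS.
// Returns the number of characters written (snprintf semantics).
int format_event(const struct grpc_event_t *ev, char *out, size_t out_len) {
    char ip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &ev->daddr, ip, sizeof(ip));  // network-order in, dotted quad out
    return snprintf(out, out_len, "pid=%u dst=%s:%u method=%s latency_ns=%llu",
                    ev->pid, ip, ntohs(ev->dport), ev->method,
                    (unsigned long long)ev->latency_ns);
}
```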


4. The ‘MVP’ Failure Mode: Kernel Panics and the eBPF Verifier


When migrating to eBPF, Platform Engineers often hit a severe roadblock: the eBPF Verifier. The Linux kernel will refuse to load any eBPF bytecode that it cannot mathematically prove is safe. It must guarantee that your program will not crash, will not access out-of-bounds memory, and—crucially—will not run infinitely.


The Bounded Loop Problem

Parsing gRPC means parsing HTTP/2 frames. HTTP/2 frames have variable lengths, and a single TCP packet might contain multiple frames. A naive C programmer would write a while(true) loop to iterate through the buffer until the end is reached.

If you attempt this, the AL3 kernel verifier will instantly reject your program.

Plaintext

bpf: Failed to load program: R1 offset is outside of the packet
verifier log:
...
infinite loop detected at insn 42

The verifier simulates every possible execution path of your bytecode. If it cannot determine exactly when a loop will terminate, it assumes it might loop forever, which would lock up the kernel and cause a panic.


The Solution: #pragma unroll and Bounded Logic

To survive the verifier, you must write strictly bounded loops. You must explicitly tell the compiler the maximum number of iterations, allowing the verifier to calculate the worst-case execution time.

C

#define MAX_HTTP2_FRAMES 10

// Correctly bounded loop for the verifier: the iteration count has a
// compile-time maximum, so worst-case execution time is provable.
#pragma unroll
for (int i = 0; i < MAX_HTTP2_FRAMES; i++) {
    if (buffer_pointer + 9 > buffer_end) {
        break; // Safe exit: not enough bytes left for a 9-byte frame header
    }
    // 1. Read the frame length from the header
    // 2. Ensure the frame length does not exceed the remaining buffer
    // 3. Process the frame
    // 4. Advance buffer_pointer past the header and payload
}

Furthermore, every time you manipulate a pointer to read payload data, you must perform bounds checking. The verifier tracks pointer registers; if it sees you trying to read packet_start + offset without first checking if packet_start + offset < packet_end, it will reject the code.
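The overall shape of verifier-safe parsing (a hard iteration cap plus an explicit bounds check before every read) can be modeled in ordinary userspace C; the frame cap and function name here are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_HTTP2_FRAMES 10

// Userspace model of the verifier-safe frame walk: every access is
// preceded by a check against the buffer end, and the loop has a fixed
// upper bound. In the eBPF version these same checks are what let the
// verifier prove every read stays in bounds.
// Counts HEADERS frames (type 0x1) in the buffer.
int h2_count_headers_frames(const uint8_t *buf, size_t len) {
    size_t off = 0;
    int headers = 0;
    for (int i = 0; i < MAX_HTTP2_FRAMES; i++) {
        if (off + 9 > len)            // not enough bytes for a frame header
            break;
        uint32_t flen = ((uint32_t)buf[off] << 16) |
                        ((uint32_t)buf[off + 1] << 8) | buf[off + 2];
        if (off + 9 + flen > len)     // payload would run past the buffer
            break;
        if (buf[off + 3] == 0x1)      // frame type: HEADERS
            headers++;
        off += 9 + flen;              // advance past header + payload
    }
    return headers;
}
```

Note that the pointer is only ever advanced after both checks succeed; reordering the checks after the reads is exactly the pattern the verifier rejects.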

Mastering the verifier is the true hurdle of eBPF observability. It requires a mindset shift from “write code that works” to “write code that the kernel knows works.”


5. Conclusion: Moving Towards Ambient Mesh Architectures


The era of injecting a full Layer 7 proxy into every pod is drawing to a close for hyper-scale environments. The compute and latency costs are simply too high for high-throughput, low-latency microservices.

By leveraging eBPF on Alibaba Cloud’s optimized Aliyun Linux 3 instances, we can achieve 100% visibility into gRPC traffic at 100,000+ RPS with negligible overhead. We extract the exact telemetry we need directly from the socket, securely route it to SLS for analysis, and bypass the complex iptables NAT routing that plagues standard ACK deployments.

This approach paves the way for the next evolution in Kubernetes networking: the ambient mesh. Technologies like Cilium and Istio Ambient Mesh are already moving in this direction, relying heavily on eBPF to separate Layer 4 routing from Layer 7 processing. By mastering eBPF kernel probes today, you future-proof your cluster architecture for the sidecar-less tomorrow.


Read more: 👉 Designing a Cloud Architecture That Survives Internet Shutdowns
