# Networking

## Problem description

Service meshes (such as [Istio][], [Linkerd][]) typically expect application
processes to run on the same physical host, usually in a separate user
namespace. Network namespaces might be used too, for additional isolation.
Network traffic to and from local processes is monitored and proxied by
redirecting and observing local sockets. `iptables` and `nftables`
(collectively referred to as the `netfilter` framework) are the typical Linux
facilities providing classification and redirection of packets.

![containers][Networking-Containers]

*Service meshes with containers. Typical ingress path: **1.** NIC driver
queues buffers for IP processing **2.** `netfilter` rules installed by the
*service mesh* redirect packets to the proxy **3.** IP receive path completes,
L4 protocol handler invoked **4.** TCP socket of the proxy receives packets
**5.** proxy opens a TCP socket towards the application service **6.** packets
get a TCP header, ready for classification **7.** `netfilter` rules installed
by the service mesh forward the request to the service **8.** local IP routing
queues packets for the TCP protocol handler **9.** application process
receives packets and handles the request. The egress path is conceptually
symmetrical.*

If we move application processes to VMs, sockets and processes are no longer
visible to the host: all traffic is typically forwarded via interfaces
operating at the data link level. Socket redirection and port mapping to local
processes don't work.

![and now?][Networking-Challenge]

*Application process moved to a VM: **8.** IP layer enqueues packets to the L2
interface towards the application **9.** `tap` driver forwards L2 packets to
the guest **10.** packets are received on the `virtio-net` ring buffer
**11.** guest driver queues buffers for IP processing **12.** IP receive path
completes, L4 protocol handler invoked **13.** TCP socket of the application
receives packets and handles the request. **:warning: Proxy challenge**: the
service mesh can't forward packets to local sockets via `netfilter` rules.
*Add-on* NAT rules might conflict, as service meshes expect full control of
the ruleset. Socket monitoring and PID/UID classification aren't possible.*
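For reference, a minimal sketch of the socket redirection the steps above rely
on: a transparent proxy binds a local port, `netfilter` `REDIRECT` rules steer
connections to it, and the proxy recovers the original destination with the
`SO_ORIGINAL_DST` socket option. The port number and overall structure below
are illustrative, not taken from any particular service mesh, and this is
exactly the kind of local-socket machinery that stops working once the
application moves into a VM:

```c
/*
 * Illustrative sketch only: how a transparent proxy (e.g. a service mesh
 * sidecar) accepts a connection that netfilter REDIRECTed to it, and
 * recovers the original destination via SO_ORIGINAL_DST.
 */
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter_ipv4.h>	/* SO_ORIGINAL_DST */
#include <unistd.h>

int main(void)
{
	int l = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in a = {
		.sin_family = AF_INET,
		.sin_port = htons(15001),	/* proxy port, arbitrary here */
		.sin_addr = { .s_addr = htonl(INADDR_ANY) },
	};

	if (l < 0 || bind(l, (struct sockaddr *)&a, sizeof(a)) || listen(l, 1))
		return 1;

	for (;;) {
		struct sockaddr_in orig;
		socklen_t len = sizeof(orig);
		char dst[INET_ADDRSTRLEN];
		int c = accept(l, NULL, NULL);

		if (c < 0)
			continue;

		/* Where was this connection originally headed? This only
		 * works if netfilter redirected it on this host: with the
		 * application in a VM, there is no local socket to accept().
		 */
		if (!getsockopt(c, SOL_IP, SO_ORIGINAL_DST, &orig, &len))
			printf("original destination: %s:%u\n",
			       inet_ntop(AF_INET, &orig.sin_addr, dst,
					 sizeof(dst)),
			       ntohs(orig.sin_port));
		close(c);
	}
}
```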
## Existing solutions

Existing solutions typically implement a full TCP/IP stack, replaying traffic
on sockets that are local to the Pod of the service mesh. This creates the
illusion of application processes running on the same host, possibly separated
by user namespaces.

![slirp][Networking-Slirp]

*Existing solutions introduce a third TCP/IP stack: **8.** local IP routing
queues packets for the TCP protocol handler **9.** a userspace implementation
of the TCP/IP stack receives packets on a local socket, and **10.** forwards
the L2 encapsulation to the `tap` *QEMU* interface (socket back-end).*

While almost transparent to the service mesh infrastructure, this kind of
solution comes with a number of downsides:

* three different TCP/IP stacks (guest, adaptation and host) need to be
  traversed for every service request. This leaves no opportunity for
  zero-copy mechanisms, and the number of context switches increases
  dramatically
* addressing needs to be coordinated to create the pretense of consistent
  addresses and routes between guest and host environments. This typically
  needs NAT with masquerading, or some form of packet bridging
* the traffic seen by the service mesh, and observable externally, is a
  distant replica of the packets forwarded to and from the guest environment:
  * TCP congestion windows and network buffering mechanisms in general operate
    differently from what the application would naturally expect
  * protocols carrying addressing information might pose additional
    challenges, as the applications don't see the same set of addresses and
    routes as they would if deployed with regular containers

## Experiments

![experiments: thin layer][Networking-Experiments-Thin-Layer]

*How can we improve on the existing solutions while maintaining drop-in
compatibility? A thin layer implements a TCP adaptation and IP services.*

These are some directions we have been exploring so far:

* a thinner layer between guest and host, implementing only what's strictly
  needed to pretend that processes are running locally. A further TCP/IP stack
  is not necessarily needed. Some sort of TCP adaptation is needed, however,
  if this layer (currently implemented as a userspace process) runs without
  the `CAP_NET_RAW` capability: we can't create raw IP sockets on the Pod, and
  therefore need to map packets at layer 2 to the layer 4 sockets offered by
  the host kernel
* to avoid implementing an actual TCP/IP stack like the one offered by
  *libslirp*, we can align the TCP parameters advertised towards the guest
  (MSS, congestion window) with the socket parameters provided by the host
  kernel, probing them via the `TCP_INFO` socket option (introduced in Linux
  2.4). Segmentation and reassembly are therefore not needed, which gives a
  solid chance of avoiding dynamic memory allocation altogether, and
  congestion control becomes implicitly equivalent as parameters are mirrored
  between the two sides (see the first sketch after this list)
* to reflect the actual receive dynamics of the guest and support
  retransmissions without a permanent userspace buffer, segments are not
  dequeued (`MSG_PEEK`) until acknowledged by the receiver (the application),
  as outlined in the second sketch after this list
* similarly, the implementation of the host-side sender adjusts the MSS
  (`TCP_MAXSEG` socket option, since Linux 2.6.28) and the advertised window
  (`TCP_WINDOW_CLAMP`, since Linux 2.4) to the parameters observed from
  incoming packets (also shown in the first sketch after this list)
* this adaptation layer needs to maintain some of the TCP states, but we can
  rely on the host kernel TCP implementation for the different states of
  connections being closed
* no particular requirements are placed on the MTU of guest interfaces: if
  fragments are received, the payload from single fragmented packets can be
  reassembled by the host kernel as needed, and out-of-order fragments can be
  safely discarded, as there's no intermediate hop justifying the condition
* this layer would connect to `qemu` over a *UNIX domain socket*, instead of a
  `tap` interface, so that the `CAP_NET_ADMIN` capability doesn't need to be
  granted to any process on the Pod: no further network interfaces are created
  on the host (see the last sketch after this list)
* transparent, adaptive mapping of ports to the guest, to avoid the need for
  explicit port forwarding
* security and maintainability goals: no dynamic memory allocation, ~2,000
  *LoC* target, no external dependencies
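A minimal sketch of the parameter mirroring described in the list above,
combining the `TCP_INFO` probe with the `TCP_MAXSEG` and `TCP_WINDOW_CLAMP`
adjustments: only the socket options come from these notes, while the function
names and the way values would be exchanged with the guest-facing TCP
adaptation are hypothetical:

```c
/*
 * Sketch: mirror TCP parameters between the host-side socket and the TCP
 * adaptation facing the guest. Only the socket options (TCP_INFO,
 * TCP_MAXSEG, TCP_WINDOW_CLAMP) are taken from the notes above; names and
 * types of the "guest side" values are illustrative.
 */
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Advertise towards the guest what the host socket actually negotiated */
static int probe_host_socket(int s, uint32_t *mss, uint32_t *cwnd)
{
	struct tcp_info ti;
	socklen_t sl = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return -1;

	*mss = ti.tcpi_snd_mss;		/* MSS to advertise to the guest */
	*cwnd = ti.tcpi_snd_cwnd;	/* congestion window, in segments */
	return 0;
}

/* Clamp the host socket to what the guest advertised in its own segments */
static int mirror_guest_params(int s, int guest_mss, int guest_window)
{
	if (setsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
		       &guest_mss, sizeof(guest_mss)))
		return -1;

	return setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
			  &guest_window, sizeof(guest_window));
}
```

Probing the host socket right after the handshake, and again as segments flow,
would keep the values the guest sees close to what the host kernel actually
negotiated, without any segmentation or reassembly in the adaptation layer.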
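A second sketch, for the `MSG_PEEK` mechanism: data is peeked rather than
dequeued, so the kernel receive queue itself acts as the retransmission buffer
until the guest acknowledges the data. Function names and the scratch buffer
are illustrative; a real implementation might discard acknowledged bytes
without copying them at all:

```c
/*
 * Sketch: use the host kernel receive queue as the retransmission buffer.
 * Data is only peeked (MSG_PEEK) when forwarding to the guest, and drained
 * once the guest acknowledges it. Names are illustrative.
 */
#include <sys/socket.h>
#include <sys/types.h>

/* Look at pending data without removing it from the kernel queue, so it
 * can be sent to the guest again verbatim if a retransmission is needed.
 */
static ssize_t peek_for_guest(int s, void *buf, size_t len)
{
	return recv(s, buf, len, MSG_PEEK);
}

/* Once the guest ACKs 'acked' bytes, actually drain them from the queue.
 * A scratch buffer keeps the sketch simple; the copy could be avoided.
 */
static void drop_acked(int s, size_t acked)
{
	char scratch[4096];

	while (acked) {
		size_t chunk = acked < sizeof(scratch) ? acked
						       : sizeof(scratch);
		ssize_t n = recv(s, scratch, chunk, MSG_DONTWAIT);

		if (n <= 0)
			break;

		acked -= (size_t)n;
	}
}
```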
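Finally, a sketch of the guest-facing side without a `tap` interface: the
layer listens on a UNIX domain socket and QEMU connects to it with its socket
network back-end, which, for stream connections, frames each Ethernet packet
with a 32-bit length prefix in network byte order. The socket path and error
handling are placeholders:

```c
/*
 * Sketch: accept a connection from QEMU's socket network back-end on a
 * UNIX domain socket, and read Ethernet frames from it. Each frame is
 * assumed to be preceded by a 32-bit length in network byte order.
 */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int guest_listen(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int s = socket(AF_UNIX, SOCK_STREAM, 0);

	if (s < 0)
		return -1;

	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	unlink(path);
	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) || listen(s, 1)) {
		close(s);
		return -1;
	}
	return accept(s, NULL, NULL);	/* QEMU connects here */
}

/* Read one length-prefixed Ethernet frame sent by QEMU */
static ssize_t guest_frame(int fd, void *buf, size_t size)
{
	uint32_t len;

	if (recv(fd, &len, sizeof(len), MSG_WAITALL) != sizeof(len))
		return -1;

	len = ntohl(len);
	if (len > size)
		return -1;

	return recv(fd, buf, len, MSG_WAITALL);
}
```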
![experiments: ebpf][Networking-Experiments-eBPF]

*Additionally, an `eBPF` fast path could be implemented, **6.** hooking at
socket level, and **7.** mapping IP and Ethernet addresses, with the existing
layer implementing connection tracking and the slow path.*

If additional capabilities are granted, the data path can be optimised in
several ways:

* with `CAP_NET_RAW`:
  * the adaptation layer can use raw IP sockets instead of L4 sockets,
    implementing pure connection tracking without the need for any TCP logic:
    with this variation, the guest operating system implements the single TCP
    stack needed
  * zero-copy mechanisms could be implemented using `vhost-user` and QEMU
    socket back-ends, instead of relying on a full-fledged layer 2 (Ethernet)
    interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
  * context switching in packet forwarding could be avoided by using the
    `sockmap` extension provided by `eBPF`, and by programming the `XDP` data
    hooks for in-kernel data transfers (see the sketch below)
  * using eBPF programs, we might want to switch (dynamically?) to the
    `vhost-net` facility
  * the userspace process would still need to take care of establishing
    in-kernel flows, and of providing IPv4 and IPv6 services (ARP, DHCP, NDP)
    for addressing transparency and to avoid the need for further capabilities
    (e.g. `CAP_NET_BIND_SERVICE`), but the main, fast data path would reside
    entirely in the kernel

[Istio]: https://istio.io/
[Linkerd]: https://linkerd.io/

[Networking-Challenge]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Challenge.png
[Networking-Containers]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Containers.png
[Networking-Experiments-Thin-Layer]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-Thin-Layer.png
[Networking-Experiments-eBPF]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-eBPF.png
[Networking-Slirp]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Slirp.png
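As an illustration of the `sockmap` idea above, a minimal `sk_msg` program
that redirects payload between two established sockets held in a sockmap, so
that forwarding stays in the kernel. The map layout and the fixed peer key are
purely hypothetical; the userspace process would attach the program with the
`BPF_SK_MSG_VERDICT` attach type, insert sockets into the map as connections
are tracked, and keep handling the slow path itself:

```c
/* Sketch (eBPF, restricted C): redirect payload between two established
 * sockets held in a sockmap. Map layout and key scheme are hypothetical.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u64);
} fwd_map SEC(".maps");

SEC("sk_msg")
int fast_path(struct sk_msg_md *msg)
{
	/* Hypothetical convention: the key under which userspace inserted
	 * the peer socket for this flow.
	 */
	__u32 peer_key = 0;

	/* bpf_msg_redirect_map() returns SK_PASS on success, SK_DROP on
	 * failure, so the verdict can be returned directly.
	 */
	return bpf_msg_redirect_map(msg, &fwd_map, peer_key, 0);
}

char _license[] SEC("license") = "GPL";
```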