# Networking

## Problem description

Service meshes (such as [Istio][], [Linkerd][]) typically expect application
processes to run on the same physical host, usually in a separate user
namespace. Network namespaces might be used too, for additional isolation.
Network traffic to and from local processes is monitored and proxied by
redirecting and observing local sockets. `iptables` and `nftables`
(collectively referred to as the `netfilter` framework) are the typical Linux
facilities providing classification and redirection of packets.

![containers][Networking-Containers]

*Service meshes with containers. Typical ingress path: **1.** NIC driver
queues buffers for IP processing **2.** `netfilter` rules installed by the
*service mesh* redirect packets to the proxy **3.** IP receive path completes,
L4 protocol handler invoked **4.** TCP socket of the proxy receives packets
**5.** proxy opens a TCP socket towards the application service **6.** packets
get a TCP header, ready for classification **7.** `netfilter` rules installed
by the service mesh forward the request to the service **8.** local IP routing
queues packets for the TCP protocol handler **9.** application process
receives packets and handles the request. The egress path is conceptually
symmetrical.*

If we move application processes to VMs, sockets and processes are no longer
visible to the host: all traffic is typically forwarded via interfaces
operating at the data link level. Socket redirection and port mapping to local
processes don't work.

![and now?][Networking-Challenge]

*Application process moved to a VM: **8.** IP layer enqueues packets to the L2
interface towards the application **9.** `tap` driver forwards L2 packets to
the guest **10.** packets are received on the `virtio-net` ring buffer
**11.** guest driver queues buffers for IP processing **12.** IP receive path
completes, L4 protocol handler invoked **13.** TCP socket of the application
receives packets and handles the request. **:warning: Proxy challenge**: the
service mesh can't forward packets to local sockets via `netfilter` rules.
*Add-on* NAT rules might conflict, as service meshes expect full control of
the ruleset. Socket monitoring and PID/UID classification aren't possible.*
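For reference, a minimal sketch of the socket redirection the steps above rely
on: a transparent proxy binds a local port, `netfilter` `REDIRECT` rules steer
connections to it, and the proxy recovers the original destination with the
`SO_ORIGINAL_DST` socket option. The port number and overall structure below
are illustrative, not taken from any particular service mesh, and this is
exactly the kind of local-socket machinery that stops working once the
application moves into a VM:

```c
/*
 * Illustrative sketch only: how a transparent proxy (e.g. a service mesh
 * sidecar) accepts a connection that netfilter REDIRECTed to it, and
 * recovers the original destination via SO_ORIGINAL_DST.
 */
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/netfilter_ipv4.h>	/* SO_ORIGINAL_DST */
#include <unistd.h>

int main(void)
{
	int l = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in a = {
		.sin_family = AF_INET,
		.sin_port = htons(15001),	/* proxy port, arbitrary here */
		.sin_addr = { .s_addr = htonl(INADDR_ANY) },
	};

	if (l < 0 || bind(l, (struct sockaddr *)&a, sizeof(a)) || listen(l, 1))
		return 1;

	for (;;) {
		struct sockaddr_in orig;
		socklen_t len = sizeof(orig);
		char dst[INET_ADDRSTRLEN];
		int c = accept(l, NULL, NULL);

		if (c < 0)
			continue;

		/* Where was this connection originally headed? This only
		 * works if netfilter redirected it on this host: with the
		 * application in a VM, there is no local socket to accept().
		 */
		if (!getsockopt(c, SOL_IP, SO_ORIGINAL_DST, &orig, &len))
			printf("original destination: %s:%u\n",
			       inet_ntop(AF_INET, &orig.sin_addr, dst,
					 sizeof(dst)),
			       ntohs(orig.sin_port));
		close(c);
	}
}
```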
## Existing solutions

Existing solutions typically implement a full TCP/IP stack, replaying traffic
on sockets that are local to the Pod of the service mesh. This creates the
illusion of application processes running on the same host, possibly separated
by user namespaces.

![slirp][Networking-Slirp]

*Existing solutions introduce a third TCP/IP stack: **8.** local IP routing
queues packets for the TCP protocol handler **9.** a userspace implementation
of the TCP/IP stack receives packets on a local socket, and **10.** forwards
the L2 encapsulation to the `tap` *QEMU* interface (socket back-end).*

While almost transparent to the service mesh infrastructure, this kind of
solution comes with a number of downsides:

* three different TCP/IP stacks (guest, adaptation and host) need to be
  traversed for every service request. This leaves no opportunity for
  zero-copy mechanisms, and the number of context switches increases
  dramatically
* addressing needs to be coordinated to create the pretense of consistent
  addresses and routes between guest and host environments. This typically
  needs NAT with masquerading, or some form of packet bridging
* the traffic seen by the service mesh, and observable externally, is a
  distant replica of the packets forwarded to and from the guest environment:
  * TCP congestion windows and network buffering mechanisms in general operate
    differently from what the application would naturally expect
  * protocols carrying addressing information might pose additional
    challenges, as the applications don't see the same set of addresses and
    routes as they would if deployed with regular containers

## Experiments

![experiments: thin layer][Networking-Experiments-Thin-Layer]

*How can we improve on the existing solutions while maintaining drop-in
compatibility? A thin layer implements a TCP adaptation and IP services.*

These are some directions we have been exploring so far:

* a thinner layer between guest and host, implementing only what's strictly
  needed to pretend that processes are running locally. A further TCP/IP stack
  is not necessarily needed. Some sort of TCP adaptation is needed, however,
  if this layer (currently implemented as a userspace process) runs without
  the `CAP_NET_RAW` capability: we can't create raw IP sockets on the Pod, and
  therefore need to map packets at layer 2 to the layer 4 sockets offered by
  the host kernel
* to avoid implementing an actual TCP/IP stack like the one offered by
  *libslirp*, we can align the TCP parameters advertised towards the guest
  (MSS, congestion window) with the socket parameters provided by the host
  kernel, probing them via the `TCP_INFO` socket option (introduced in Linux
  2.4). Segmentation and reassembly are therefore not needed, which gives a
  solid chance of avoiding dynamic memory allocation altogether, and
  congestion control becomes implicitly equivalent as parameters are mirrored
  between the two sides (see the first sketch after this list)
* to reflect the actual receive dynamics of the guest and support
  retransmissions without a permanent userspace buffer, segments are not
  dequeued (`MSG_PEEK`) until acknowledged by the receiver (the application),
  as outlined in the second sketch after this list
* similarly, the implementation of the host-side sender adjusts the MSS
  (`TCP_MAXSEG` socket option, since Linux 2.6.28) and the advertised window
  (`TCP_WINDOW_CLAMP`, since Linux 2.4) to the parameters observed from
  incoming packets (also shown in the first sketch after this list)
* this adaptation layer needs to maintain some of the TCP states, but we can
  rely on the host kernel TCP implementation for the different states of
  connections being closed
* no particular requirements are placed on the MTU of guest interfaces: if
  fragments are received, the payload from single fragmented packets can be
  reassembled by the host kernel as needed, and out-of-order fragments can be
  safely discarded, as there's no intermediate hop justifying the condition
* this layer would connect to `qemu` over a *UNIX domain socket*, instead of a
  `tap` interface, so that the `CAP_NET_ADMIN` capability doesn't need to be
  granted to any process on the Pod: no further network interfaces are created
  on the host (see the last sketch after this list)
* transparent, adaptive mapping of ports to the guest, to avoid the need for
  explicit port forwarding
* security and maintainability goals: no dynamic memory allocation, ~2,000
  *LoC* target, no external dependencies
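A minimal sketch of the parameter mirroring described in the list above,
combining the `TCP_INFO` probe with the `TCP_MAXSEG` and `TCP_WINDOW_CLAMP`
adjustments: only the socket options come from these notes, while the function
names and the way values would be exchanged with the guest-facing TCP
adaptation are hypothetical:

```c
/*
 * Sketch: mirror TCP parameters between the host-side socket and the TCP
 * adaptation facing the guest. Only the socket options (TCP_INFO,
 * TCP_MAXSEG, TCP_WINDOW_CLAMP) are taken from the notes above; names and
 * types of the "guest side" values are illustrative.
 */
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Advertise towards the guest what the host socket actually negotiated */
static int probe_host_socket(int s, uint32_t *mss, uint32_t *cwnd)
{
	struct tcp_info ti;
	socklen_t sl = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return -1;

	*mss = ti.tcpi_snd_mss;		/* MSS to advertise to the guest */
	*cwnd = ti.tcpi_snd_cwnd;	/* congestion window, in segments */
	return 0;
}

/* Clamp the host socket to what the guest advertised in its own segments */
static int mirror_guest_params(int s, int guest_mss, int guest_window)
{
	if (setsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
		       &guest_mss, sizeof(guest_mss)))
		return -1;

	return setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
			  &guest_window, sizeof(guest_window));
}
```

Probing the host socket right after the handshake, and again as segments flow,
would keep the values the guest sees close to what the host kernel actually
negotiated, without any segmentation or reassembly in the adaptation layer.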
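A second sketch, for the `MSG_PEEK` mechanism: data is peeked rather than
dequeued, so the kernel receive queue itself acts as the retransmission buffer
until the guest acknowledges the data. Function names and the scratch buffer
are illustrative; a real implementation might discard acknowledged bytes
without copying them at all:

```c
/*
 * Sketch: use the host kernel receive queue as the retransmission buffer.
 * Data is only peeked (MSG_PEEK) when forwarding to the guest, and drained
 * once the guest acknowledges it. Names are illustrative.
 */
#include <sys/socket.h>
#include <sys/types.h>

/* Look at pending data without removing it from the kernel queue, so it
 * can be sent to the guest again verbatim if a retransmission is needed.
 */
static ssize_t peek_for_guest(int s, void *buf, size_t len)
{
	return recv(s, buf, len, MSG_PEEK);
}

/* Once the guest ACKs 'acked' bytes, actually drain them from the queue.
 * A scratch buffer keeps the sketch simple; the copy could be avoided.
 */
static void drop_acked(int s, size_t acked)
{
	char scratch[4096];

	while (acked) {
		size_t chunk = acked < sizeof(scratch) ? acked
						       : sizeof(scratch);
		ssize_t n = recv(s, scratch, chunk, MSG_DONTWAIT);

		if (n <= 0)
			break;

		acked -= (size_t)n;
	}
}
```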
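Finally, a sketch of the guest-facing side without a `tap` interface: the
layer listens on a UNIX domain socket and QEMU connects to it with its socket
network back-end, which, for stream connections, frames each Ethernet packet
with a 32-bit length prefix in network byte order. The socket path and error
handling are placeholders:

```c
/*
 * Sketch: accept a connection from QEMU's socket network back-end on a
 * UNIX domain socket, and read Ethernet frames from it. Each frame is
 * assumed to be preceded by a 32-bit length in network byte order.
 */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int guest_listen(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int s = socket(AF_UNIX, SOCK_STREAM, 0);

	if (s < 0)
		return -1;

	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	unlink(path);
	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) || listen(s, 1)) {
		close(s);
		return -1;
	}
	return accept(s, NULL, NULL);	/* QEMU connects here */
}

/* Read one length-prefixed Ethernet frame sent by QEMU */
static ssize_t guest_frame(int fd, void *buf, size_t size)
{
	uint32_t len;

	if (recv(fd, &len, sizeof(len), MSG_WAITALL) != sizeof(len))
		return -1;

	len = ntohl(len);
	if (len > size)
		return -1;

	return recv(fd, buf, len, MSG_WAITALL);
}
```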
![experiments: ebpf][Networking-Experiments-eBPF]

*Additionally, an `eBPF` fast path could be implemented, **6.** hooking at
socket level, and **7.** mapping IP and Ethernet addresses, with the existing
layer implementing connection tracking and the slow path.*

If additional capabilities are granted, the data path can be optimised in
several ways:

* with `CAP_NET_RAW`:
  * the adaptation layer can use raw IP sockets instead of L4 sockets,
    implementing pure connection tracking without the need for any TCP logic:
    with this variation, the guest operating system implements the single TCP
    stack needed
  * zero-copy mechanisms could be implemented using `vhost-user` and QEMU
    socket back-ends, instead of relying on a full-fledged layer 2 (Ethernet)
    interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
  * context switching in packet forwarding could be avoided by using the
    `sockmap` extension provided by `eBPF`, and by programming the `XDP` data
    hooks for in-kernel data transfers (see the sketch below)
  * using eBPF programs, we might want to switch (dynamically?) to the
    `vhost-net` facility
  * the userspace process would still need to take care of establishing
    in-kernel flows, and of providing IPv4 and IPv6 services (ARP, DHCP, NDP)
    for addressing transparency and to avoid the need for further capabilities
    (e.g. `CAP_NET_BIND_SERVICE`), but the main, fast data path would reside
    entirely in the kernel

[Istio]: https://istio.io/
[Linkerd]: https://linkerd.io/

[Networking-Challenge]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Challenge.png
[Networking-Containers]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Containers.png
[Networking-Experiments-Thin-Layer]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-Thin-Layer.png
[Networking-Experiments-eBPF]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-eBPF.png
[Networking-Slirp]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Slirp.png
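As an illustration of the `sockmap` idea above, a minimal `sk_msg` program
that redirects payload between two established sockets held in a sockmap, so
that forwarding stays in the kernel. The map layout and the fixed peer key are
purely hypothetical; the userspace process would attach the program with the
`BPF_SK_MSG_VERDICT` attach type, insert sockets into the map as connections
are tracked, and keep handling the slow path itself:

```c
/* Sketch (eBPF, restricted C): redirect payload between two established
 * sockets held in a sockmap. Map layout and key scheme are hypothetical.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u64);
} fwd_map SEC(".maps");

SEC("sk_msg")
int fast_path(struct sk_msg_md *msg)
{
	/* Hypothetical convention: the key under which userspace inserted
	 * the peer socket for this flow.
	 */
	__u32 peer_key = 0;

	/* bpf_msg_redirect_map() returns SK_PASS on success, SK_DROP on
	 * failure, so the verdict can be returned directly.
	 */
	return bpf_msg_redirect_map(msg, &fwd_map, peer_key, 0);
}

char _license[] SEC("license") = "GPL";
```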