On Tue, 1 Oct 2024 14:54:29 +0900 Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > On 2024/09/30 0:33, Stephen Hemminger wrote: > > On Sun, 29 Sep 2024 16:10:47 +0900 > > Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > > > >> On 2024/09/29 11:07, Jason Wang wrote: > >>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>> > >>>> On 2024/09/27 13:31, Jason Wang wrote: > >>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>>>> > >>>>>> On 2024/09/25 12:30, Jason Wang wrote: > >>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash > >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM. > >>>>>>>> However, computing the hash after the queue was chosen defeats the > >>>>>>>> purpose of RSS. > >>>>>>>> > >>>>>>>> Another approach is to use eBPF steering program. This approach has > >>>>>>>> another downside: it cannot report the calculated hash due to the > >>>>>>>> restrictive nature of eBPF. > >>>>>>>> > >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome > >>>>>>>> thse challenges. > >>>>>>>> > >>>>>>>> An alternative solution is to extend the eBPF steering program so that it > >>>>>>>> will be able to report to the userspace, but it is based on context > >>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will > >>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM > >>>>>>>> and vhost_net). > >>>>>>>> > >>>>>>> > >>>>>>> I wonder if we could clone the skb and reuse some to store the hash, > >>>>>>> then the steering eBPF program can access these fields without > >>>>>>> introducing full RSS in the kernel? > >>>>>> > >>>>>> I don't get how cloning the skb can solve the issue. > >>>>>> > >>>>>> We can certainly implement Toeplitz function in the kernel or even with > >>>>>> tc-bpf to store a hash value that can be used for eBPF steering program > >>>>>> and virtio hash reporting. However we don't have a means of storing a > >>>>>> hash type, which is specific to virtio hash reporting and lacks a > >>>>>> corresponding skb field. > >>>>> > >>>>> I may miss something but looking at sk_filter_is_valid_access(). It > >>>>> looks to me we can make use of skb->cb[0..4]? > >>>> > >>>> I didn't opt to using cb. Below is the rationale: > >>>> > >>>> cb is for tail call so it means we reuse the field for a different > >>>> purpose. The context rewrite allows adding a field without increasing > >>>> the size of the underlying storage (the real sk_buff) so we should add a > >>>> new field instead of reusing an existing field to avoid confusion. > >>>> > >>>> We are however no longer allowed to add a new field. In my > >>>> understanding, this is because it is an UAPI, and eBPF maintainers found > >>>> it is difficult to maintain its stability. > >>>> > >>>> Reusing cb for hash reporting is a workaround to avoid having a new > >>>> field, but it does not solve the underlying problem (i.e., keeping eBPF > >>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl > >>>> is a reasonable option to keep the API as stable as other virtualization > >>>> UAPIs while respecting the underlying intention of the context rewrite > >>>> feature freeze. > >>> > >>> Fair enough. > >>> > >>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably > >>> via cls or other). It might worth to see if anything we miss here. > >> > >> Thanks for the information. I wonder why they used cls instead of > >> steering program. Perhaps it may be due to compatibility with macvtap > >> and ipvtap, which don't steering program. > >> > >> Their RSS implementation looks cleaner so I will improve my RSS > >> implementation accordingly. > >> > > > > DPDK needs to support flow rules. The specific case is where packets > > are classified by a flow, then RSS is done across a subset of the queues. > > The support for flow in TUN driver is more academic than useful, > > I fixed it for current BPF, but doubt anyone is using it really. > > > > A full steering program would be good, but would require much more > > complexity to take a general set of flow rules then communicate that > > to the steering program. > > > > It reminded me of RSS context and flow filter. Some physical NICs > support to use a dedicated RSS context for packets matched with flow > filter, and virtio is also gaining corresponding features. > > RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178 > Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179 > > I considered about the possibility of supporting these features with tc > instead of adding ioctls to tuntap, but it seems not appropriate for > virtualization use case. > > In a virtualization use case, tuntap is configured according to requests > of guests, and the code processing these requests need to have minimal > permissions for security. This goal is achieved by passing a file > descriptor that represents a tuntap from a privileged process (e.g., > libvirt) to the process handling guest requests (e.g., QEMU). > > However, tc is configured with rtnetlink, which does not seem to have an > interface to delegate a permission for one particular device to another > process. > > For now I'll continue working on the current approach that is based on > ioctl and lacks RSS context and flow filter features. Eventually they > are also likely to require new ioctls if they are to be supported with > vhost_net. The DPDK flow handling (rte_flow) was started by Mellanox and many of the features are to support what that NIC can do. Would be good to have a tc way to configure that (or devlink).