On Tue, 24 Nov 2020 15:44:13 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:
> > > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> > > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:
> > > > > > > From: Eli Cohen <eli@xxxxxxxxxxxx>
> > > > > > >
> > > > > > > Add a new namespace type to the NIC RX root namespace to
> > > > > > > allow for inserting VDPA rules before regular NIC but after
> > > > > > > bypass, thus allowing DPDK to have precedence in packet
> > > > > > > processing.
> > > > > >
> > > > > > How do DPDK and VDPA relate in this context?
> > > > >
> > > > > mlx5 steering is hierarchical and defines precedence amongst
> > > > > namespaces. Up till now, the VDPA implementation would insert a
> > > > > rule into the MLX5_FLOW_NAMESPACE_BYPASS hierarchy, which is
> > > > > used by DPDK, thus taking all the incoming traffic.
> > > > >
> > > > > The MLX5_FLOW_NAMESPACE_VDPA hierarchy comes after
> > > > > MLX5_FLOW_NAMESPACE_BYPASS.
> > > >
> > > > Our policy was no DPDK driver bifurcation. There's no asterisk
> > > > saying "unless you pretend you need flow filters for RDMA, get
> > > > them upstream and then drop the act".
> > >
> > > Huh?
> > >
> > > mlx5 DPDK is an *RDMA* userspace application.
> >
> > Forgive me for my naiveté.
> >
> > Here I thought the RDMA subsystem is for doing RDMA.
>
> RDMA covers a wide range of accelerated networking these days.. Where
> else are you going to put this stuff in the kernel?

IDK what else you got in there :) It's probably a case-by-case answer.

IMHO even using libibverbs is no strong reason for things to fall under
RDMA exclusively. Client drivers of virtio don't get silently funneled
through a separate tree just because they use a certain spec.

> > I'm sure if you start doing crypto over ibverbs crypto people will
> > want to have a look.
>
> Well, RDMA has crypto transforms for a few years now too.

Are you talking about RDMA traffic being encrypted? That's a different
case. My example was alluding to access to a generic crypto accelerator
over ibverbs. I hope you'd let crypto people know when merging such a
thing...

> Why would crypto subsystem people be involved? It isn't using or
> duplicating their APIs.
>
> > > libibverbs. It runs on the RDMA stack. It uses RDMA flow
> > > filtering and RDMA raw ethernet QPs.
> >
> > I'm not saying that's not the case. I'm saying I don't think this
> > was something that netdev developers signed off on.
>
> Part of the point of the subsystem split was to end the fighting that
> started all of it. It was very clear during the whole iWarp and TCP
> Offload Engine business in the mid 2000s that netdev wanted nothing
> to do with the accelerator world.

I was in middle school at the time, not sure what exactly went down :)
But I'm going by common sense here. Perhaps there was an agreement I'm
not aware of?

> So why would netdev need sign-off on any accelerator stuff?

I'm not sure why you keep saying accelerators! What is accelerated in
raw Ethernet frame access??

> Do you want to start co-operating now? I'm willing to talk about how
> to do that.

IDK how that's even in question. I always try to bump all RDMA-looking
stuff to linux-rdma when it's not CCed there. That's the bare minimum
of cooperation I'd expect from anyone.
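
For context on the steering hierarchy Eli describes above: a kernel
consumer asks the mlx5 core for a namespace and anchors its tables and
rules there. A rough sketch of how the new namespace would be picked
up, assuming the MLX5_FLOW_NAMESPACE_VDPA enum from this series; the
helper name, table size and group count are illustrative, not the
actual vdpa driver code:

    #include <linux/mlx5/driver.h>
    #include <linux/mlx5/fs.h>

    /* Illustrative helper, not from the series itself. */
    static struct mlx5_flow_table *vdpa_rx_table_get(struct mlx5_core_dev *mdev)
    {
            struct mlx5_flow_table_attr ft_attr = {};
            struct mlx5_flow_namespace *ns;

            /* This namespace sits after BYPASS (raw QP / DPDK rules)
             * but before the regular NIC RX namespaces in the hierarchy.
             */
            ns = mlx5_get_flow_namespace(mdev, MLX5_FLOW_NAMESPACE_VDPA);
            if (!ns)
                    return ERR_PTR(-EOPNOTSUPP);

            ft_attr.max_fte = 1024;              /* illustrative */
            ft_attr.autogroup.max_num_groups = 2; /* illustrative */
            return mlx5_create_auto_grouped_flow_table(ns, &ft_attr);
    }

Rules added into such a table with mlx5_add_flow_rules() are then only
evaluated for packets that no BYPASS rule claimed first.
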
> > And our policy on DPDK is pretty widely known.
>
> I honestly have no idea on the netdev DPDK policy, I'm maintaining
> the RDMA subsystem not DPDK :)

That's what I thought, but it turns out DPDK is your important user.

> > Would you mind pointing us to the introduction of raw Ethernet QPs?
> >
> > Is there any production use for that without DPDK?
>
> Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
> IBA specification circa 2000. When RoCE was defined (around 2010) they
> were naturally carried forward to Ethernet. The "flow steering"
> concept to make raw ethernet QPs useful was added to verbs around
> 2012-2013. It officially made it upstream in commit 436f2ad05a0b
> ("IB/core: Export ib_create/destroy_flow through uverbs").
>
> If I recall properly, the first real application was ultra low latency
> ethernet processing for financial applications.
>
> DPDK later adopted the first mlx4 PMD using this libibverbs API around
> 2015. Interestingly, the mlx4 PMD was made through an open source
> process with minimal involvement from Mellanox, based on the
> pre-existing RDMA work.
>
> Currently there are many projects, many of them open source, built on
> top of the RDMA raw ethernet QP and RDMA flow steering model. It is
> now long-established kernel ABI.
>
> > > It has been like this for years, it is not some "act".
> > >
> > > It is long-standing uABI that accelerators like RDMA/etc get to
> > > take the traffic before netdev. This cannot be reverted. I don't
> > > really understand what you are expecting here?
> >
> > Same. I don't really know what you expect me to do either. I don't
> > think I can sign off on kernel changes needed for DPDK.
>
> This patch is fine-tuning the shared logic that splits the traffic to
> accelerator subsystems, I don't think netdev should have a veto here.
> This needs to be consensus among the various communities and
> subsystems that rely on this.
>
> Eli did not explain this well in his commit message. When he said DPDK
> he meant RDMA, which is the owner of the FLOW_NAMESPACE. Each
> accelerator subsystem gets hooked into this, so here VDPA is getting
> its own hook, because re-using the same hook between two kernel
> subsystems is buggy.

I'm not so sure about this. The switchdev modeling is supposed to give
users control over the flow of traffic in a sane, well-defined way, as
opposed to the magic flow filtering of the early SR-IOV
implementations, which every vendor had their own twist on.

Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all switching
happens in the NIC? That breaks the switchdev model. We're back to
per-vendor magic.

And why do you need a separate VDPA table in the first place?
Forwarding to a VDPA device has different semantics than forwarding to
any other VF/SF?
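
For readers unfamiliar with the uAPI Jason refers to above: raw packet
QPs plus flow steering are exposed to userspace through libibverbs,
where ibv_create_flow() is the userspace side of commit 436f2ad05a0b.
A minimal sketch of the model, assuming device/PD/CQ/QP setup has
already been done; the helper name and the MAC address are made up for
illustration, not taken from any particular application:

    #include <infiniband/verbs.h>

    struct raw_eth_flow {
            struct ibv_flow_attr     attr;
            struct ibv_flow_spec_eth eth;
    } __attribute__((packed));

    /* qp must be an IBV_QPT_RAW_PACKET QP */
    static struct ibv_flow *steer_dmac_to_qp(struct ibv_qp *qp)
    {
            struct raw_eth_flow f = {
                    .attr = {
                            .type         = IBV_FLOW_ATTR_NORMAL,
                            .size         = sizeof(f),
                            .num_of_specs = 1,
                            .port         = 1,
                    },
                    .eth = {
                            .type = IBV_FLOW_SPEC_ETH,
                            .size = sizeof(f.eth),
                            .val.dst_mac  = { 0x02, 0, 0, 0, 0, 0x01 },
                            .mask.dst_mac = { 0xff, 0xff, 0xff,
                                              0xff, 0xff, 0xff },
                    },
            };

            /* Frames matching the filter are delivered to this QP
             * instead of following the normal netdev RX path.
             */
            return ibv_create_flow(qp, &f.attr);
    }

The rule is removed again with ibv_destroy_flow().
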