On Wed, Nov 25, 2020 at 10:54:22AM -0800, Jakub Kicinski wrote:

> > RDMA covers a wide range of accelerated networking these days..
> > Where else are you going to put this stuff in the kernel?
>
> IDK what else you got in there :) It's probably a case by case answer.

Hmm, yes, it seems endless sometimes :(

> IMHO even using libibverbs is no strong reason for things to fall
> under RDMA exclusively. Client drivers of virtio don't get silently
> funneled through a separate tree just because they use a certain spec.

I'm not sure I understand this. libibverbs is the user library to
interface with the kernel RDMA subsystem. I don't care what apps people
build on top of it, and it doesn't matter to me that netdev and DPDK
have some kind of feud.

> > > I'm sure if you start doing crypto over ibverbs crypto people
> > > will want to have a look.
> >
> > Well, RDMA has crypto transforms for a few years now too.
>
> Are you talking about RDMA traffic being encrypted? That's a
> different case.

That too, but in general, anything netdev can do can be done via RDMA
in userspace. So all the kTLS and IPSEC xfrm HW offloads mlx5 supports
are available in userspace too.

> > Part of the point of the subsystem split was to end the fighting
> > that started all of it. It was very clear during the whole iWarp
> > and TCP Offload Engine business in the mid 2000s that netdev wanted
> > nothing to do with the accelerator world.
>
> I was in middle school at the time, not sure what exactly went down :)

Ah, it was quite the thing. Microsoft and Co were heavily pushing TOE
technology (Microsoft Chimney!) as the next sure thing, and I recall
DaveM & co were completely against it in Linux. I will admit at the
time I was doubtful, but in hindsight this was the correct choice.
netdev would not look like it does today if it had been shackled by
the HW implementations of the day.

Instead all this HW stuff ended up largely in RDMA, and some in block
with the iSCSI mania of old. It is quite evident to me what a mess
being tied to HW causes for a SW ecosystem; DRM and RDMA both suffer
in a very similar way because of it.

However, over the last 20 years it has held true that there is
*always* a compelling reason for certain applications to use something
from the accelerator side. It is not for everyone, but the specialized
applications that need it *really need it*. For instance, it can be
the difference between getting a COVID simulation result in a few
weeks vs.. well.. never.

> But I'm going by common sense here. Perhaps there was an agreement
> I'm not aware of?

The resolution to the argument above was to split them in Linux. Thus
what is logically networking was split up in the kernel between netdev
and the accelerator subsystems (iSCSI, RDMA, and so on).

The general notion is that netdev doesn't have to accommodate anything
an accelerator does. If you choose to run accelerators, you do not get
to complain that your ethtool counters are wrong, that your routing
tables and tc don't work, that firewalling doesn't work, etc. That is
all broken by design. In turn, the accelerators do their own thing,
tap the traffic before it hits netdev, and so on. netdev does not care
what goes on over there and is not responsible.

I would say this is the basic unspoken agreement of the last 15 years.
Both have a right to exist in Linux. Both have a right to use the
physical ethernet port.

> > So why would netdev need sign off on any accelerator stuff?
>
> I'm not sure why you keep saying accelerators!
>
> What is accelerated in raw Ethernet frame access??

The nature of the traffic is not relevant. If it goes through RDMA, it
is accelerator traffic (vs netdev traffic, which goes to netdev).

Even if you want to be pedantic, in the raw ethernet area there is
lots of HW special accelerated stuff going on. Mellanox has some
really neat hard realtime networking technology that works on raw
ethernet packets, for instance. And of course raw ethernet is a
fraction of what RDMA covers; iWarp and RoCE are much more like what
you might imagine when you hear the word accelerator.
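To make "raw Ethernet frame access" concrete, here is roughly what a
userspace app does with libibverbs to pull raw frames out before
netdev ever sees them. This is only a sketch: error handling, the
ibv_modify_qp() state transitions, and recv buffer posting are all
omitted, and the match values are illustrative.

	/* Raw packet QPs require CAP_NET_RAW */
	#include <infiniband/verbs.h>

	static struct ibv_flow *tap_raw_ethernet(void)
	{
		struct ibv_device **devs = ibv_get_device_list(NULL);
		struct ibv_context *ctx = ibv_open_device(devs[0]);
		struct ibv_pd *pd = ibv_alloc_pd(ctx);
		struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

		struct ibv_qp_init_attr qpa = {
			.send_cq = cq,
			.recv_cq = cq,
			.cap = { .max_recv_wr = 64, .max_recv_sge = 1 },
			.qp_type = IBV_QPT_RAW_PACKET, /* raw ethernet */
		};
		struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

		/* Frames matching this steering rule land in the QP,
		 * not in netdev */
		struct {
			struct ibv_flow_attr attr;
			struct ibv_flow_spec_eth eth;
		} flow = {
			.attr = {
				.type = IBV_FLOW_ATTR_NORMAL,
				.size = sizeof(flow),
				.num_of_specs = 1,
				.port = 1,
			},
			.eth = {
				.type = IBV_FLOW_SPEC_ETH,
				.size = sizeof(struct ibv_flow_spec_eth),
				/* .val/.mask pick dst MAC, vlan,
				 * ethertype, ... */
			},
		};
		return ibv_create_flow(qp, &flow.attr);
	}

No netdev API is involved anywhere in that path, which is the whole
point of the bypass.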
> > Do you want to start co-operating now? I'm willing to talk about
> > how to do that.
>
> IDK how that's even in question. I always try to bump all
> RDMA-looking stuff to linux-rdma when it's not CCed there. That's the
> bare minimum of cooperation I'd expect from anyone.

I mean co-operate in the sense of defining a scheme where the two
worlds are not completely separated and isolated.

> > > And our policy on DPDK is pretty widely known.
> >
> > I honestly have no idea on the netdev DPDK policy,
> > I'm maintaining the RDMA subsystem not DPDK :)
>
> That's what I thought, but turns out DPDK is your important user.

Nonsense. I don't have stats, but the majority of people I work with
using RDMA are not using DPDK. DPDK serves two somewhat niche markets,
NFV and certain hyperscalers, while RDMA covers the entire scientific
computing community and a big swath of the classic "Big Iron"
enterprise stuff, like databases and storage.

> Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all
> switching happens in the NIC? That breaks the switchdev model. We're
> back to per-vendor magic.

No, as I explained before, the switchdev completely contains the
SF/VF, and all applications running on a mlx5_core are trapped by it.
This includes netdev, RDMA and VDPA.

> And why do you need a separate VDPA table in the first place?
> Forwarding to a VDPA device has different semantics than forwarding
> to any other VF/SF?

The VDPA table is not switchdev. Go back to my overly long email about
VDPA; here we are talking about the "selector" that chooses which
subsystem the traffic will go to. The selector sits after switchdev
but before netdev, VDPA, and RDMA. Each accelerator subsystem gets a
table: RDMA, VDPA, and netdev all get one. It is some part of the HW
that makes the selection work.
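If it helps to picture it, the pipeline looks conceptually like this.
The names here are invented for illustration; they are not the actual
mlx5 structures.

	struct steering_table;	/* a HW match/action table */

	struct function_steering {
		/* switchdev: traps everything entering or leaving
		 * the SF/VF */
		struct steering_table *eswitch;

		/* picks which subsystem consumes the packet */
		struct steering_table *selector;

		/* one table per subsystem, programmed by that
		 * subsystem */
		struct steering_table *netdev;
		struct steering_table *rdma;	/* e.g. raw packet QP flows */
		struct steering_table *vdpa;
	};

	/*
	 * A frame from the physical port walks:
	 *
	 *   eswitch (switchdev) -> selector -> one subsystem table
	 *
	 * This is why netdev's counters, tc and firewalling never see
	 * RDMA or VDPA traffic: it was selected away before netdev
	 * appeared in the path.
	 */

Jason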