On Wed, 10 Feb 2021 23:52:39 +0100 Toke Høiland-Jørgensen wrote: > Jakub Kicinski <kuba@xxxxxxxxxx> writes: > > On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote: > >> While I do agree that that kind of conformance test would be great, I > >> don't think it has to hold up this series (the perfect being the enemy > >> of the good, and all that). We have a real problem today that userspace > >> can't tell if a given driver implements, say, XDP_REDIRECT, and so > >> people try to use it and spend days wondering which black hole their > >> packets disappear into. And for things like container migration we need > >> to be able to predict whether a given host supports a feature *before* > >> we start the migration and try to use it. > > > > Unless you have a strong definition of what XDP_REDIRECT means the flag > > itself is not worth much. We're not talking about normal ethtool feature > > flags which are primarily stack-driven, XDP is implemented mostly by > > the driver, each vendor can do their own thing. Maybe I've seen one > > vendor incompatibility too many at my day job to hope for the best... > > I'm totally on board with documenting what a feature means. We're trying documentation in devlink etc. and it's not that great. It's never clear and comprehensive enough, barely anyone reads it. > E.g., for > XDP_REDIRECT, whether it's acceptable to fail the redirect in some > situations even when it's active, or if there should always be a > slow-path fallback. > > But I disagree that the flag is worthless without it. People are running > into real issues with trying to run XDP_REDIRECT programs on a driver > that doesn't support it at all, and it's incredibly confusing. The > latest example popped up literally yesterday: > > https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@xxxxxxxxxxxxxx/ To help such confusion we'd actually have to validate the program against the device caps. But perhaps I'm less concerned about a newcomer not knowing how to use things and more concerned about providing abstractions which will make programs dependably working across vendors and HW generations. > >> I view the feature flags as a list of features *implemented* by the > >> driver. Which should be pretty static in a given kernel, but may be > >> different than the features currently *enabled* on a given system (due > >> to, e.g., the TX queue stuff). > > > > Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your > > example) is implemented across drivers differs in a meaningful ways. > > Hence the need for conformance testing. We don't have a golden SW > > standard to fall back on, like we do with HW offloads. > > I'm not disagreeing that we need to harmonise what "implementing a > feature" means. Maybe I'm just not sure what you mean by "conformance > testing"? What would that look like, specifically? We developed a pretty good set of tests at my previous job for testing driver XDP as well as checking that the offload conforms to the SW behavior. I assume any vendor who takes quality seriously has comprehensive XDP tests. If those tests were upstream / common so that we could run them against every implementation - the features which are supported by a driver fall out naturally out of the set of tests which passed. And the structure of the capability API could be based on what the tests need to know to make a SKIP vs FAIL decision. Common tests would obviously also ease the validation burden, burden of writing tests on vendors and make it far easier for new implementations to be confidently submitted. > A script in selftest that sets up a redirect between two interfaces > that we tell people to run? Or what? How would you catch, say, that > issue where if a machine has more CPUs than the NIC has TXQs things > start falling apart? selftests should be a good place, but I don't mind the location. The point is having tests which anyone (vendors and users) can run to test their platforms. One of the tests should indeed test if every CPU in the platform can XDP_REDIRECT. Shouldn't it be a rather trivial combination of tun/veth, mh and taskset? > > Also IDK why those tests are considered such a huge ask. As I said most > > vendors probably already have them, and so I'd guess do good distros. > > So let's work together. > > I guess what I'm afraid of is that this will end up delaying or stalling > a fix for a long-standing issue (which is what I consider this series as > shown by the example above). Maybe you can alleviate that by expanding a > bit on what you mean? I hope what I wrote helps a little. I'm not good at explaining. Perhaps I had seen one too many vendor incompatibility to trust that adding a driver API without a validation suite will result in something usable in production settings.