On Tue, Apr 9, 2024 at 8:39 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Tue, Apr 09, 2024 at 07:43:07AM -0700, Alexander Duyck wrote:
>
> > I see. So this is what you were referencing. Arguably I can see both
> > sides of the issue. Ideally what should have been presented would have
> > been the root cause of why the diff
>
> Uh, that almost never happens in the kernel world. Someone does a
> great favour to us all to test rc kernels and finds bugs. The

That is why I said "ideally". Most often that cannot be the case, for
various reasons. Still, it would have been the ideal solution, not the
practical one.

> expectation is generally things like:
>
>  - The bug is fixed immediately because the issue is obvious to the
>    author
>  - Iteration and rapid progress is seen toward enlightening the author
>  - The patch is reverted, often rapidly, try again later with a good
>    patch

When working on a development branch that shouldn't be the expectation.
I suspect that is why the revert was pushed back on initially. The
developer wanted a chance to debug the issue and resolve it at its root
cause.

Honestly, what I probably would have proposed was a build flag that
allowed the code to stay but be disabled with a "Broken" label, so that
both developers could work on their own thing. Then people who
complained about the RFC non-compliance issue, but didn't care about
the Vagrant setup, could have just turned it on to verify that it fixed
their issue and to give it additional testing. However, I assume that
would have introduced additional maintenance overhead.

> Unsophisticated reporters should not experience regressions,
> period. Unsophisticated reporters should not be expected to debug
> things on their own (though it sure is nice if they can!). We really
> like it and appreciate it if reporters can run experiments!

Unsophisticated reporters/users shouldn't be running net-next.
If this has made it to, or is about to go into, Linus's tree then I
would agree the regression needs to be resolved ASAP, as that stuff
shouldn't exist past rc1 at the latest.

> In this particular instance there was some resistance getting to a fix
> quickly. I think a revert for something like this that could not be
> immediately fixed is the correct thing, especially when it affects
> significant work within the community. It gives the submitter time to
> find out how to solve the regression.
>
> That there is now so much ongoing bad blood over such an ordinary
> matter is what is really distressing here.

Well, much of it has to do with the fact that this is supposed to be a
community. Generally I help you, you help me, and together we both make
progress. So within the community people tend to build up what we could
call karma. I think some of the messages sent made it come across that
the Mellanox/Nvidia folks felt it "wasn't their problem", so they
elicited a bit of frustration from the other maintainers and built up
some negative karma.

As I mentioned, in the case of the e1000e NVRAM corruption it wasn't an
Intel issue that caused the problem, but Intel had to jump in to
address it until they found the root cause, which was function tracing.

Unfortunately one thing that tends to happen with upstream is that we
get asked to do things that aren't directly related to the project we
are working on. We saw that at Intel quite often. I referred to it at
one point as the "you stepped in it, you own it" phenomenon: if we even
brushed against a block of upstream code that wasn't being well
maintained, we would be asked to fix it up and address existing issues
before we could upstream any patches.

> I think Leon's point is broadly that those on the "vendor" side seem
> to often be accused of being a "bad vendor".
> I couldn't help but
> notice the language from Meta on this thread seemed to place Meta
> outside of being a vendor, despite having always very much been doing
> typical vendor activities like downstream forks, proprietary userspace
> and now drivers for their own devices.

I wouldn't disagree that we are doing "vendor" things. Up until about 4
years ago I was on the "vendor" side at Intel. One difference is that
Meta is also the "consumer". So if I report an issue, it is me
complaining about something as a sophisticated user instead of an
unsophisticated one. So hopefully we have gone through and done some
triage to at least bisect it down to a patch, and are willing to work
with the community as you guys did. If we can work with the other
maintainers to enable them to debug and root cause the issue, then even
better. The revert is normally the weapon of last resort, to be broken
out before the merge window opens or if an issue is caught in Linus's
tree.

> In my view the vendor/!vendor distinction is really toxic and should
> stop.

I agree. However, that was essentially what started all this, when Jiri
pointed out that we weren't selling the NIC to anyone else. That made
this all about vendor vs !vendor, and his suggestion of just giving the
NICs away isn't exactly practical. At least not on any sort of large
scale.

Maybe we should start coming up with a new term for the !vendor case.
How about "prosumer", as in "producer" and "consumer"?