Re: [net-next PATCH 00/15] eth: fbnic: Add network driver for Meta Platforms Host Network Interface

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Tue, 9 Apr 2024 09:31:06 -0700

On Tue, Apr 9, 2024 at 8:39 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Tue, Apr 09, 2024 at 07:43:07AM -0700, Alexander Duyck wrote:
>
> > I see. So this is what you were referencing. Arguably I can see both
> > sides of the issue. Ideally what should have been presented would have
> > been the root cause of why the diff
>
> Uh, that almost never happens in the kernel world. Someone does a
> great favour to us all to test rc kernels and finds bugs. The

Thus why I mentioned "Ideally". Most often that cannot be the case due
to various reasons. That said, it would have been the ideal solution,
not the practical one.

> expectation is generally things like:
>
>  - The bug is fixed immediately because the issue is obvious to the
>    author
>  - Iteration and rapid progress is seen toward enlightening the author
>  - The patch is reverted, often rapidly, try again later with a good
>    patch

When working on a development branch that shouldn't be the
expectation. I suspect that is why the revert was pushed back on
initially. The developer wanted a chance to try to debug and resolve
the issue with root cause.

Honestly what I probably would have proposed was a build flag that
would have allowed the code to stay but be disabled with a "Broken"
label to allow both developers to work on their own thing. Then if
people complained about the RFC non-compliance issue, but didn't care
about the Vagrant setup they could have just turned it on to test and
verify it fixed their issue and get additional testing. However I
assume that would have introduced additional maintenance overhead.

> Unsophisticated reporters should not experience regressions,
> period. Unsophisticated reporters should not be expected to debug
> things on their own (though it sure is nice if they can!). We really
> like it and appreciate it if reporters can run experiments!

Unsophisticated reporters/users shouldn't be running net-next. If this
has made it to or is about to go into Linus's tree then I would agree
the regression needs to be resolved ASAP as that stuff shouldn't exist
past rc1 at the latest.

> In this particular instance there was some resistance getting to a fix
> quickly. I think a revert for something like this that could not be
> immediately fixed is the correct thing, especially when it affects
> significant work within the community. It gives the submitter time to
> find out how to solve the regression.
>
> That there is now so much ongoing bad blood over such an ordinary
> matter is what is really distressing here.

Well much of it has to do with the fact that this is supposed to be a
community. Generally I help you, you help me and together we both make
progress. So within the community people tend to build up what we
could call karma. I think some of the messages sent made it come
across that the Mellanox/Nvidia folks felt it "wasn't their problem",
so they elicited a bit of frustration from the other maintainers and
built up some negative karma.

As I had mentioned in the case of the e1000e NVRAM corruption, it
wasn't an Intel issue that caused the problem, but Intel had to jump in
to address it until they found the root cause, which was function
tracing. Unfortunately one thing that tends to happen with upstream is
that we get asked to do things that aren't directly related to the
project we are working on. We saw that at Intel quite often. I
referred to it at one point as the "you stepped in it, you own it"
phenomenon where if we even brushed against a block of upstream code
that wasn't being well maintained we would be asked to fix it up and
address existing issues before we could upstream any patches.

> I think Leon's point is broadly that those on the "vendor" side seem
> to often be accused of being a "bad vendor". I couldn't help but
> notice the language from Meta on this thread seemed to place Meta
> outside of being a vendor, despite having always very much been doing
> typical vendor activities like downstream forks, proprietary userspace
> and now drivers for their own devices.

I wouldn't disagree that we are doing "vendor" things. Up until about
4 years ago I was on the "vendor" side at Intel. One difference is
that Meta is also the "consumer". So if I report an issue it is me
complaining about something as a sophisticated user instead of an
unsophisticated one. So hopefully we have gone through and done some
triage to at least bisect it down to a patch and are willing to work
with the community as you guys did. If we can work with the other
maintainers to enable them to debug and root cause the issue then even
better. The revert is normally the weapon of last resort to be broken
out before the merge window opens, or if an issue is caught in Linus's
tree.

> In my view the vendor/!vendor distinction is really toxic and should
> stop.

I agree. However that was essentially what started all this when Jiri
pointed out that we weren't selling the NIC to anyone else. That made
this all about vendor vs !vendor, and his suggestion of just giving
the NICs away isn't exactly practical. At least not at any sort of
large scale. Maybe we should start coming up with a new term for the
!vendor case. How about "prosumer", as in "producer" and "consumer"?