On Oct 7, 2024, at 03:29, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> On Sun, Oct 06, 2024 at 12:04:45PM GMT, Linus Torvalds wrote:
>> On Sat, 5 Oct 2024 at 21:33, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>>>
>>> On Sun, Oct 06, 2024 at 12:30:02AM GMT, Theodore Ts'o wrote:
>>>>
>>>> You may believe that yours is better than anyone else's, but with
>>>> respect, I disagree, at least for my own workflow and use case. And
>>>> if you look at the number of contributors in both Luis's and my xfstests
>>>> runners[2][3], I suspect you'll find that we have far more
>>>> contributors in our git repo than your solo effort....
>>>
>>> Correct me if I'm wrong, but your system isn't available to the
>>> community, and I haven't seen a CI or dashboard for kdevops?
>>>
>>> Believe me, I would love to not be sinking time into this as well, but
>>> we need to standardize on something everyone can use.
>>
>> I really don't think we necessarily need to standardize. Certainly not
>> across completely different subsystems.
>>
>> Maybe filesystem people have something in common, but honestly, even
>> that is rather questionable. Different filesystems have enough
>> different features that you will have different testing needs.
>>
>> And a filesystem tree and an architecture tree (or the networking
>> tree, or whatever) have basically almost _zero_ overlap in testing -
>> apart from the obvious side of just basic build and boot testing.
>>
>> And don't even get me started on drivers, which have a whole different
>> thing and can generally not be tested in some random VM at all.
>
> Drivers are obviously a whole different ballgame, but what I'm after is
> more:
> - tooling the community can use
> - some level of common infrastructure, so we're not all rolling our own.
>
> "Test infrastructure the community can use" is a big one, because
> enabling the community and making it easier for people to participate
> and do real development is where our pipeline of new engineers comes
> from.

Yeah, the CI is really helpful, at least for those who want to get
involved in the development of bcachefs. As a newcomer, I’m not at all
interested in setting up a separate testing environment at the very
beginning, which might be time-consuming and costly.

>
> Over the past 15 years, I've seen the filesystem community get smaller
> and older, and that's not a good thing. I've had some good success with
> giving ktest access to people in the community, who then start using it
> actively and contributing (small, so far) patches (and interestingly, a
> lot of the new activity is from China) - this means they can do
> development at a reasonable pace and I don't have to look at their code
> until it's actually passing all the tests, which is _huge_.
>
> And filesystem tests take overnight to run on a single machine, so
> having something that gets them results back in 20 minutes is also huge.

Exactly, I can verify some ideas very quickly with the help of the CI.
So, a big thank you for all the effort you've put into it!

>
> The other thing I'd really like is to take the best of what we've got
> for testrunner/CI dashboard (and opinions will vary, but of course I
> like ktest the best) and make it available to other subsystems (mm,
> block, kselftests) because not everyone has time to roll their own.
>
> That takes a lot of facetime - getting to know people's workflows,
> porting tests - so it hasn't happened as much as I'd like, but it's
> still an active interest of mine.
>
>> So no.
>> People should *not* try to standardize on something everyone can use.
>>
>> But _everybody_ should participate in the basic build testing (and the
>> basic boot testing we have, even if it probably doesn't exercise much
>> of most subsystems). That covers a *lot* of stuff that various
>> domain-specific testing does not (and generally should not).
>>
>> For example, when you do filesystem-specific testing, you very seldom
>> have many issues with different compilers or architectures. Sure,
>> there can be compiler version issues that affect behavior, but let's
>> be honest: it's very very rare. And yes, there are big-endian machines
>> and the whole 32-bit vs 64-bit thing, and that can certainly affect
>> your filesystem testing, but I would expect it to be a fairly rare and
>> secondary thing for you to worry about when you try to stress your
>> filesystem for correctness.
>
> But - a big gap right now is endian /portability/, and that one is a
> pain to cover with automated tests, because you either need access to
> both big- and little-endian hardware (at a minimum for creating test
> images), or you need to run qemu in full-emulation mode, which is pretty
> unbearably slow.
>
>> But build and boot testing? All those random configs, all those odd
>> architectures, and all those odd compilers *do* affect build testing.
>> So you as a filesystem maintainer should *not* generally strive to do
>> your own basic build test, but very much participate in the generic
>> build test that is being done by various bots (not just on linux-next,
>> but things like the 0day bot on various patch series posted to the
>> list etc).
>>
>> End result: one size does not fit all. But I get unhappy when I see
>> some subsystem that doesn't seem to participate in what I consider the
>> absolute bare minimum.
>
> So the big issue for me has been that with the -next/0day pipeline, I
> have no visibility into when it finishes; which means it has to go onto
> my mental stack of things to watch for and becomes yet another thing to
> pipeline, and the more I have to pipeline the more I lose track of
> things.
>
> (Seriously: when I am constantly tracking 5 different bug reports and
> talking to 5 different users, every additional bit of mental state I
> have to remember is death by a thousand cuts).
>
> Which would all be solved with a dashboard - which is why adding the
> build testing to ktest (or ideally, stealing _all_ the 0day tests for
> ktest) is becoming a bigger and bigger priority.
>
>> Btw, there are other ways to make me less unhappy. For example, a
>> couple of years ago, we had a string of issues with the networking
>> tree. Not because there was any particular maintenance issue, but
>> because the networking tree is basically one of the biggest subsystems
>> there are, and so bugs just happen more for that simple reason. Random
>> driver issues that got found and resolved quickly, but that kept
>> happening in rc releases (or even final releases).
>>
>> And that was *despite* the networking fixes generally having been in linux-next.
>
> Yeah, same thing has been going on in filesystem land, which is why we
> now have fs-next that we're supposed to be targeting our testing
> automation at.
>
> That one will likely come slower for me, because I need to clear out a
> bunch of CI failing tests before I'll want to look at that, but it's on
> my radar.
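
On the endian /portability/ point above: for what it's worth, here is a
minimal userspace sketch (my own illustration, not bcachefs code, and the
magic value is arbitrary) of the invariant such tests would be checking:
on-disk fields are kept in one fixed byte order and converted to/from
host order when the image is written and read, so an image created on a
little-endian box decodes identically on a big-endian one. It only relies
on the glibc htole64()/le64toh() helpers from <endian.h>:

/*
 * Illustrative sketch only - not bcachefs code. The "magic" value is an
 * arbitrary stand-in for an on-disk field.
 */
#include <endian.h>	/* htole64() / le64toh(), glibc-specific */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct disk_super {
	uint64_t magic;		/* stored little-endian on disk */
};

int main(void)
{
	const uint64_t magic = 0x1122334455667788ULL;	/* arbitrary */
	unsigned char image[sizeof(struct disk_super)];
	struct disk_super sb;

	/* "format": convert to the fixed on-disk byte order before writing */
	sb.magic = htole64(magic);
	memcpy(image, &sb, sizeof(sb));

	/* "mount": read the image back and convert to host byte order */
	memcpy(&sb, image, sizeof(sb));
	printf("magic: 0x%llx (%s)\n",
	       (unsigned long long)le64toh(sb.magic),
	       le64toh(sb.magic) == magic ? "ok" : "endian bug");
	return 0;
}

A missing conversion here only shows up when the image is created on one
endianness and read on the other, which is exactly why catching it needs
either real big-endian hardware or slow full-system qemu emulation.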
>
>> Now, the reason I mention the networking tree is that the one simple
>> thing that made it a lot less stressful was that I asked whether the
>> networking fixes pulls could just come in on Thursday instead of late
>> on Friday or Saturday. That meant that any silly things that the bots
>> picked up on (or good testers picked up on quickly) now had an extra
>> day or two to get resolved.
>
> Ok, if fixes coming in on Saturday is an issue for you, that's something
> I can absolutely change. The only _critical_ one for rc2 was the
> __wait_for_freeing_inode() fix (which did come in late); the rest
> could've waited until Monday.
>
>> Now, it may be that the string of unfortunate networking issues that
>> caused this policy was entirely just bad luck, and we just haven't
>> had that. But the networking pull still comes in on Thursdays, and
>> we've been doing it that way for four years, and it seems to have
>> worked out well for both sides. I certainly feel a lot better about
>> being able to do the (sometimes fairly sizeable) pull on a Thursday,
>> knowing that if there is some last-minute issue, we can still fix just
>> *that* before the rc or final release.
>>
>> And hey, that's literally just a "this was how we dealt with one
>> particular situation". Not everybody needs to have the same rules,
>> because the exact details will be different. I like doing releases on
>> Sundays, because that way the people who do a fairly normal Mon-Fri
>> week come in to a fresh release (whether rc or not). And people tend
>> to like sending in their "work of the week" to me on Fridays, so I get
>> a lot of pull requests on Friday, and most of the time that works just
>> fine.
>>
>> So the networking tree timing policy ended up working quite well for
>> that, but there's no reason it should be "The Rule" and that everybody
>> should do it. But maybe it would lessen the stress on both sides for
>> bcachefs too if we aimed for that kind of thing?
>
> Yeah, that sounds like the plan then.