Re: [GIT PULL] bcachefs updates for 6.8

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 17 Jan 2024 00:54:57 -0500

On Tue, Jan 16, 2024 at 11:41:25PM -0500, Kent Overstreet wrote:
> > > No, it's a leadership/mentorship thing.
> > > 
> > > And this is something that's always been lacking in kernel culture.
> > > Witness the kind of general grousing that goes on at maintainer summits;
> > > maintainers complain about being overworked and people not stepping up
> > > to help with the grungy responsibilities, while simultaneously we still

     <blah blah blah>

> > > Tests and test infrastructure fall into the necessary but not fun
> > > category, so they languish.
> > 
> > No, they fall into the "no company wants to pay someone to do the work"
> > category, so it doesn't get done.
> > 
> > It's not a "leadership" issue, what is the "leadership" supposed to do
> > here, refuse to take any new changes unless someone ponies up and does
> > the infrastructure and testing work first?  That's not going to fly, for
> > valid reasons.

Greg is absolutely right about this.

> But good tools are important because they affect the rate of everyday
> development; they're a multiplier on the money everyone is spending on
> salaries.

Alas, companies don't see it that way.  They take the value that they
get from Linux for granted, and they only care about the multiplier
effect on their own employees' salaries (and sometimes not even that).
They most certainly don't care about the salutary effects on the
entire ecosystem.  At least, I haven't seen any company make funding
decisions on that basis.

It's easy enough for you to blame "leadership", but the problem is the
leaders at the VP and SVP level who control the budgets, not the
leadership of the maintainers, who are overworked, and who often
invest in testing themselves, on their own personal time, because they
don't get adequate support from others.

It's also for that reason that we want proof that people won't just
stick around long enough for their pet feature (or in the case of
ntfs, their pet file system) to get into the kernel --- and then
disappear.  Far too often, this is what happens, either because they
have had their itch scratched, or because their company reassigns them
to some other project that is important for the company's bottom line.

If that person is willing to invest their own personal time, long
after work hours, to steward their contribution in the absence of
corporate support, great.  But we need to have that proven to us, or
at the very least, make sure the feature's long-term maintenance
burden is as low as possible, to mitigate the risk that we won't see
the new engineer after their feature lands upstream.

> Having one common way of running all our functional VM tests, and a
> common collection of those tests would be a huge win for productivity
> because _way_ too many developers are still using slow ad hoc testing
> methods, and a good test runner (ktest) gets the edit/compile/test cycle
> down to < 1 minute, with the same tests framework for local development
> and automated testing in the big test cloud...

I'm going to call bullshit on this assertion.  The fact that we have
multiple ways of running our tests is not the reason why testing takes
a long time.

If you are going to run stress tests, which is critical for testing
real file systems, that's going to take at least an hour; more if you
want to test multiple file system features.  The full regression set
for ext4, using the common fstests test suite, takes about 25 hours
of VM time; and about 2.5 hours of wall clock time since I shard it
across a dozen VMs.
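
To make the arithmetic concrete, here is a rough sketch of how sharding
turns VM time into wall clock time.  It is an illustration only: the
per-shard durations are made up, and the greedy longest-first
assignment is an assumption, not a description of how any real fstests
runner schedules its work.

```python
import heapq


def shard_loads(durations_hours, num_vms):
    """Assign test groups to VMs greedily (longest first, lightest VM next).

    Returns the total hours assigned to each VM, sorted ascending.
    """
    loads = [0.0] * num_vms
    heapq.heapify(loads)
    for duration in sorted(durations_hours, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return sorted(loads)


# Hypothetical workload: 100 test groups of 15 minutes each = 25 hours of VM time.
durations = [0.25] * 100
num_vms = 12

loads = shard_loads(durations, num_vms)
vm_time = sum(durations)
ideal_wall_clock = vm_time / num_vms   # lower bound with perfect balance
actual_wall_clock = max(loads)         # the slowest VM decides when the run ends

print(f"VM time:           {vm_time:.2f}h")
print(f"ideal wall clock:  {ideal_wall_clock:.2f}h")
print(f"actual wall clock: {actual_wall_clock:.2f}h")
```

However the work is divided, the wall clock time cannot drop below the
total VM time divided by the number of VMs, and uneven shards push it
higher.  A different way of launching the tests does not change that
bound.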

Yes, we could try to add some unit tests, which take much less time
than running tests where fstests is creating a file system, mounting
it, exercising the code through userspace functions, and then
unmounting the file system and checking it.  Even if that were an
adequate replacement for some of the existing fstests, (a) it's not a
replacement for stress testing, and (b) this would require a vast
amount of file system specific software engineering investment, and
where is that going to come from?

The bottom line is that the lack of one common way of running our
functional VM tests is not even *close* to the root cause of the
problem.

	    	       	  	       - Ted