On Wed, Oct 23, 2024 at 07:41:32AM -0400, Sasha Levin wrote:
> On Wed, Oct 23, 2024 at 09:42:59PM +1100, Michael Ellerman wrote:
> > Hi Sasha,
> >
> > This is awesome.
> >
> > Sasha Levin <sashal@xxxxxxxxxx> writes:
> > > On Tue, Oct 22, 2024 at 01:49:31PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 22, 2024 at 03:06:38PM -0400, Sasha Levin wrote:
> > > > > other information that would be useful?
> > > >
> > > > As a maintainer I probably would've found this to be annoying, but
> > > > with all my other outside observer / participant hats on, I think
> > > > it's very good to have a bot to expose maintainers not following
> > > > the process.
> > >
> > > This was my thinking too. Maybe it makes sense for the bot to shut
> > > up if things look good (i.e. >N days in stable, everything on the
> > > mailing list). Or maybe just a simple "LGTM" or a "Reviewed-by:..."?
> >
> > I think it has to reply with something, otherwise folks will wonder
> > if the bot has broken or missed their pull request.
> >
> > But if all commits were in linux-next and posted to a list, then the
> > only content is the "Days in linux-next" histogram, which is not that
> > long and is useful information IMHO.
> >
> > It would be nice if you could trim the tail of the histogram below
> > the last populated row, that would make it more concise.
>
> Makes sense, I'll do that.
>
> > For fixes pulls it is sometimes legitimate for commits not to have
> > been in linux-next. But I think it's still good for the bot to
> > highlight those; ideally fixes that miss linux-next are either very
> > urgent or minor.
>
> Right, and Linus said he's okay with those. This is not a "shame" list
> but rather a "look a little closer" list.

Ok, that makes me feel better about this. I've got stuff that I hold
back for weeks (or months), and others that I'm fine with sending the
next day, once it's passed my CI.

I'm going to try to be better about calling out which patches carry
risk (and why that risk is justified, else I wouldn't be sending them),
and which patches look more involved but I have good reason to be
confident about - that can get quite subtle. That's good for everyone
down the line in terms of knowing what to expect.

I wonder if some of this is also motivated by people concerned about
things in bcachefs moving too fast and running the risk of regressions?
That's a justifiable concern, and priorities might be worth talking
about a bit.

I'm not currently seeing anything that makes me too concerned about
regressions: users in general aren't complaining about them (on the
previous pull request a user chimed in that he had been seeing less
stability over the past few kernel releases, but he tried switching
back to btrfs and the issues were still there), and my test dashboard
is steadily improving.

I do still have fairly critical (i.e. downtime causing) user reported
issues coming in that are taking most of my time, although they're
getting off into the weeds - one I've been working on the past few days
was reported by a user with a ~20 drive filesystem where we're
overflowing the maximum number of pointers in an extent, due to keeping
too many cached copies around, and his filesystem goes emergency
read-only. And there still seems to be something not-quite-right with
snapshots and unlinked inodes, possibly a repair issue.
The test dashboard still has a long way to go before it's anywhere near
as clean as I want (and it needs to be, so that I can easily spot
regressions), but the number of test failures has been steadily
dropping, the results are getting more consistent, and none of the test
failures there are scary ones that need to be jumped on.

https://evilpiepirate.org/~testdashboard/ci?user=kmo&branch=bcachefs-testing

(We were at 150-160 failures per full run a few weeks ago, now 100-110.
The full runs also run fstests in 8 different configurations, so there
are lots of duplicates.)

There have been a lot of new syzbot reports recently, and some of those
do look more concerning. I don't think this indicates regressions -
this looks like syzbot getting better at finding interesting codepaths
with its code-coverage-guided testing, and for the concerning ones I
don't think the code changed in the right timeframe for them to be
regressions. I think several of the recent syzbot reports are all due
to the same bug, where it looks like we're not going read-only
correctly and interior btree updates are still happening - I suspect
that's been there for ages and an assert I recently added is making it
more visible.

I am still heavily in triage mode, and filesystem repair/recovery bugs
take top priority. In general, my priorities are
- critical user reported bugs first
- failures from my automated test suite second (unless they're
  regressions)
- syzbot last, because there's been basically zero overlap between
  syzbot bugs and user-affecting bugs so far, and because that's been
  an easy place for new people to jump in and help.