Re: xfs: system fails to boot up due to Internal error xfs_trans_cancel

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 6 Jun 2023 12:46:15 +1000

On Mon, Jun 05, 2023 at 02:57:45PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 05, 2023 at 03:27:43PM +0200, Thorsten Leemhuis wrote:
> > /me waves friendly
> > 
> > On 18.04.23 06:56, Darrick J. Wong wrote:
> > > On Mon, Apr 17, 2023 at 01:16:53PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> > >> for once, to make this easily accessible to everyone.
> > >>
> > >> Has any progress been made to fix below regression? It doesn't look like
> > >> it from here, hence I wondered if it fall through the cracks. Or is
> > >> there some good reason why this is safe to ignore?
> > > 
> > > Still working on thinking up a reasonable strategy to reload the incore
> > > iunlink list if we trip over this.  Online repair now knows how to do
> > > this[1], but I haven't had time to figure out if this will work
> > > generally.  [...]
> > 
> > I still have this issue on my list of tracked regressions, hence please
> > allow me to ask: was there any progress to resolve this? Doesn't look
> > like it, but from my point it's easy to miss something.
> 
> Yeah -- Dave put "xfs: collect errors from inodegc for unlinked inode
> recovery" in for-next yesterday, and I posted a draft of online repair
> for the unlinked lists that corrects most of the other problems that we
> found in the process of digging into this problem:
> https://lore.kernel.org/linux-xfs/168506068642.3738067.3524976114588613479.stgit@frogsfrogsfrogs/T/#m861e4b1259d9b16b9970e46dfcfdae004a5dd634
> 
> But that's looking at things from the ground up, which isn't terribly
> insightful as to what's going on, as you've noted. :)
> 
> > BTW, in case this was not yet addressed: if you have a few seconds,
> > could you please (just briefly!) explain why it seems to take quite a
> > while to resolve this? A "not booting" regressions sounds like something
> > that I'm pretty sure Linus normally wants to see addressed rather sooner
> > than later. But that apparently is not the case here. I know that XFS
> > devs normally take regressions seriously, hence I assume there are good
> > reasons for it. But I'd like to roughly understand them (is this a
> > extremely corner case issue others are unlike to run into or something
> > like that?), as I don't want Linus on my back with questions like "why
> > didn't you put more pressure on the XFS maintainers" or "you should have
> > told me about this".
> 
> First things first -- Ritesh reported problems wherein a freshly mounted
> filesystem would fail soon after because of some issue or other with the
> unlinked inode list.  He could reproduce this problem, but (AFAIK) he's
> the only user who's actually reported this.  It's not like *everyone*
> with XFS cannot boot anymore, it's just this system.  Given the sparsity
> of any other reports with similar symptoms, I do not judge this to be
> a hair-on-fire situation.
> 
> (Contrast this to the extent busy deadlock problem that Wengang Wang is
> trying to solve, which (a) is hitting many customer systems and (b)
> regularly.  Criteria like (a) make things like that a higher severity
> problem IMHO.)

Contrast this to the regression from 6.3-rc1 that caused actual user
data loss and filesystem corruption after 6.3 was released.

https://bugzilla.redhat.com/show_bug.cgi?id=2208553

That's so much more important than any problem seen in a test
environment it's not funny.

We'd already fixed the regression that caused it in 6.4-rc1 - the
original bug report (a livelock in data writeback) happened 2 days
before 6.3 was released. It took me 3 days from report to having a fix
out for review (remember that timeframe).

At the time we didn't recognise the wider corruption risk the
failure we found exposed us to, so we didn't push it to stable
immediately. Hence users started tripping over corruption, and I
triaged it down to a misdirected data write from -somewhere-. Eric
then found a reproducer and bisected to a range of XFS changes, and
I then realised what the problem was....

Stuff like this takes days of effort and multiple people to get to
the bottom of, and -everything else- gets ignored while we work
through the corruption problem.

Thorsten, I'm betting that you didn't even know about this
regression - it's been reported, tracked, triaged and fixed
completely outside the scope and visibility of the "kernel
regression tracker". Which clearly shows that we don't actually need
some special kernel regression tracker infrastructure to do our jobs
properly, nor do we need a nanny to make sure we actually are
prioritising things correctly....

....

> Dave's patch addresses #5 by plumbing error returns up the stack so that
> frontend processes that push the background gc threads can receive
> errors and throw them out to userspace.

Right, the inodegc regression fix simply restored the previous
status quo.  Nothing more, nothing less, exactly what we want
regression fixes to do. But it took some time for me to get to
because there were much higher priority events occurring....

....

> The problem with putting this in online repair is that Dave (AFAIK)
> feels very strongly that every bug report needs to be triaged
> immediately, and that takes priority over reviewing new code such as
> online repair.

My rationale is that we can ignore it once we know the scope of the
issue, but until we know that information the risk of being
unprepared for sudden escalation is rather high and that's even
worse for stress and burnout levels.

The fedora corruption bug I mention above is a canonical example of
why triaging bug reports immediately is important - the original bug
report was clearly something that needed to be fixed straight away,
regardless of the fact we didn't know it could cause misdirected
writes at the time.

Once we got far enough into the fedora bug report triage, I simply
pointed the distro at the commit for them to test, and they did
everything else. Once confirmation that it fixed the problem came
in, I sent it immediately to the stable kernel maintainers.

IOWs, if I had not paid attention to the original bug report, it
would have taken me several more days to find the problem and fix it
(remember it took me ~3 days from report to fix originally).  Users
would have been exposed to the corruption bug for much longer than
they were, and that doesn't make a bad situation any better.

And don't get me started on syzkaller and "security researchers"
raising inappropriate CVEs....

So, yeah, immediate triage is pretty much required at this point for
all bug reports because the downstream impact of ignoring them is
only causing more stress and burnout risk for lots more people. The
number of downstream people pulled into (and still dealing with the
fallout of) that recent, completely unnecessary CVE fire drill was
just ... crazy.

We can choose to ignore triaged bug reports if they aren't important
enough to deal with immediately (like this unlinked inode list
issue), but we can make the decision (and justify it) based on the
knowledge we have rather than claiming ignorance. We're
supposed to be professional engineers, yes?

> That's the right thing to do, but every time someone
> sends in some automated fuzzer report, it slows down online repair
> review.  This is why I'm burned out and cranky as hell about script
> kiddies dumping zerodays on the list and doing no work to help us fix
> the problems.

Reality sucks, and I hate it too. We get handed all the shit
sandwiches and everyone seems to expect that we will simply eat
them up without complaining. But it's not like we didn't expect it -
upstream Linux development has always been a great big shit sandwich
and it's not going to be changing any time soon....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx