On Mon, Jun 05, 2023 at 02:57:45PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 05, 2023 at 03:27:43PM +0200, Thorsten Leemhuis wrote:
> > /me waves friendly
> >
> > On 18.04.23 06:56, Darrick J. Wong wrote:
> > > On Mon, Apr 17, 2023 at 01:16:53PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> > >> for once, to make this easily accessible to everyone.
> > >>
> > >> Has any progress been made to fix the regression below? It doesn't look
> > >> like it from here, hence I wondered if it fell through the cracks. Or is
> > >> there some good reason why this is safe to ignore?
> > >
> > > Still working on thinking up a reasonable strategy to reload the incore
> > > iunlink list if we trip over this. Online repair now knows how to do
> > > this[1], but I haven't had time to figure out if this will work
> > > generally. [...]
> >
> > I still have this issue on my list of tracked regressions, hence please
> > allow me to ask: was there any progress to resolve this? Doesn't look
> > like it, but from my point of view it's easy to miss something.
>
> Yeah -- Dave put "xfs: collect errors from inodegc for unlinked inode
> recovery" in for-next yesterday, and I posted a draft of online repair
> for the unlinked lists that corrects most of the other problems that we
> found in the process of digging into this problem:
> https://lore.kernel.org/linux-xfs/168506068642.3738067.3524976114588613479.stgit@frogsfrogsfrogs/T/#m861e4b1259d9b16b9970e46dfcfdae004a5dd634
>
> But that's looking at things from the ground up, which isn't terribly
> insightful as to what's going on, as you've noted. :)
>
> > BTW, in case this was not yet addressed: if you have a few seconds,
> > could you please (just briefly!) explain why it seems to take quite a
> > while to resolve this? A "not booting" regression sounds like something
> > that I'm pretty sure Linus normally wants to see addressed rather sooner
> > than later.
> > But that apparently is not the case here. I know that XFS devs
> > normally take regressions seriously, hence I assume there are good
> > reasons for it. But I'd like to roughly understand them (is this an
> > extreme corner-case issue that others are unlikely to run into, or
> > something like that?), as I don't want Linus on my back with questions
> > like "why didn't you put more pressure on the XFS maintainers" or
> > "you should have told me about this".
>
> First things first -- Ritesh reported problems wherein a freshly mounted
> filesystem would fail soon after because of some issue or other with the
> unlinked inode list. He could reproduce this problem, but (AFAIK) he's
> the only user who's actually reported this. It's not like *everyone*
> with XFS cannot boot anymore, it's just this one system. Given the
> sparsity of any other reports with similar symptoms, I do not judge this
> to be a hair-on-fire situation.
>
> (Contrast this to the extent busy deadlock problem that Wengang Wang is
> trying to solve, which (a) is hitting many customer systems and (b) is
> hitting them regularly. Criteria like (a) make things like that a
> higher severity problem IMHO.)

Contrast this to the regression from 6.3-rc1 that caused actual user
data loss and filesystem corruption after 6.3 was released:

https://bugzilla.redhat.com/show_bug.cgi?id=2208553

That's so much more important than any problem seen in a test
environment that it's not funny.

We had already fixed the regression that caused it in 6.4-rc1 - the
original bug report (a livelock in data writeback) came in two days
before 6.3 was released. It took me three days from report to having a
fix out for review (remember that timeframe). At the time we didn't
recognise the wider corruption risk that the failure we found exposed
us to, so we didn't push the fix to stable immediately. Hence, when
users started tripping over corruption, I triaged it down to a
misdirected data write from -somewhere-.
Eric then found a reproducer and bisected it to a range of XFS changes,
and I then realised what the problem was....

Stuff like this takes days of effort and multiple people to get to the
bottom of, and -everything else- gets ignored while we work through the
corruption problem.

Thorsten, I'm betting that you didn't even know about this regression -
it's been reported, tracked, triaged and fixed completely outside the
scope and visibility of the "kernel regression tracker". Which clearly
shows that we don't actually need some special kernel regression tracker
infrastructure to do our jobs properly, nor do we need a nanny to make
sure we actually are prioritising things correctly....

....

> Dave's patch addresses #5 by plumbing error returns up the stack so that
> frontend processes that push the background gc threads can receive
> errors and throw them out to userspace.

Right, the inodegc regression fix simply restored the previous status
quo. Nothing more, nothing less, exactly what we want regression fixes
to do. But it took some time for me to get to because there were much
higher priority events occurring....

....

> The problem with putting this in online repair is that Dave (AFAIK)
> feels very strongly that every bug report needs to be triaged
> immediately, and that takes priority over reviewing new code such as
> online repair.

My rationale is that we can ignore an issue once we know its scope, but
until we have that information the risk of being unprepared for a sudden
escalation is rather high, and that's even worse for stress and burnout
levels. The Fedora corruption bug I mention above is a canonical example
of why triaging bug reports immediately is important - the original bug
report was clearly something that needed to be fixed straight away,
regardless of the fact that we didn't know it could cause misdirected
writes at the time.
Once we got far enough into the Fedora bug report triage, I simply
pointed the distro at the commit for them to test, and they did
everything else. Once confirmation came in that it fixed the problem, I
sent it immediately to the stable kernel maintainers.

IOWs, if I had not paid attention to the original bug report, it would
have taken me several more days to find the problem and fix it
(remember that it took me ~3 days from report to fix originally). Users
would have been exposed to the corruption bug for much longer than they
were, and that doesn't make a bad situation any better.

And don't get me started on syzkaller and "security researchers"
raising inappropriate CVEs....

So, yeah, immediate triage is pretty much required at this point for
all bug reports, because the downstream impacts of ignoring them are
only causing more stress and burnout risk for lots more people. The
number of downstream people pulled into (and still dealing with the
fallout of) that recent, completely unnecessary CVE fire drill was
just ... crazy.

We can choose to ignore triaged bug reports if they aren't important
enough to deal with immediately (like this unlinked inode list issue),
but then we can make that decision (and justify it) based on the
knowledge we have rather than claiming ignorance. We're supposed to be
professional engineers, yes?

> That's the right thing to do, but every time someone
> sends in some automated fuzzer report, it slows down online repair
> review. This is why I'm burned out and cranky as hell about script
> kiddies dumping zerodays on the list and doing no work to help us fix
> the problems.

Reality sucks, and I hate it too. We get handed all the shit
sandwiches, and everyone seems to expect that we will simply eat them
up without complaining. But it's not like we didn't expect it -
upstream Linux development has always been a great big shit sandwich,
and it's not going to change any time soon....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx