[Been on PTO this last week and a half]

On Thu, Aug 10, 2023 at 11:45:26PM -0400, Kent Overstreet wrote:
> On Thu, Aug 10, 2023 at 03:39:42PM -0700, Darrick J. Wong wrote:
> > Nowadays we have all these people running bots and AIs throwing a
> > steady stream of bug reports and CVE reports at Dave [Chinner] and I.
> > Most of these people *do not* help fix the problems they report.
> > Once in a while there's an actual *user* report about data loss, but
> > those (thankfully) aren't the majority of the reports.
> >
> > However, every one of these reports has to be triaged, analyzed, and
> > dealt with. As soon as we clear one, at least one more rolls in. You
> > know what that means? Dave and I are both in a permanent state of
> > heightened alert, fear, and stress. We never get to settle back down
> > to calm. Every time someone brings up syzbot, CVEs, or security? I
> > feel my own stress response ramping up. I can no longer have
> > "rational" conversations about syzbot because those discussions push
> > my buttons.
> >
> > This is not healthy!
>
> Yeah, we really need to take a step back and ask ourselves what we're
> trying to do here.
>
> At this point, I'm not so sure hardening xfs/ext4 in all the ways
> people are wanting them to be hardened is a realistic idea: these are
> huge, old C codebases that are tricky to work on, and they weren't
> designed from the start with these kinds of considerations. Yes, in a
> perfect world all code should be secure and all bugs should be fixed,
> but is this the way to do it?

Look at it this way: for XFS we've already done the hardening work - we
started on that way back in 2008 when we began planning the V5
filesystem format to avoid all the random bit failures that were
occurring out there in the real world and from academic fuzzer
research.

The problem with syzbot has been that it has been testing the old V4
format, and it keeps tripping over different symptoms of the same
problems - problems the v5 format either isn't susceptible to, or that
it detects and then fixes or shuts the filesystem down on. Since syzbot
finally turned off v4 format testing on the 3rd July, we haven't had a
single new syzbot bug report on XFS. I don't expect syzbot to find a
significant number of new issues on XFS from this point onwards...

So, yeah, I think we did the bulk of the possible format
verification/hardening work in XFS a decade ago, and the stream of bugs
we've been seeing is due to intentionally ignoring the format that
actually provides some defences against random bit manipulation based
failures...
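To make that concrete, here's a rough sketch of the sort of check the
v5 self-describing metadata enables. This is purely illustrative - the
struct layout, field names and error cases below are made up for the
example and are not the real XFS on-disk format or kernel code - but it
shows why randomly flipped bits in a v5 image mostly produce clean
verifier errors rather than crashes deep in the code:

// Illustrative sketch only: a self-describing metadata block header in
// the style of the v5 format. Not the real XFS on-disk layout.
#[derive(Debug)]
struct BlockHeader {
    magic: u32,        // identifies the structure type
    crc: u32,          // checksum over the block (crc field zeroed)
    fs_uuid: [u8; 16], // which filesystem this block belongs to
    blkno: u64,        // where the block claims to live on disk
}

#[derive(Debug)]
enum VerifyError {
    BadMagic,
    BadCrc,
    WrongFilesystem,
    Misplaced,
}

// Hypothetical read-time verifier: reject random bit damage before the
// block contents are ever interpreted.
fn verify_block(
    hdr: &BlockHeader,
    expected_magic: u32,
    fs_uuid: &[u8; 16],
    daddr: u64,
    computed_crc: u32, // computed over the on-disk buffer by the caller
) -> Result<(), VerifyError> {
    if hdr.magic != expected_magic {
        return Err(VerifyError::BadMagic);
    }
    if hdr.crc != computed_crc {
        return Err(VerifyError::BadCrc);
    }
    if &hdr.fs_uuid != fs_uuid {
        return Err(VerifyError::WrongFilesystem);
    }
    if hdr.blkno != daddr {
        return Err(VerifyError::Misplaced);
    }
    Ok(())
}

On a verification failure the filesystem can repair the damage or shut
down cleanly instead of acting on corrupt metadata, which is the
behaviour the v4 format simply can't provide.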
> Personally, I think we'd be better served by putting what manpower we
> can spare into starting on an incremental Rust rewrite; at least
> that's my plan for bcachefs, and something I've been studying for a
> while (as soon as the gcc rust stuff lands I'll be adding Rust code to
> fs/bcachefs, some code already exists). For xfs/ext4, teasing things
> apart and figuring out how to restructure data structures in a way to
> pass the borrow checker may not be realistic, I don't know the
> codebases well enough to say - but clearly the current approach is not
> working, and these codebases are almost definitely still going to be
> in use 50 years from now, we need to be coming up with _some_ sort of
> plan.

For XFS, my plan for the past couple of years has been to start with
rewriting chunks of the userspace code in rust. That shares the core
libxfs code with the kernel, so the idea is that we slowly reimplement
bits of libxfs in userspace in rust, where we have the freedom to just
rip and tear the code apart. Then, when we have something that largely
works, we can pull that core libxfs rust code back into the kernel as
rust support improves.

Of course, that's been largely put on the back burner over the past
year or so because of all the other demands on my time that stuff like
dealing with 1-2 syzbot bug reports a week has resulted in....
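As a rough illustration of what "reimplement bits of libxfs in
userspace in rust" could look like - everything here is hypothetical;
the structure, field names and the C entry point don't correspond to
any real libxfs interface - the idea is that the parsing and sanity
checking lives in safe rust, and a thin C-callable shim lets the
existing xfsprogs C code keep working while we rip the rust side apart
as needed:

// Hypothetical sketch of a small piece of "libxfs in rust" living in
// userspace. Names and layout are invented for illustration only.

/// An on-disk record with big-endian fields, decoded into native types.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct AllocRecord {
    pub startblock: u32,
    pub blockcount: u32,
}

#[derive(Debug)]
pub enum ParseError {
    ShortBuffer,
    ZeroLengthExtent,
}

/// Safe parser: all the bounds and sanity checks live here, in one place.
pub fn parse_alloc_record(buf: &[u8]) -> Result<AllocRecord, ParseError> {
    if buf.len() < 8 {
        return Err(ParseError::ShortBuffer);
    }
    let startblock = u32::from_be_bytes(buf[0..4].try_into().unwrap());
    let blockcount = u32::from_be_bytes(buf[4..8].try_into().unwrap());
    if blockcount == 0 {
        return Err(ParseError::ZeroLengthExtent);
    }
    Ok(AllocRecord { startblock, blockcount })
}

/// C-callable shim so existing C code can keep calling the rust
/// implementation until the kernel can consume the rust core directly.
#[no_mangle]
pub extern "C" fn rust_parse_alloc_record(
    buf: *const u8,
    len: usize,
    out: *mut AllocRecord,
) -> i32 {
    if buf.is_null() || out.is_null() {
        return -1;
    }
    let bytes = unsafe { std::slice::from_raw_parts(buf, len) };
    match parse_alloc_record(bytes) {
        Ok(rec) => {
            unsafe { *out = rec };
            0
        }
        Err(_) => -1,
    }
}

The pure-rust core has no C dependencies at all, so if/when kernel rust
support matures it can move across largely unchanged, while the
extern "C" shim stays behind in userspace.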
> And if we had a coherent long term plan, maybe that would help with
> the funding and headcount issues...

I don't think a lack of a plan is the problem with funding and
headcount. At its core, the problem is inherent in the capitalist model
that is "funding" the "community" - squeeze the most you can from as
little as possible and externalise the costs as much as possible.
Burning people out is essentially externalising the human cost of
corporate bean counter optimisation of the bottom line...

If I had a dollar for every time I'd been told "we don't have the money
for more resources" whilst both company revenue and profits are
continually going up, we could pay for another engineer...

[....]

> > A broader dynamic here is that I ask people to review the code so
> > that I can merge it; they say they will do it; and then an entire
> > cycle goes by without any visible progress.
> >
> > When I ask these people why they didn't follow through on their
> > commitments, the responses I hear are pretty uniform -- they got
> > buried in root cause analysis of a real bug report but lol there
> > were no other senior people available; their time ended up being
> > spent on backports or arguing about backports; or they got caught
> > up in that whole freakout thing I described above.
>
> Yeah, that set of priorities makes sense when we're talking about
> patches that modify existing code; if you can't keep up with bug
> reports then you have to slow down on changes, and changes to existing
> code often do need the meticulous review - and hopefully while people
> are waiting on code review they'll be helping out with bug reports.
>
> But for new code that isn't going to upset existing users, if we trust
> the author to not do crazy things then code review is really more
> about making sure someone else understands the code. But if they're
> putting in all the proper effort to document, to organize things well,
> to do things responsibly, does it make sense for that level of code
> review to be an up front requirement? Perhaps we could think a _bit_
> more about how we enable people to do good work.

That's pretty much the review rationale I'm using for the online fsck
code. I really only care in detail about how it interfaces with the
core XFS infrastructure, and as long as the rest of it makes sense and
doesn't make my eyes bleed then it's good enough.

That doesn't change the fact that it takes me at least a week to read
through 10,000 lines of code with sufficient rigour to form an opinion
on it, and that's before I know what I need to look at in more detail.
So however you look at it, even a "good enough" review of 50,000 lines
of new code (the size of online fsck) still requires a couple of months
of review time for someone who knows the subsystem intimately...

> I'm sure the XFS people have thought about this more than I have, but
> given how long this has been taking you and the amount of pushback I
> feel it ought to be asked.

Certainly we have, and for a long time the merge criteria for code that
is tagged as EXPERIMENTAL have been lower (i.e. "good enough"), and
that's how we merged things like the v5 format, reflink support, rmap
support, etc. without huge amounts of review friction.

The problem is that the drive over the past few years for more intense
review - driven by CI and bot testing combined with no-regression
policies - has really pushed common sense out the window. These days it
feels like we're only allowed to merge "perfect" code, otherwise the
code is "insecure" (e.g. the policies being advocated by syzbot
developers).

Hence review over the past few years has become more finicky and picky
because of the fear that regressions will be introduced with new code.
This is a direct result of it being drummed into developers that
regressions and CI failures must be avoided at -all costs-. i.e. the
policies and testing infrastructure being used to "validate" these
large software projects are pushing us hard towards the "code must be
perfect at first attempt" side of the coin rather than the more
practical (and achievable) "good enough" bar.

CI is useful and good for code quality, but common sense has to prevail
at some point....

-Dave.
--
Dave Chinner
dchinner@xxxxxxxxxx