On Thu, Jan 30, 2025 at 08:04:49AM -0800, Dave Hansen wrote: > On 1/29/25 23:44, Kent Overstreet wrote: > > On Wed, Jan 29, 2025 at 10:17:49AM -0800, Dave Hansen wrote: > >> tl;dr: The VFS and several filesystems have some suspect prefaulting > >> code. It is unnecessarily slow for the common case where a write's > >> source buffer is resident and does not need to be faulted in. > >> > >> Move these "prefaulting" operations to slow paths where they ensure > >> forward progress but they do not slow down the fast paths. This > >> optimizes the fast path to touch userspace once instead of twice. > >> > >> Also update somewhat dubious comments about the need for prefaulting. > >> > >> This has been very lightly tested. I have not tested any of the fs/ > >> code explicitly. > > > > Q: what is preventing us from posting code to the list that's been > > properly tested? > > > > I just got another bcachefs patch series that blew up immediately when I > > threw it at my CI. > > > > This is getting _utterly ridiculous_. > > In this case, I started with a single patch for generic code that I knew > I could test. In fact, I even had the 9-year-old binary sitting on my > test box. > > Dave Chinner suggested that I take the generic pattern go look a _bit_ > more widely in the tree for a similar pattern. That search paid off, I > think. But I ended up touching corners of the tree I don't know well and > don't have test cases for. That's all well and good, but the testing thing is really coming to a head. I have enough on my plate without back-and-forth > > I built multiuser test infrastructure with a nice dashboard that anyone > > can use, and the only response I've gotten from the old guard is Ted > > jumping in every time I talk about it to say "no, we just don't want to > > rewrite our stuff on _your_ stuff!". Real helpful, that. > > Sounds pretty cool! Is this something that I could have and should have > used to test the bcachefs patch? I see some trees in here: > > https://evilpiepirate.org/~testdashboard/ci > > But I'm not sure how to submit patches to it. Do you need to add users > manually? I wonder, though, how we could make it easier to find. I > didn't see anything Documentation/filesystems/bcachefs/ about this. Yes, I give out user accounts and that gives you a config file you can edit to specify branches to test and tests to run; it then automatically watches those branch(es). Here's the thing though, the servers cost real money. I give out accounts to community members (and people working on bcachefs are using it and it's working wel), but if you've got a big tech company email address you (or your managers) will have to be pitching in. Not subsidizing you guys :) > > >> 1. Deadlock avoidance if the source and target are the same > >> folios. > >> 2. To check the user address that copy_folio_from_iter_atomic() > >> will touch because atomic user copies do not check the address. > >> 3. "Optimization" > >> > >> I'm not sure any of these are actually valid reasons. > >> > >> The "atomic" user copy functions disable page fault handling because > >> page faults are not very atomic. This makes them naturally resistant > >> to deadlocking in page fault handling. They take the page fault > >> itself but short-circuit any handling. > > > > #1 is emphatically valid: the deadlock avoidance is in _both_ using > > _atomic when we have locks held, and doing the actual faulting with > > locks dropped... either alone would be a buggy incomplete solution. > > I was (badly) attempting to separate out the two different problems: > > 1. Doing lock_page() twice, which I was mostly calling the > "deadlock" > 2. Retrying the copy_folio_from_iter_atomic() forever which I > was calling the "livelock" > > Disabling page faults fixes #1. > Doing faulting outside the locks somewhere fixes #2. > > So when I was talking about "Deadlock avoidance" in the cover letter, I > was trying to focus on the double lock_page() problem. > > > This needs to be reflected and fully described in the comments, since > > it's subtle and a lot of people don't fully grok what's going on. > > Any suggestions for fully describing the situation? I tried to sprinkle > comments liberally but I'm also painfully aware that I'm not doing a > perfect job of talking about the fs code. The critical thing to cover is the fact that mmap means that page faults can recurse into arbitrary filesystem code, thus blowing a hole in all our carefully crafted lock ordering if we allow that while holding locks - you didn't mention that at all. > > I'm fairly certain we have ioctl code where this is mishandled and thus > > buggy, because it takes some fairly particular testing for lockdep to > > spot it. > > Yeah, I wouldn't be surprised. It was having a little chuckle thinking > about how many engineers have discovered and fixed this problem > independently over the years in all the file system code in all the OSes. Oh, this is the easy one - mmap and dio is where it really gets into "stories we tell young engineers to keep them awake at night". > > >> copy_folio_from_iter_atomic() also *does* have user address checking. > >> I get a little lost in the iov_iter code, but it does know when it's > >> dealing with userspace versus kernel addresses and does seem to know > >> when to do things like copy_from_user_iter() (which does access_ok()) > >> versus memcpy_from_iter().[1] > >> > >> The "optimization" is for the case where 'source' is not faulted in. > >> It can avoid the cost of a "failed" page fault (it will fail to be > >> handled because of the atomic copy) and then needing to drop locks and > >> repeat the fault. > > > > I do agree on moving it to the slowpath - I think we can expect the case > > where the process's immediate workingset is faulted out while it's > > running to be vanishingly small. > > Great! I'm glad we're on the same page there. > > For bcachefs specifically, how should we move forward? If you're happy > with the concept, would you prefer that I do some manual bcachefs > testing? Or leave a branch sitting there for a week and pray the robots > test it? No to the sit and pray. If I see one more "testing? that's something other people do" conversation I'll blow another gasket. xfstests supports bcachefs, and if you need a really easy way to run it locally on all the various filesystems, I have a solution for that: https://evilpiepirate.org/git/ktest.git/ If you want access to my CI that runs all that in parallel across 120 VMs with the nice dashboard - shoot me an email and I'll outline server costs and we can work something out.