On Sat 09-01-21 11:46:46, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> > > Side note, and not really related to UFFD, but the mmap_sem in
> > > general: I was at one point actually hoping that we could make the
> > > mmap_sem a spinlock, or at least make the rule be that we never do any
> > > IO under it. At which point a write lock hopefully really shouldn't be
> > > such a huge deal.
> >
> > There's a (small) group of us working towards that. It has some
> > prerequisites, but where we're hoping to go currently:
> >
> >  - Replace the vma rbtree with a b-tree protected with a spinlock
> >  - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset
> >  - If we need to do I/O, take a refcount on the VMA
> >
> > After that, we can gradually move things out from mmap_sem protection
> > to just the vma tree spinlock, or whatever makes sense for them. In a
> > very real way the mmap_sem is the MM layer's BKL.
>
> Well, we could do the "no IO" part first, and keep the semaphore part.
>
> Some people actually prefer a semaphore to a spinlock, because it
> doesn't end up causing preemption issues.
>
> As long as you don't do IO (or memory allocations) under a semaphore
> (ok, in this case it's a rwsem, same difference), it might even be
> preferable to keep it as a semaphore rather than as a spinlock.
>
> So it doesn't necessarily have to go all the way - we _could_ just try
> something like "when taking the mmap_sem, set a thread flag" and then
> have a "warn if doing allocations or IO under that flag".
>
> And since this is about performance, not some hard requirement, it
> might not even matter if we catch all cases. If we fix it so that any
> regular load on most normal filesystems never see the warning, we'd
> already be golden.

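
For illustration only, the "set a thread flag, warn on allocations or IO" idea
quoted above could look roughly like the toy userspace model below. This is a
sketch, not kernel code and not anything posted in this thread: `mmap_sem` is
modelled as a plain mutex, and `mmap_sem_held()` and `might_block()` are
hypothetical names invented for the example.

```python
import threading
import warnings
from contextlib import contextmanager

# Hypothetical stand-ins: a plain mutex for mmap_sem and a thread-local
# boolean for the proposed per-thread flag.
mmap_sem = threading.Lock()
_thread_state = threading.local()


@contextmanager
def mmap_sem_held():
    """Take the lock and mark the current thread as holding it."""
    with mmap_sem:
        _thread_state.under_mmap_sem = True
        try:
            yield
        finally:
            # Clear the flag before the lock is released.
            _thread_state.under_mmap_sem = False


def might_block(what):
    """Call on entry to anything that allocates memory or does IO.

    Warns and returns True if the calling thread currently holds mmap_sem,
    otherwise returns False.
    """
    if getattr(_thread_state, "under_mmap_sem", False):
        warnings.warn(f"{what} under mmap_sem", RuntimeWarning, stacklevel=2)
        return True
    return False
```

Since the check only warns, call sites that are missed cost nothing, which
fits the point above that catching every case is not a hard requirement.
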
Honestly, I'd *love* it if a filesystem could be guaranteed that ->fault and
->mkwrite callbacks do not happen under mmap_sem (or if at least the fs were
free to drop mmap_sem when it finds the page is not already cached / prepared
for writing). For filesystems the locking of page faults is really painful
because the lock ordering wrt mmap_sem is exactly the opposite of the
read / write path: the read & write paths must be designed so that mmap_sem
can be taken inside them to copy user data, while the fault path may run
entirely under mmap_sem. As a result this has been a long-term source of
deadlocks, stale data exposure issues, and filesystem corruption issues due
to insufficient locking in multiple filesystems.

But when I looked at what it would take to achieve this several years ago,
fixing all GUP users to deal with mmap_sem being dropped during a fault was a
gigantic task, because there were users of GUP relying on mmap_sem being held
for large code sections around the GUP call...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
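
The lock-ordering inversion described above can be shown with a toy order
checker, loosely in the spirit of lockdep. It is a minimal sketch under stated
assumptions: `fs_lock` is a hypothetical stand-in for whatever
filesystem-internal lock a given fs takes, `write_path()` and `fault_path()`
are invented names, and the checker only spots direct A/B versus B/A pairs,
not longer cycles.

```python
class LockOrderChecker:
    """Records which locks were held while another one was taken."""

    def __init__(self):
        self.edges = set()   # (outer, inner) pairs observed so far
        self.held = []       # locks currently held, in acquisition order

    def acquire(self, name):
        # An inversion is any held lock that was previously taken
        # *inside* the lock we are about to take.
        inversions = [(h, name) for h in self.held if (name, h) in self.edges]
        for h in self.held:
            self.edges.add((h, name))
        self.held.append(name)
        return inversions

    def release(self, name):
        self.held.remove(name)


def write_path(chk):
    """write(2)-like path: fs lock first, then mmap_sem to copy user data."""
    found = chk.acquire("fs_lock")
    found += chk.acquire("mmap_sem")   # copying from user memory may fault
    chk.release("mmap_sem")
    chk.release("fs_lock")
    return found


def fault_path(chk):
    """Page fault path: mmap_sem is already held when the fs is called."""
    found = chk.acquire("mmap_sem")
    found += chk.acquire("fs_lock")    # the fs callback needs its own lock
    chk.release("fs_lock")
    chk.release("mmap_sem")
    return found
```

Running `write_path()` and then `fault_path()` against the same checker makes
the second call report the `("mmap_sem", "fs_lock")` pair, which is the ABBA
pattern a filesystem has to design around as long as its fault callbacks run
under mmap_sem.
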