On Wed, Dec 11, 2024 at 7:20 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Wed, Dec 11, 2024 at 12:25 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > On Tue, Dec 10, 2024 at 03:37:50PM -0800, Suren Baghdasaryan wrote: > > > > > > Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0 > > > > -- that is, attach sets refcount to 1 to indicate it is part of the mas, > > > > detached is the final 'put'. > > > > > > I need to double-check if we ever write-lock a detached vma. I don't > > > think we do but better be safe. If we do then that wait-until() should > > > accept 0x8000'0001 as well. > > > > vma_start_write() > > __is_vma_write_locked() > > mmap_assert_write_locked(vma->vm_mm); > > > > So this really should hold afaict. > > > > > > RCU lookup does the inc_not_zero thing, when increment succeeds, compare > > > > mm/addr to validate. > > > > > > > > vma_start_write() already relies on mmap_lock being held for writing, > > > > and thus does not have to worry about writer-vs-writer contention, that > > > > is fully resolved by mmap_sem. This means we only need to wait for > > > > readers to drop out. > > > > > > > > vma_start_write() > > > > add(0x8000'0001); // could fetch_add and double check the high > > > > // bit wasn't already set. > > > > wait-until(refcnt == 0x8000'0002); // mas + writer ref > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > sub(0x8000'0000); > > > > > > > > vma_end_write() > > > > put(); > > > > > > We don't really have vma_end_write(). Instead it's vma_end_write_all() > > > which increments mm_lock_seq unlocking all write-locked VMAs. > > > Therefore in vma_start_write() I think we can sub(0x8000'0001) at the > > > end. > > > > Right, I know you don't, but you should :-), I've suggested adding this > > before. > > I'll look into adding it. IIRC there was some issue with that but I > can't recall... > > > > > > > vma_start_read() then becomes something like: > > > > > > > > if (vm_lock_seq == mm_lock_seq) > > > > return false; > > > > > > > > cnt = fetch_inc(1); > > > > if (cnt & msb || vm_lock_seq == mm_lock_seq) { > > > > put(); > > > > return false; > > > > } > > > > > > > > return true; > > > > > > > > vma_end_read() then becomes: > > > > put(); > > > > > > > > > > > > and the down_read() from uffffffd requires mmap_read_lock() and thus > > > > does not have to worry about writers, it can simpy be inc() and put(), > > > > no? > > > > > > I think your proposal should work. Let me try to code it and see if > > > something breaks. Ok, I tried it out and things are a bit more complex: 1. We should allow write-locking a detached VMA, IOW vma_start_write() can be called when vm_refcnt is 0. In that situation add(0x8000'0001) leads to "addition on 0; use-after-free". Maybe I can introduce a new refcnt function which does not complain when adding to 0? 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit 0x40000000 to denote writers. 3. Currently vma_mark_attached() can be called on an already attached VMA. With vma->detached being a separate attribute that works fine but when we combine it with the vm_lock things break (extra attach would leak into lock count). I'll see if I can catch all the cases when we do this and clean them up (not call vma_mark_attached() when not necessary). > > > > Btw, for the wait-until() and put() you can use rcuwait; that is the > > simplest wait form we have. It's suitable because we only ever have the > > one waiter. > > Yes, Davidlohr mentioned that before. I'll do that. Thanks!