On Tue, Dec 10, 2024 at 2:39 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 12, 2024 at 07:18:45AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Nov 11, 2024 at 8:58 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Nov 11, 2024 at 12:55:05PM -0800, Suren Baghdasaryan wrote:
> > > > When a reader takes read lock, it increments the atomic, unless the
> > > > top two bits are set indicating a writer is present.
> > > > When writer takes write lock, it sets VMA_LOCK_WR_LOCKED bit if there
> > > > are no readers or VMA_LOCK_WR_WAIT bit if readers are holding the lock
> > > > and puts itself onto newly introduced mm.vma_writer_wait. Since all
> > > > writers take mmap_lock in write mode first, there can be only one writer
> > > > at a time. The last reader to release the lock will signal the writer
> > > > to wake up.
> > >
> > > I don't think you need two bits. You can do it this way:
> > >
> > > 0x8000'0000 - No readers, no writers
> > > 0x1-7fff'ffff - Some number of readers
> > > 0x0 - Writer held
> > > 0x8000'0001-0xffff'ffff - Reader held, writer waiting
> > >
> > > A prospective writer subtracts 0x8000'0000. If the result is 0, it got
> > > the lock, otherwise it sleeps until it is 0.
> > >
> > > A writer unlocks by adding 0x8000'0000 (not by setting the value to
> > > 0x8000'0000).
> > >
> > > A reader unlocks by adding 1. If the result is 0, it wakes the writer.
> > >
> > > A prospective reader subtracts 1. If the result is positive, it got the
> > > lock, otherwise it does the unlock above (this might be the one which
> > > wakes the writer).
> > >
> > > And ... that's it. See how we use the CPU arithmetic flags to tell us
> > > everything we need to know without doing arithmetic separately?
> >
> > Yes, this is neat! You are using the fact that write-locked == no
> > readers to eliminate unnecessary state. I'll give that a try. Thanks!
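
For reference, the single-word encoding quoted above can be modelled in
user space roughly as follows. This is only a sketch: it uses C11 atomics
rather than the kernel's atomic_t, every sketch_* name is made up for
illustration, and sleeping/waking is reduced to boolean return values so
the arithmetic can be followed in isolation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the numeric encoding comes from the thread. */
#define SKETCH_UNLOCKED 0x80000000u /* no readers, no writer */

struct sketch_lock {
    _Atomic uint32_t val;
};

static void sketch_init(struct sketch_lock *l)
{
    atomic_init(&l->val, SKETCH_UNLOCKED);
}

/* Reader unlock: add 1. Returns true when the result is 0, meaning the
 * caller should wake the writer. */
static bool sketch_read_unlock(struct sketch_lock *l)
{
    uint32_t old = atomic_fetch_add(&l->val, 1);
    uint32_t new_val = old + 1u;

    assert(old != SKETCH_UNLOCKED); /* unlock without a matching lock */
    return new_val == 0;
}

/* Reader lock: subtract 1. A result in 1..0x7fffffff ("positive" when
 * viewed as signed) means the read lock is held. Otherwise undo with the
 * unlock step, which may be the one that has to wake the writer. */
static bool sketch_read_trylock(struct sketch_lock *l, bool *wake_writer)
{
    uint32_t new_val = atomic_fetch_sub(&l->val, 1) - 1u;

    *wake_writer = false;
    if (new_val != 0 && new_val <= 0x7fffffffu)
        return true;
    *wake_writer = sketch_read_unlock(l);
    return false;
}

/* Writer lock: subtract 0x8000'0000. Returns true if the result is 0
 * (lock acquired). False means readers remain and a real writer would
 * sleep until the value reaches 0. */
static bool sketch_write_lock_begin(struct sketch_lock *l)
{
    uint32_t new_val = atomic_fetch_sub(&l->val, SKETCH_UNLOCKED)
                       - SKETCH_UNLOCKED;
    return new_val == 0;
}

/* Writer unlock: add 0x8000'0000 instead of storing it, so that any
 * in-flight reader decrement is not overwritten. */
static void sketch_write_unlock(struct sketch_lock *l)
{
    atomic_fetch_add(&l->val, SKETCH_UNLOCKED);
}
```

With one reader holding the lock, sketch_write_lock_begin() returns false
and the value becomes 0xffff'ffff; the reader's sketch_read_unlock() then
brings it to 0 and reports that the writer should be woken.
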
>
> The reason I got here is that Vlastimil poked me about the whole
> TYPESAFE_BY_RCU thing.
>
> So the normal way those things work is with a refcount, if the refcount
> is non-zero, the identifying fields should be stable and you can
> determine if you have the right object, otherwise tough luck.
>
> And I was thinking that since you abuse this rwsem you have, you might
> as well turn that into a refcount with some extra.
>
> So I would propose a slightly different solution.
>
> Replace vm_lock with vm_refcnt. Replace vm_detached with vm_refcnt == 0
> -- that is, attach sets refcount to 1 to indicate it is part of the mas,
> detached is the final 'put'.

I need to double-check if we ever write-lock a detached vma. I don't
think we do but better be safe. If we do then that wait-until() should
accept 0x8000'0001 as well.

>
> RCU lookup does the inc_not_zero thing, when increment succeeds, compare
> mm/addr to validate.
>
> vma_start_write() already relies on mmap_lock being held for writing,
> and thus does not have to worry about writer-vs-writer contention, that
> is fully resolved by mmap_sem. This means we only need to wait for
> readers to drop out.
>
> vma_start_write()
>         add(0x8000'0001); // could fetch_add and double check the high
>                           // bit wasn't already set.
>         wait-until(refcnt == 0x8000'0002); // mas + writer ref
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>         sub(0x8000'0000);
>
> vma_end_write()
>         put();

We don't really have vma_end_write(). Instead it's vma_end_write_all()
which increments mm_lock_seq unlocking all write-locked VMAs.
Therefore in vma_start_write() I think we can sub(0x8000'0001) at the end.
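
To make the arithmetic concrete, here is a rough user-space model of that
vma_start_write() variant, with the final sub(0x8000'0001) and a wait that
also accepts 0x8000'0001 for a detached VMA. It is a sketch under several
assumptions: C11 atomics stand in for the kernel primitives, the wait is a
plain spin where the kernel would sleep, mm_lock_seq is passed in as a
parameter, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical name for the "writer present" high bit. */
#define SKETCH_WRITER_BIT 0x80000000u

struct sketch_vma {
    _Atomic uint32_t vm_refcnt;   /* 0: detached, 1: attached, no users */
    _Atomic uint32_t vm_lock_seq;
};

static void sketch_vma_init(struct sketch_vma *vma)
{
    atomic_init(&vma->vm_refcnt, 0);
    atomic_init(&vma->vm_lock_seq, 0);
}

/* Attach takes the reference that represents membership in the tree. */
static void sketch_vma_attach(struct sketch_vma *vma)
{
    assert(atomic_load(&vma->vm_refcnt) == 0);
    atomic_store(&vma->vm_refcnt, 1);
}

static void sketch_vma_start_write(struct sketch_vma *vma,
                                   uint32_t mm_lock_seq)
{
    /* Set the writer bit and take a temporary writer reference. */
    uint32_t old = atomic_fetch_add(&vma->vm_refcnt,
                                    SKETCH_WRITER_BIT + 1u);

    /* mmap_lock held for write is assumed to exclude other writers,
     * so the high bit must not have been set already. */
    assert((old & SKETCH_WRITER_BIT) == 0);

    /* Wait for readers to drop out: 0x8000'0002 is tree ref + writer
     * ref, 0x8000'0001 is the detached case. Accepting both without
     * checking which applies is a simplification of this sketch. */
    for (;;) {
        uint32_t cnt = atomic_load(&vma->vm_refcnt);

        if (cnt == SKETCH_WRITER_BIT + 2u ||
            cnt == SKETCH_WRITER_BIT + 1u)
            break;
    }

    atomic_store(&vma->vm_lock_seq, mm_lock_seq);

    /* No per-VMA end_write: drop the bit and the writer ref together. */
    atomic_fetch_sub(&vma->vm_refcnt, SKETCH_WRITER_BIT + 1u);
}
```

In this model an attached VMA with no readers goes 1 -> 0x8000'0002 -> 1
across the call, and the VMA stays write-locked only by virtue of
vm_lock_seq matching mm_lock_seq.
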
>
> vma_start_read() then becomes something like:
>
>         if (vm_lock_seq == mm_lock_seq)
>                 return false;
>
>         cnt = fetch_inc(1);
>         if (cnt & msb || vm_lock_seq == mm_lock_seq) {
>                 put();
>                 return false;
>         }
>
>         return true;
>
> vma_end_read() then becomes:
>         put();
>
>
> and the down_read() from uffffffd requires mmap_read_lock() and thus
> does not have to worry about writers, it can simply be inc() and put(),
> no?

I think your proposal should work. Let me try to code it and see if
something breaks. Thanks Peter!
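
As a starting point, the read side quoted above might look roughly like
this in a user-space model. It is a sketch only: C11 atomics replace the
kernel primitives, mm_lock_seq is passed in as a parameter, the sketch_*
names are hypothetical, and the inc_not_zero check for a detached VMA
(refcount 0) is left out to stay close to the pseudocode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the "writer present" high bit ("msb" above). */
#define SKETCH_WRITER_BIT 0x80000000u

struct sketch_vma {
    _Atomic uint32_t vm_refcnt;
    _Atomic uint32_t vm_lock_seq;
};

/* Drop one reference. */
static void sketch_vma_put(struct sketch_vma *vma)
{
    uint32_t old = atomic_fetch_sub(&vma->vm_refcnt, 1);

    assert(old != 0); /* refcount underflow */
}

static bool sketch_vma_start_read(struct sketch_vma *vma,
                                  uint32_t mm_lock_seq)
{
    /* Cheap early exit: the VMA is already write-locked. */
    if (atomic_load(&vma->vm_lock_seq) == mm_lock_seq)
        return false;

    /* Take a reference, then re-check: a writer may have set the high
     * bit or published vm_lock_seq in the meantime. */
    uint32_t cnt = atomic_fetch_add(&vma->vm_refcnt, 1);

    if ((cnt & SKETCH_WRITER_BIT) ||
        atomic_load(&vma->vm_lock_seq) == mm_lock_seq) {
        sketch_vma_put(vma);
        return false;
    }
    return true;
}

static void sketch_vma_end_read(struct sketch_vma *vma)
{
    sketch_vma_put(vma);
}
```

A reader that loses the race leaves the count exactly as it found it, so
a writer waiting for 0x8000'0002 only ever sees a transient extra
reference.
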