On 2/13/25 23:46, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space in each vm_area_struct. However, vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
>    mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock,
>    because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore and the vma->detached flag with
> a refcount (vm_refcnt).
>
> When a vma is in the detached state, vm_refcnt is 0 and only a call to
> vma_mark_attached() can take it out of this state. Note that unlike
> before, we now enforce that both vma_mark_attached() and
> vma_mark_detached() are done only after the vma has been write-locked.
> vma_mark_attached() changes vm_refcnt to 1 to indicate that the vma
> has been attached to the vma tree. When a reader takes the read lock,
> it increments vm_refcnt, unless the top usable bit of vm_refcnt
> (0x40000000) is set, indicating the presence of a writer. When a
> writer takes the write lock, it sets the top usable bit to indicate
> its presence. If there are readers, the writer will wait using the
> newly introduced mm->vma_writer_wait. Since all writers take mmap_lock
> in write mode first, there can be only one writer at a time. The last
> reader to release the lock will signal the writer to wake up.
> The refcount might overflow if there are many competing readers, in
> which case read-locking will fail. Readers are expected to handle
> such failures.
>
> In summary:
> 1. all readers increment the vm_refcnt;
> 2. writer sets the top usable (writer) bit of vm_refcnt;
> 3. readers cannot increment the vm_refcnt if the writer bit is set;
> 4. in the presence of readers, writer must wait for the vm_refcnt to
>    drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an
>    attached vma with no readers;
> 5. vm_refcnt overflow is handled by the readers.
>
> While this vm_lock replacement does not yet result in a smaller
> vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
> allows for further size optimization by structure member regrouping
> to bring the size of vm_area_struct below 192 bytes.
>
> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>

With the fix,

Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>