On Sun, Jan 12, 2025 at 09:35:25AM -0800, Suren Baghdasaryan wrote:
>On Sat, Jan 11, 2025 at 6:59 PM Wei Yang <richard.weiyang@xxxxxxxxx> wrote:
>>
>> On Fri, Jan 10, 2025 at 08:25:58PM -0800, Suren Baghdasaryan wrote:
>> >rw_semaphore is a sizable structure of 40 bytes and consumes
>> >considerable space for each vm_area_struct. However vma_lock has
>> >two important specifics which can be used to replace rw_semaphore
>> >with a simpler structure:
>> >1. Readers never wait. They try to take the vma_lock and fall back to
>> >mmap_lock if that fails.
>> >2. Only one writer at a time will ever try to write-lock a vma_lock
>> >because writers first take mmap_lock in write mode.
>> >Because of these requirements, full rw_semaphore functionality is not
>> >needed and we can replace rw_semaphore and the vma->detached flag with
>> >a refcount (vm_refcnt).
>>
>> This paragraph is merged into the above one in the commit log, which may not
>> what you expect.
>>
>> Just a format issue, not sure why they are not separated.
>
>I'll double-check the formatting. Thanks!
>
>>
>> >When vma is in detached state, vm_refcnt is 0 and only a call to
>> >vma_mark_attached() can take it out of this state. Note that unlike
>> >before, now we enforce both vma_mark_attached() and vma_mark_detached()
>> >to be done only after vma has been write-locked. vma_mark_attached()
>> >changes vm_refcnt to 1 to indicate that it has been attached to the vma
>> >tree. When a reader takes read lock, it increments vm_refcnt, unless the
>> >top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
>> >a writer. When writer takes write lock, it sets the top usable bit to
>> >indicate its presence. If there are readers, writer will wait using newly
>> >introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
>> >mode first, there can be only one writer at a time. The last reader to
>> >release the lock will signal the writer to wake up.
>> >refcount might overflow if there are many competing readers, in which case
>> >read-locking will fail. Readers are expected to handle such failures.
>> >In summary:
>> >1. all readers increment the vm_refcnt;
>> >2. writer sets top usable (writer) bit of vm_refcnt;
>> >3. readers cannot increment the vm_refcnt if the writer bit is set;
>> >4. in the presence of readers, writer must wait for the vm_refcnt to drop
>> >to 1 (ignoring the writer bit), indicating an attached vma with no readers;
>>
>> It waits until to (VMA_LOCK_OFFSET + 1) as indicates in __vma_start_write(),
>> if I am right.
>
>Yeah, that's why I mentioned "(ignoring the writer bit)" but maybe
>that's too confusing. How about "drop to 1 (plus the VMA_LOCK_OFFSET
>writer bit)?
>

Hmm.. hard to say. It is a little confusing, but I don't have a better one :-(

>>
>> >5. vm_refcnt overflow is handled by the readers.
>> >
>> >While this vm_lock replacement does not yet result in a smaller
>> >vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
>> >allows for further size optimization by structure member regrouping
>> >to bring the size of vm_area_struct below 192 bytes.
>> >
>> --
>> Wei Yang
>> Help you, Help me

-- 
Wei Yang
Help you, Help me
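
For anyone following the thread, the five-point summary quoted above, including the (VMA_LOCK_OFFSET + 1) wait condition under discussion, can be sketched in userspace with C11 atomics. This is only an illustration and not the code from the patch: struct vma_sketch and every sketch_* helper are hypothetical names, the exact overflow limit is an assumption, and sleeping on mm->vma_writer_wait is replaced by boolean return values that tell the caller whether the writer may proceed or should be woken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Writer bit: the "top usable bit" value given in the commit log. */
#define VMA_LOCK_OFFSET 0x40000000u

/* Hypothetical stand-in for the vm_refcnt member of vm_area_struct. */
struct vma_sketch {
    atomic_uint vm_refcnt; /* 0: detached, 1: attached, >1: attached + readers */
};

/* Attach: refcount goes from 0 to 1 (caller is assumed to hold the write lock). */
static void sketch_mark_attached(struct vma_sketch *vma)
{
    atomic_store(&vma->vm_refcnt, 1u);
}

/* Reader: increment unless detached, writer bit set, or count would overflow. */
static bool sketch_read_trylock(struct vma_sketch *vma)
{
    unsigned int old = atomic_load(&vma->vm_refcnt);

    do {
        if (old == 0 || (old & VMA_LOCK_OFFSET))
            return false; /* caller falls back to mmap_lock */
        if (old + 1 >= VMA_LOCK_OFFSET)
            return false; /* assumed limit: never carry into the writer bit */
    } while (!atomic_compare_exchange_weak(&vma->vm_refcnt, &old, old + 1));

    return true;
}

/* Reader release: returns true if this was the last reader and a writer waits. */
static bool sketch_read_unlock(struct vma_sketch *vma)
{
    unsigned int old = atomic_fetch_sub(&vma->vm_refcnt, 1u);

    assert((old & ~VMA_LOCK_OFFSET) > 1); /* must have held a read reference */
    return (old - 1) == (VMA_LOCK_OFFSET + 1);
}

/* Writer is done waiting once only the attach reference and its own bit remain. */
static bool sketch_writer_may_proceed(struct vma_sketch *vma)
{
    return atomic_load(&vma->vm_refcnt) == VMA_LOCK_OFFSET + 1;
}

/* Writer: set the writer bit; returns true if there are no readers to wait for. */
static bool sketch_writer_enter(struct vma_sketch *vma)
{
    atomic_fetch_add(&vma->vm_refcnt, VMA_LOCK_OFFSET);
    return sketch_writer_may_proceed(vma);
}

/* Writer: clear the writer bit so readers can take the lock again. */
static void sketch_writer_exit(struct vma_sketch *vma)
{
    atomic_fetch_sub(&vma->vm_refcnt, VMA_LOCK_OFFSET);
}
```

In this sketch an attached vma with one reader holds vm_refcnt == 2. A writer entering makes it VMA_LOCK_OFFSET + 2, and the last reader's release brings it to VMA_LOCK_OFFSET + 1, which is the value the writer waits for. Only one writer is assumed at a time, matching the commit log's point that writers take mmap_lock in write mode first.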