On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote: > > > > > > I think your proposal should work. Let me try to code it and see if > > > > > something breaks. > > > > Ok, I tried it out and things are a bit more complex: > > 1. We should allow write-locking a detached VMA, IOW vma_start_write() > > can be called when vm_refcnt is 0. > > This sounds dodgy, refcnt being zero basically means the object is dead > and you shouldn't be touching it no more. Where does this happen and > why? > > Notably, it being 0 means it is no longer in the mas tree and can't be > found anymore. It happens when a newly created vma that was not yet attached (vma->vm_refcnt = 0) is write-locked before being added into the vma tree. For example: mmap() mmap_write_lock() vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached) //vma attributes are initialized vma_start_write() // write 0x8000 0001 into vma->vm_refcnt mas_store_gfp() vma_mark_attached() mmap_write_lock() // vma_end_write_all() In this scenario, we write-lock the VMA before adding it into the tree to prevent readers (pagefaults) from using it until we drop the mmap_write_lock(). In your proposal, the first thing vma_start_write() does is add(0x8000'0001) and that will trigger a warning. For now instead of add(0x8000'0001) I can play this game to avoid the warning: if (refcount_inc_not_zero(&vma->vm_refcnt)) refcount_add(0x80000000, &vma->vm_refcnt); else refcount_set(&vma->vm_refcnt, 0x80000001); this refcount_set() works because vma with vm_refcnt==0 could not be found by readers. I'm not sure this will still work when we add TYPESAFE_BY_RCU and introduce vma reuse possibility. > > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit > > 0x40000000 to denote writers. > > I'm confused, what? We're talking about atomic_t, right? I thought you suggested using refcount_t. According to https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22 valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of that range. What am I missing? > > > 3. Currently vma_mark_attached() can be called on an already attached > > VMA. With vma->detached being a separate attribute that works fine but > > when we combine it with the vm_lock things break (extra attach would > > leak into lock count). I'll see if I can catch all the cases when we > > do this and clean them up (not call vma_mark_attached() when not > > necessary). > > Right, I hadn't looked at that thing in detail, that sounds like it > needs a wee cleanup like you suggest. Yes, I'll embark on that today. Will see how much of a problem that is.