On Thu, Dec 12, 2024 at 06:17:44AM -0800, Suren Baghdasaryan wrote:
> On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote:
> >
> > > > > > I think your proposal should work. Let me try to code it and see if
> > > > > > something breaks.
> > >
> > > Ok, I tried it out and things are a bit more complex:
> > > 1. We should allow write-locking a detached VMA, IOW vma_start_write()
> > > can be called when vm_refcnt is 0.
> >
> > This sounds dodgy, refcnt being zero basically means the object is dead
> > and you shouldn't be touching it no more. Where does this happen and
> > why?
> >
> > Notably, it being 0 means it is no longer in the mas tree and can't be
> > found anymore.
>
> It happens when a newly created vma that was not yet attached
> (vma->vm_refcnt = 0) is write-locked before being added into the vma
> tree. For example:
> mmap()
>     mmap_write_lock()
>     vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached)
>     //vma attributes are initialized
>     vma_start_write() // write 0x8000 0001 into vma->vm_refcnt
>     mas_store_gfp()
>     vma_mark_attached()
>     mmap_write_lock() // vma_end_write_all()
>
> In this scenario, we write-lock the VMA before adding it into the tree
> to prevent readers (pagefaults) from using it until we drop the
> mmap_write_lock().

Ah, but you can do that by setting vma->vm_lock_seq and setting the ref
to 1 before adding it (it's not visible before adding anyway, so nobody
cares).

You'll note that the read thing checks both the msb (or other high bit
depending on the actual type you're going with) *and* the seq. That is
needed because we must not set the sequence number before all existing
readers are drained, but since this is pre-add that is not a concern.

> > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit
> > > 0x40000000 to denote writers.
> >
> > I'm confused, what? We're talking about atomic_t, right?
>
> I thought you suggested using refcount_t. According to
> https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22
> valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of
> that range. What am I missing?

I was talking about atomic_t :-). But yeah, maybe we can use refcount_t;
I hadn't initially considered that.
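
To make the two points above concrete, here is a minimal user-space
sketch. It is only a toy model under stated assumptions, not the kernel
code: struct toy_vma, toy_vma_init_locked(), toy_vma_try_read(),
in_refcount_range() and the WRITER_BIT_* names are all made up for
illustration, C11 atomics stand in for atomic_t, and a 32-bit int is
assumed. It shows (a) why 0x80000000 falls outside the [0..0x7fff_ffff]
range quoted above while 0x40000000 does not, and (b) how a not yet
visible object can be "locked" by just setting its seq and ref to 1,
with the reader checking both the writer bit and the seq.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical writer flags: the two candidate bits discussed above. */
#define WRITER_BIT_MSB   0x80000000u /* top bit of a 32-bit counter */
#define WRITER_BIT_LOWER 0x40000000u /* next bit down */

/* Toy stand-in for a VMA; NOT the kernel's struct vm_area_struct. */
struct toy_vma {
	atomic_uint refcnt;    /* 0 = detached, >= 1 = attached (+ readers) */
	unsigned int lock_seq; /* seq of the write lock that last took it */
};

/*
 * Assumed validity rule, taken from the range quoted in the thread:
 * a value is usable only if it lies in [0..0x7fffffff], i.e. it is
 * non-negative when viewed as a signed 32-bit int.
 */
static bool in_refcount_range(unsigned int v)
{
	return v <= (unsigned int)INT_MAX;
}

/*
 * Pre-add initialisation: the object is not in any tree yet, so there
 * are no readers to drain and the seq can be set directly.
 */
static void toy_vma_init_locked(struct toy_vma *vma, unsigned int mm_seq)
{
	assert(atomic_load(&vma->refcnt) == 0); /* must still be detached */
	vma->lock_seq = mm_seq;
	atomic_store(&vma->refcnt, 1);
}

/*
 * Reader side: take a reference unless the object is detached or has a
 * writer, then back off if the seq says it is write-locked.
 */
static bool toy_vma_try_read(struct toy_vma *vma, unsigned int mm_seq,
			     unsigned int writer_bit)
{
	unsigned int old = atomic_load(&vma->refcnt);

	do {
		if (old == 0 || (old & writer_bit))
			return false;
	} while (!atomic_compare_exchange_weak(&vma->refcnt, &old, old + 1));

	if (vma->lock_seq == mm_seq) {
		atomic_fetch_sub(&vma->refcnt, 1); /* drop our reference */
		return false;
	}
	return true;
}
```

In this model a reader that passes the same seq as the one used at init
time is turned away even though no writer bit was ever set, which is the
effect the pre-add write lock is after.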