On Wed, Dec 18, 2024 at 1:41 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
>
> > > So I just replied there, and no, I don't think it makes sense. Just put
> > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> >
> > That's very appealing indeed and makes things much simpler. The
> > problem I see with that is the case when we detach a vma from the tree
> > to isolate it, then do some cleanup and only then free it. That's done
> > in vms_gather_munmap_vmas() here:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > even might reattach detached vmas back:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > detached state is not final and we can't destroy the object that
> > reached this state.
>
> Urgh, so that's the munmap() path, but arguably when that fails, the
> map stays in place.
>
> I think this means you're marking detached too soon; you should only
> mark detached once you reach the point of no return.
>
> That said, once you've reached the point of no return; and are about to
> go remove the page-tables, you very much want to ensure a lack of
> concurrency.
>
> So perhaps waiting for out-standing readers at this point isn't crazy.
>
> Also, I'm having a very hard time reading this maple tree stuff :/
> Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> second tree, it does not in fact unlink them from the mm yet.

Yes, I think you are correct.

>
> AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> mm -- and that being able to fail is mind boggling and I suppose is what
> gives rise to much of this insanity :/
>
> Anyway, I would expect remove_vma() to be the one that marks it detached
> (it's already unreachable through vma_lookup() at this point) and there
> you should wait for concurrent readers to bugger off.

There is an issue with that.
Note that vms_complete_munmap_vmas(), which calls remove_vma(), might
drop the mmap write lock, so detaching there without the write lock
would break the current rules.

>
> > We could change states to: 0=unused (we can free
> > the object), 1=detached, 2=attached, etc. but then vma_start_read()
> > should do something like refcount_inc_more_than_one() instead of
> > refcount_inc_not_zero(). Would you be ok with such an approach?
>
> Urgh, I would strongly suggest ditching refcount_t if we go this route.
> The thing is; refcount_t should remain a 'simple' straight forward
> interface and not allow people to do the wrong thing. Its not meant to
> be the kitchen sink -- we have atomic_t for that.

Ack. If we go this route I'll use atomics directly.

>
> Anyway, the more common scheme at that point is using -1 for 'free', I
> think folio->_mapcount uses that even. For that see:
> atomic_add_negative*().

Thanks for the reference.

>
> > > Additionally, having vma_end_write() would allow you to put a lockdep
> > > annotation in vma_{start,end}_write() -- which was I think the original
> > > reason I proposed it a while back, that and having improved clarity when
> > > reading the code, since explicitly marking the end of a section is
> > > helpful.
> >
> > The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
> > vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
> > tracking that rw_semaphore). If I implement vma_end_write() then it
> > will simply be something like:
> >
> > void vma_end_write(vma)
> > {
> >         vma_assert_write_locked(vma);
> >         vma->vm_lock_seq = UINT_MAX;
> > }
> >
> > so, vmlock_dep_map would not be involved.
>
> That's just weird; why would you not track vma_{start,end}_write() with
> the exclusive side of the 'rwsem' dep_map ?
>
> > If you want to track vma->vm_lock_seq with a separate lockdep, that
> > would be more complicated.
> > Specifically for vma_end_write_all() that
> > would require us to call rwsem_release() on all locked vmas, however
> > we currently do not track individual locked vmas. vma_end_write_all()
> > allows us not to worry about tracking them, knowing that once we do
> > mmap_write_unlock() they all will get unlocked with one increment of
> > mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
> > with vma_end_write() and unlock vmas individually across the mm code,
> > that would be a sizable effort. If that is indeed your ultimate goal,
> > I can do that as a separate project: introduce vma_end_write(),
> > gradually add them in required places (not yet sure how complex that
> > would be), then retire vma_end_write_all() and add a lockdep for
> > vma->vm_lock_seq.
>
> Yeah, so ultimately I think it would be clearer if you explicitly mark
> the point where the vma modification is 'done'. But I don't suppose we
> have to do that here.

Ack.
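
To make the vm_lock_seq discussion above easier to follow, here is a
minimal userspace sketch of the sequence-number scheme. It is not kernel
code: struct mm_sketch, struct vma_sketch and every *_sketch() helper
are hypothetical stand-ins made up for illustration, there is no
locking, memory ordering or lockdep, and it assumes mm_lock_seq never
reaches UINT_MAX (no wraparound handling). It only shows why a single
increment of mm_lock_seq can unlock every vma at once, and what a
per-vma vma_end_write() along the lines quoted above might look like.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins; NOT the kernel's mm_struct / vm_area_struct. */
struct mm_sketch {
	unsigned int mm_lock_seq;	/* bumped once per bulk unlock */
};

struct vma_sketch {
	struct mm_sketch *mm;
	unsigned int vm_lock_seq;	/* equal to mm->mm_lock_seq: write-locked */
};

/* A vma counts as write-locked while its sequence matches the mm's. */
static bool vma_is_write_locked_sketch(const struct vma_sketch *vma)
{
	return vma->vm_lock_seq == vma->mm->mm_lock_seq;
}

/* Write-lock one vma by copying the current mm sequence into it. */
static void vma_start_write_sketch(struct vma_sketch *vma)
{
	vma->vm_lock_seq = vma->mm->mm_lock_seq;
}

/*
 * Per-vma unlock, modelled on the vma_end_write() outline quoted in the
 * mail: reset the sequence to UINT_MAX, which this sketch assumes the mm
 * sequence never takes.
 */
static void vma_end_write_sketch(struct vma_sketch *vma)
{
	assert(vma_is_write_locked_sketch(vma));
	vma->vm_lock_seq = UINT_MAX;
}

/*
 * Bulk unlock: one increment makes every stored vm_lock_seq stale, so no
 * list of individually locked vmas has to be kept or walked.
 */
static void vma_end_write_all_sketch(struct mm_sketch *mm)
{
	mm->mm_lock_seq++;
}
```

In this sketch, locking three vmas and then calling
vma_end_write_all_sketch() leaves all three unlocked without visiting
any of them, which is the property that makes a per-vma rwsem_release()
awkward: nothing records which vmas were locked. Calling
vma_end_write_sketch() on one vma first unlocks just that vma and leaves
the others locked until the bulk unlock.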