On Thu, Dec 12, 2024 at 6:19 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > On Thu, Dec 12, 2024 at 6:17 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > On Thu, Dec 12, 2024 at 1:17 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > > > On Wed, Dec 11, 2024 at 07:01:16PM -0800, Suren Baghdasaryan wrote: > > > > > > > > > > I think your proposal should work. Let me try to code it and see if > > > > > > > something breaks. > > > > > > > > Ok, I tried it out and things are a bit more complex: > > > > 1. We should allow write-locking a detached VMA, IOW vma_start_write() > > > > can be called when vm_refcnt is 0. > > > > > > This sounds dodgy, refcnt being zero basically means the object is dead > > > and you shouldn't be touching it no more. Where does this happen and > > > why? > > > > > > Notably, it being 0 means it is no longer in the mas tree and can't be > > > found anymore. > > > > It happens when a newly created vma that was not yet attached > > (vma->vm_refcnt = 0) is write-locked before being added into the vma > > tree. For example: > > mmap() > > mmap_write_lock() > > vma = vm_area_alloc() // vma->vm_refcnt = 0 (detached) > > //vma attributes are initialized > > vma_start_write() // write 0x8000 0001 into vma->vm_refcnt > > mas_store_gfp() > > vma_mark_attached() > > mmap_write_lock() // vma_end_write_all() > > s/mmap_write_lock()/mmap_write_unlock() > > > > In this scenario, we write-lock the VMA before adding it into the tree > > to prevent readers (pagefaults) from using it until we drop the > > mmap_write_lock(). In your proposal, the first thing vma_start_write() > > does is add(0x8000'0001) and that will trigger a warning. > > For now instead of add(0x8000'0001) I can play this game to avoid the warning: > > > > if (refcount_inc_not_zero(&vma->vm_refcnt)) > > refcount_add(0x80000000, &vma->vm_refcnt); > > else > > refcount_set(&vma->vm_refcnt, 0x80000001); > > > > this refcount_set() works because vma with vm_refcnt==0 could not be > > found by readers. I'm not sure this will still work when we add > > TYPESAFE_BY_RCU and introduce vma reuse possibility. > > > > > > > > > 2. Adding 0x80000000 saturates refcnt, so I have to use a lower bit > > > > 0x40000000 to denote writers. > > > > > > I'm confused, what? We're talking about atomic_t, right? > > > > I thought you suggested using refcount_t. According to > > https://elixir.bootlin.com/linux/v6.13-rc2/source/include/linux/refcount.h#L22 > > valid values would be [0..0x7fff_ffff] and 0x80000000 is outside of > > that range. What am I missing? > > > > > > > > > 3. Currently vma_mark_attached() can be called on an already attached > > > > VMA. With vma->detached being a separate attribute that works fine but > > > > when we combine it with the vm_lock things break (extra attach would > > > > leak into lock count). I'll see if I can catch all the cases when we > > > > do this and clean them up (not call vma_mark_attached() when not > > > > necessary). > > > > > > Right, I hadn't looked at that thing in detail, that sounds like it > > > needs a wee cleanup like you suggest. > > > > Yes, I'll embark on that today. Will see how much of a problem that is. Ok, I think I was able to implement this in a way that ignores duplicate attach/detach calls. One issue that I hit and don't know a good way to fix is a circular dependency in the header files once I try adding rcuwait into mm_struct. Once I include rcuwait.h into mm_types.h leads to the following cycle: In file included from ./arch/x86/include/asm/uaccess.h:12, from ./include/linux/uaccess.h:12, from ./include/linux/sched/task.h:13, from ./include/linux/sched/signal.h:9, from ./include/linux/rcuwait.h:6, from ./include/linux/mm_types.h:22, ./arch/x86/include/asm/uaccess.h includes mm_types.h. But in fact there is a shorter cycle: rcuwait.h needs signal.h since it uses uses inlined signal_pending_state() signal.h needs mm_types.h since it uses vm_fault_t The way I worked around it for now is by removing signal.h include from rcuwait.h and wrapping signal_pending_state() into a non-inlined function so I can forward-declare it. That requires adding linux/sched/signal.h or linux/sched/task.h into many other places: arch/x86/coco/sev/core.c | 1 + arch/x86/kernel/fpu/xstate.c | 1 + block/blk-lib.c | 1 + block/ioctl.c | 1 + drivers/accel/qaic/qaic_control.c | 1 + drivers/base/firmware_loader/main.c | 1 + drivers/block/ublk_drv.c | 1 + drivers/crypto/ccp/sev-dev.c | 1 + drivers/dma-buf/heaps/cma_heap.c | 1 + drivers/dma-buf/heaps/system_heap.c | 1 + drivers/gpio/gpio-sloppy-logic-analyzer.c | 1 + drivers/gpu/drm/etnaviv/etnaviv_gpu.c | 1 + drivers/iommu/iommufd/ioas.c | 1 + drivers/iommu/iommufd/pages.c | 1 + .../staging/vc04_services/interface/vchiq_arm/vchiq_dev.c | 1 + drivers/tty/n_tty.c | 1 + fs/bcachefs/fs-io-buffered.c | 1 + fs/bcachefs/journal_reclaim.c | 1 + fs/bcachefs/thread_with_file.c | 1 + fs/bcachefs/util.c | 1 + fs/btrfs/defrag.h | 1 + fs/btrfs/fiemap.c | 2 ++ fs/btrfs/free-space-cache.h | 1 + fs/btrfs/reflink.c | 1 + fs/exfat/balloc.c | 1 + fs/gfs2/ops_fstype.c | 1 + fs/kernel_read_file.c | 1 + fs/netfs/buffered_write.c | 1 + fs/zonefs/file.c | 1 + include/linux/fs.h | 2 +- include/linux/rcuwait.h | 4 ++-- io_uring/io_uring.h | 1 + kernel/dma/map_benchmark.c | 1 + kernel/futex/core.c | 1 + kernel/futex/pi.c | 1 + kernel/rcu/update.c | 5 +++++ kernel/task_work.c | 1 + lib/kunit/user_alloc.c | 1 + mm/damon/vaddr.c | 1 + mm/memcontrol-v1.c | 1 + mm/shrinker_debug.c | 1 + net/dns_resolver/dns_key.c | 1 + 42 files changed, 48 insertions(+), 3 deletions(-) I'm not sure if this is the best way to deal with this circular dependency. Any other ideas?