On Thu, Aug 15, 2024 at 11:58 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> +brauner for "struct file" lifetime
>
> On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > > attempting uprobe lookup speculatively.
>
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?
>
> (Sidenote: I find it kinda interesting that this is sort of going back
> in the direction of the old Speculative Page Faults design.)
>
> > > > > > > We rely on the newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > > validate that mm_struct stays intact for the entire duration of this
> > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > > >
> > > > > > > This allows us to avoid contention on mmap_lock in the absolute
> > > > > > > majority of cases, nicely improving uprobe/uretprobe scalability.
> > > > > > >
> > > > > >
> > > > > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > > > > me. :>
> > > > > >
> > > > > > I note the result of find_active_uprobe_speculative is immediately stale
> > > > > > in the face of modifications.
> > > > > >
> > > > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > > > overhead on archs where a release fence is not a de facto nop, and I
> > > > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > > > add it merely for uprobes. If there are bigger plans concerning it,
> > > > > > that's a different story of course.
> > > > > >
> > > > > > With this in mind I have to ask if instead you could perhaps get away
> > > > > > with the already present per-vma sequence counter?
> > > > >
> > > > > The per-vma sequence counter does not implement acquire/release logic;
> > > > > it relies on vma->vm_lock for synchronization. So if we want to use it,
> > > > > we would have to add additional memory barriers here. This is likely
> > > > > possible, but as I mentioned before we would need to ensure the
> > > > > pagefault path does not regress. OTOH mm->mm_lock_seq is already
> > > > > halfway there (it implements acquire/release logic); we just had to
> > > > > ensure mmap_write_lock() increments mm->mm_lock_seq.
> > > > >
> > > > > So, from the release fence overhead POV I think whether we use
> > > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > > > here.
> > > > >
> > > >
> > > > Per my previous e-mail I'm not particularly familiar with mm internals,
> > > > so I'm going to handwave a little bit with my $0.03 concerning multicore
> > > > in general, and if you disagree with it that's your business. For the
> > > > time being I have no interest in digging into any of this.
> > > >
> > > > Before I do, to prevent this thread from being a total waste, here are
> > > > some remarks concerning the patch with the assumption that the core idea
> > > > lands.
> > > >
> > > > From the commit message:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > attempting uprobe lookup speculatively.
> > > >
> > > > Just in case I'll note a nit that this paragraph will need to be removed
> > > > since the patch adding the flag is getting dropped.
> > >
> > > Yep, of course, I'll update all that for the next revision (I'll wait
> > > for non-RFC patches to land first before reposting).
> > >
> > > > A non-nit which may or may not end up mattering is that the flag (which
> > > > *is* set on the filep slab cache) makes things more difficult to
> > > > validate. Normal RCU usage guarantees that the object itself won't be
> > > > freed as long as you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > > > flag weakens it significantly -- the thing at hand will always be a
> > > > 'struct file', but it may get reallocated to *another* file from under
> > > > you. Whether this aspect plays a role here I don't know.
> > >
> > > Yes, that's ok and is accounted for. We care about that memory not going
> > > away from under us (I'm not even sure if it matters that it is still a
> > > struct file, tbh; I think that shouldn't matter, as we are prepared to
> > > deal with completely garbage values read from struct file).
> >
> > Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
> > vma->vm_file has not been freed and reused. That's where
> > mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
> > from under us one would have to take mmap_lock for write. If that
> > happens, mmap_lock_speculation_{start|end} should detect that and
> > terminate our speculation.
> >
> > >
> > > >
> > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > > +{
> > > > > +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > > +	struct mm_struct *mm = current->mm;
> > > > > +	struct uprobe *uprobe;
> > > > > +	struct vm_area_struct *vma;
> > > > > +	struct file *vm_file;
> > > > > +	struct inode *vm_inode;
> > > > > +	unsigned long vm_pgoff, vm_start;
> > > > > +	int seq;
> > > > > +	loff_t offset;
> > > > > +
> > > > > +	if (!mmap_lock_speculation_start(mm, &seq))
> > > > > +		return NULL;
> > > > > +
> > > > > +	rcu_read_lock();
> > > > > +
> > > >
> > > > I don't think there is a correctness problem here, but entering rcu
> > > > *after* deciding to speculatively do the lookup feels backwards.
> > >
> > > RCU should protect the VMA and file; mm itself won't go anywhere, so
> > > this seems ok.
> > >
> > > > > +	vma = vma_lookup(mm, bp_vaddr);
> > > > > +	if (!vma)
> > > > > +		goto bail;
> > > > > +
> > > > > +	vm_file = data_race(vma->vm_file);
> > > > > +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > > +		goto bail;
> > > > > +
> > > >
> > > > If vma teardown is allowed to progress and the file got fput'ed...
> > > >
> > > > > +	vm_inode = data_race(vm_file->f_inode);
> > > >
> > > > ... the inode can be NULL, I don't know if that's handled.
> > > >
> > >
> > > Yep, the inode pointer value is part of the RB-tree key, so if it's
> > > NULL, we just won't find a matching uprobe. Same for any other
> > > "garbage" f_inode value.
> > > Importantly, we should never dereference such inode pointers, at least
> > > until we find a valid uprobe (in which case we keep a reference to the
> > > inode).
> > >
> > > > More importantly though, per my previous description of
> > > > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > > > the inode you did find is completely unrelated.
> > > >
> > > > I understand the intent is to backpedal from everything should the mm
> > > > seqc change, but the above may happen to matter.
> > >
> > > Yes, I think we took that into account. All that we care about is
> > > memory "type safety", i.e., even if struct file's memory is reused,
> > > it's ok, we'll eventually detect the change and discard the wrong
> > > uprobe that we might have looked up by accident (though probably in
> > > most cases we just won't find a uprobe at all).
> > >
> > > > > +	vm_pgoff = data_race(vma->vm_pgoff);
> > > > > +	vm_start = data_race(vma->vm_start);
> > > > > +
> > > > > +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > +	uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > +	if (!uprobe)
> > > > > +		goto bail;
> > > > > +
> > > > > +	/* now double check that nothing about MM changed */
> > > > > +	if (!mmap_lock_speculation_end(mm, seq))
> > > > > +		goto bail;
> > > >
> > > > This leaks the reference obtained by find_uprobe_rcu().
> > >
> > > find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
> > > and if the caller needs a refcount bump it will have to use
> > > try_get_uprobe() (which might fail).
> > >
> > > > > +
> > > > > +	rcu_read_unlock();
> > > > > +
> > > > > +	/* happy case, we speculated successfully */
> > > > > +	return uprobe;
> > > > > +bail:
> > > > > +	rcu_read_unlock();
> > > > > +	return NULL;
> > > > > +}
> > > >
> > > > Now to some handwaving, here it is:
> > > >
> > > > The core of my concern is that adding more work to down_write on the
> > > > mmap semaphore comes with certain side-effects, and plausibly more
> > > > than a sufficient speedup can be achieved without doing it.
> >
> > AFAIK writers of mmap_lock are not considered a fast path. In a sense,
> > yes, we made any writer a bit heavier, but OTOH we also made
> > mm->mm_lock_seq a proper sequence count which allows us to locklessly
> > check if mmap_lock is write-locked. I think you asked whether there
> > will be other uses for mmap_lock_speculation_{start|end} and yes. For
> > example, I am planning to use them for printing /proc/{pid}/maps
> > without taking mmap_lock (when it's uncontended).
>
> What would be the goal of this - to avoid cacheline bouncing of the
> mmap lock between readers? Or to allow mmap_write_lock() to preempt
> /proc/{pid}/maps readers who started out uncontended?

The latter, from my early patchset which I need to refine
(https://lore.kernel.org/all/20240123231014.3801041-3-surenb@xxxxxxxxxx/):

This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such
as monitoring/data collection services) from blocking address space
updates.
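To illustrate what the mmap_lock_speculation_{start,end}() helpers boil
down to conceptually, here is an untested sketch (not the actual RFC
implementation; it assumes mm->mm_lock_seq is kept odd while mmap_lock
is write-locked, which is my reading of "a proper sequence count", not
something shown in this thread):

static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
{
	/*
	 * Snapshot the sequence count; pairs with the release when a
	 * writer bumps mm_lock_seq again on mmap_write_unlock().
	 */
	*seq = smp_load_acquire(&mm->mm_lock_seq);
	/* Assumed encoding: an odd value means mmap_lock is write-locked. */
	return !(*seq & 1);
}

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/* Order all speculative reads before rechecking the sequence. */
	smp_rmb();
	return seq == READ_ONCE(mm->mm_lock_seq);
}

The uprobe path quoted earlier then just wraps its lockless reads
between the two calls and falls back to the mmap_lock-protected lookup
if either check fails.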
>
> Is the idea that you'd change show_map_vma() to first do something
> like get_file_active() to increment the file refcount (because
> otherwise the dentry can be freed under you and you need the dentry
> for path printing), then recheck your sequence count on the mm or vma
> (to avoid accessing the dentry of an unrelated file that hasn't become
> userspace-visible yet and may not have a proper dentry pointer yet),
> then print the file path, drop the file reference again, and in the
> end recheck the sequence count again before actually returning the
> printed data to userspace?

Yeah, you can see the details in that link I posted above. See the
get_vma_snapshot() function.

>
> > If we have VMA seq counter-based detection it would be better (see
> > below).
> >
> > > > An mm-wide mechanism is just incredibly coarse-grained, and it may
> > > > happen to perform poorly when faced with a program which likes to
> > > > mess with its address space -- the fast path is going to keep
> > > > failing, only inducing *more* overhead as the code decides to
> > > > down_read the mmap semaphore.
> > > >
> > > > Furthermore, there may be work currently synchronized with down_write
> > > > which perhaps can transition to "merely" down_read, but by the time
> > > > that happens this and possibly other consumers will expect a change
> > > > in the sequence counter, messing with it.
> > > >
> > > > To my understanding the kernel supports parallel faults with per-vma
> > > > locking. I would find it surprising if the same machinery could not be
> > > > used to sort out uprobe handling above.
> >
> > From all the above, my understanding of your objection is that
> > checking mmap_lock during our speculation is too coarse-grained and
> > you would prefer to use the VMA seq counter to check that the VMA we
> > are working on is unchanged. I agree, that would be ideal. I had a
> > quick chat with Jann about this, and the conclusion we came to is that
> > we would need to add an additional smp_wmb() barrier inside
> > vma_start_write() and an smp_rmb() in the speculation code:
> >
> > static inline void vma_start_write(struct vm_area_struct *vma)
> > {
> >         int mm_lock_seq;
> >
> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >                 return;
> >
> >         down_write(&vma->vm_lock->lock);
> >         /*
> >          * We should use WRITE_ONCE() here because we can have concurrent reads
> >          * from the early lockless pessimistic check in vma_start_read().
> >          * We don't really care about the correctness of that early check, but
> >          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >          */
> >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > +       smp_wmb();
> >         up_write(&vma->vm_lock->lock);
> > }
> >
> > Note: up_write(&vma->vm_lock->lock) in vma_start_write() is not
> > enough because it is only one-way permeable (it's a "RELEASE
> > operation"), and a later vma->vm_file store (or any other VMA
> > modification) can move before our vma->vm_lock_seq store.
> >
> > This makes vma_start_write() heavier but, again, it's write-locking,
> > so it should not be considered a fast path.
> > With this change we can use the code suggested by Andrii in
> > https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@xxxxxxxxxxxxxx/
> > with an additional smp_rmb():
> >
> > rcu_read_lock()
> > vma = find_vma(...)
> > if (!vma) /* bail */
>
> And maybe add some comments like:
>
> /*
> * Load the current VMA lock sequence - we will detect if anyone concurrently
> * locks the VMA after this point.
> * Pairs with smp_wmb() in vma_start_write().
> */
> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
>
> /*
> * Now we just have to detect if the VMA is already locked with its current
> * sequence count.
> *
> * The following load is ordered against the vm_lock_seq load above (using
> * smp_load_acquire() for the load above), and pairs with implicit memory
> * ordering between the mm_lock_seq write in mmap_write_unlock() and the
> * vm_lock_seq write in the next vma_start_write() after that (which can only
> * occur after an mmap_write_lock()).
> */
> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> >         /* bail, vma is write-locked */
> >
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
>
> /*
> * Order the speculative accesses above against the following vm_lock_seq
> * recheck.
> */
> > smp_rmb();
> > if (vma->vm_lock_seq != vm_lock_seq)
>
> (As I said on the other thread: Since this now relies on
> vma->vm_lock_seq not wrapping back to the same value for correctness,
> I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
> even better, an atomic64_t... though I realize we don't currently do
> that for seqlocks either.)
>
> >         /* bail, VMA might have changed */
> >
> > The smp_rmb() is needed so that the vma->vm_lock_seq load does not get
> > reordered and moved up before the speculation.
> >
> > I'm CC'ing Jann since he understands memory barriers way better than
> > me and will keep me honest.
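Putting the writer and reader sides of the per-VMA variant together,
the pairing discussed above would look roughly like the following
untested sketch (it condenses the snippets and suggested comments from
this thread; vm_mm, vm_lock_seq and mm_lock_seq are existing field
names, find_uprobe_rcu() and data_race() are as used in the patch, but
this is an illustration, not an actual patch):

static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	struct uprobe *uprobe = NULL;
	struct file *vm_file;
	int vm_lock_seq, mm_lock_seq;
	loff_t offset;

	rcu_read_lock();

	vma = vma_lookup(mm, bp_vaddr);
	if (!vma)
		goto bail;

	/*
	 * Snapshot the VMA lock sequence; pairs with the smp_wmb() that
	 * would be added to vma_start_write().
	 */
	vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
	mm_lock_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
	if (mm_lock_seq == vm_lock_seq)
		goto bail;	/* VMA is currently write-locked */

	/* speculative part: anything read here may be stale or garbage */
	vm_file = data_race(vma->vm_file);
	if (!vm_file)
		goto bail;
	/* (the VM_MAYEXEC etc. flag checks from the patch omitted here) */
	offset = (loff_t)(data_race(vma->vm_pgoff) << PAGE_SHIFT) +
		 (bp_vaddr - data_race(vma->vm_start));
	uprobe = find_uprobe_rcu(data_race(vm_file->f_inode), offset);

	/* order the speculative reads above against the recheck below */
	smp_rmb();
	if (READ_ONCE(vma->vm_lock_seq) != vm_lock_seq)
		uprobe = NULL;	/* VMA may have been modified; fall back */
bail:
	rcu_read_unlock();
	return uprobe;
}

Compared to the mm-wide mmap_lock_speculation_{start,end} approach in
the patch, a write lock taken on some unrelated VMA would no longer
invalidate the speculation here; only a concurrent write lock on this
VMA would.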