Re: [PATCH mm/bpf v2] mm: bpf: add find_vma_no_check() without lockdep_assert on mm->mmap_lock

On Wed, Sep 08, 2021 at 12:11:54PM -0700, Yonghong Song wrote:
> 
> 
> On 9/8/21 11:49 AM, Jason Gunthorpe wrote:
> > On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:
> > 
> > >   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> > > -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > > +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
> > > +					 unsigned long addr)
> > >   {
> > >   	struct rb_node *rb_node;
> > >   	struct vm_area_struct *vma;
> > > -	mmap_assert_locked(mm);
> > > +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >   	/* Check the cache first. */
> > >   	vma = vmacache_find(mm, addr);
> > >   	if (likely(vma))
> > > @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > >   	return vma;
> > >   }
> > > +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	lockdep_assert_held(&mm->mmap_lock);
> > > +	return find_vma_non_owner(mm, addr);
> > > +}
> > >   EXPORT_SYMBOL(find_vma);
> > >   /*
> > > 
> > > 
> > > Although this leaks more into the mm API and was referred to as ugly
> > > previously, it does provide a working solution and still maintains the
> > > same level of checking.
> > 
> > I think it is no better than before.
> > 
> > The solution must be to not break lockdep in the BPF side. If Peter's
> > reworked algorithm is not OK then BPF should drop/acquire the lockdep
> > when it punts the unlock to the WQ.
> 
> The current warning is triggered by bpf calling find_vma().

Yes, but that is because the lockdep tracking has already been dropped.
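
For completeness, this is roughly what the non-owner helpers amount to
(paraphrased from include/linux/mmap_lock.h, so take the details with a
grain of salt), which is why the lockdep-based assert in find_vma() has
nothing to match against even though the rwsem really is held:

/* rough paraphrase, modulo the mmap_lock tracepoint hooks */
static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
{
        if (mmap_read_trylock(mm)) {
                /* the rwsem stays held, but lockdep is told it was dropped */
                rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
                return true;
        }
        return false;
}

static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
{
        /* up_read_non_owner() skips the lockdep release as well */
        up_read_non_owner(&mm->mmap_lock);
}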

It looks to me like it basically does this:

        mmap_read_trylock_non_owner(current->mm)

        vma = find_vma(current->mm, ips[i]);

        if (!work) {
                mmap_read_unlock_non_owner(current->mm);
        } else {
                work->mm = current->mm;
                irq_work_queue(&work->irq_work);
        }


And the only reason for this lockdep madness is that the irq_work queued
via irq_work_queue() ends up doing:

static void do_up_read(struct irq_work *entry)
{
        struct stack_map_irq_work *work;

        if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
                return;

        work = container_of(entry, struct stack_map_irq_work, irq_work);
        mmap_read_unlock_non_owner(work->mm);
}


This is all about deferring the unlock to outside an IRQ context. The
lockdep ownership is transferred from the IRQ to the work, which is
something that we don't usually do or model in lockdep.

Lockdep complains with the straightforward code because exiting an IRQ
with locks held is illegal.

The saner version is more like:

        mmap_read_trylock(current->mm)

        vma = find_vma(current->mm, ips[i]);

        if (!work) {
                mmap_read_unlock(current->mm);
        } else {
                work->mm = current->mm;
                <tell lockdep we really do mean to return with the lock held>
                rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
                irq_work_queue(&work->irq_work);
        }


do_up_read():
       <tell lockdep the lock was already released from the map>
       mmap_read_unlock_non_owner(work->mm);
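
Fleshing that out a bit (a rough sketch only, to show where the lockdep
hand-off would sit; the helpers are the existing mmap_lock/rwsem ones,
the surrounding structure is simplified):

        /* NMI/IRQ side: take the lock normally so lockdep tracks it */
        if (mmap_read_trylock(current->mm)) {
                vma = find_vma(current->mm, ips[i]);    /* assert is satisfied */

                if (!work) {
                        mmap_read_unlock(current->mm);  /* usual lockdep release */
                } else {
                        work->mm = current->mm;
                        /*
                         * Hand ownership to the work: the rwsem itself stays
                         * held, only the lockdep dep_map is released, so
                         * returning from the IRQ is legal.
                         */
                        rwsem_release(&current->mm->mmap_lock.dep_map, _RET_IP_);
                        irq_work_queue(&work->irq_work);
                }
        }

        /*
         * Work side: the dep_map was already dropped before queueing, and
         * up_read_non_owner() behind mmap_read_unlock_non_owner() does not
         * call into lockdep, so nothing further should be needed there.
         */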

i.e. properly model in lockdep that ownership moves from the IRQ to the
work. At least we don't corrupt the core mm code with this insanity.

Jason



