Re: [PATCH mm/bpf v2] mm: bpf: add find_vma_no_check() without lockdep_assert on mm->mmap_lock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 9/8/21 4:33 PM, Jason Gunthorpe wrote:
On Wed, Sep 08, 2021 at 12:11:54PM -0700, Yonghong Song wrote:


On 9/8/21 11:49 AM, Jason Gunthorpe wrote:
On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:

   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
+					 unsigned long addr)
   {
   	struct rb_node *rb_node;
   	struct vm_area_struct *vma;
-	mmap_assert_locked(mm);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
   	/* Check the cache first. */
   	vma = vmacache_find(mm, addr);
   	if (likely(vma))
@@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
   	return vma;
   }
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	lockdep_assert_held(&mm->mmap_lock);
+	return find_vma_non_owner(mm, addr);
+}
   EXPORT_SYMBOL(find_vma);
   /*


Although this leaks more into the mm API and was referred to as ugly
previously, it does provide a working solution and still maintains the
same level of checking.

I think it is no better than before.

The solution must be to not break lockdep in the BPF side. If Peter's
reworked algorithm is not OK then BPF should drop/acquire the lockdep
when it punts the unlock to the WQ.

The current warning is triggered by bpf calling find_vma().

Yes, but that is because the lockdep has already been dropped.

It looks to me like it basically does this:

         mmap_read_trylock_non_owner(current->mm)

         vma = find_vma(current->mm, ips[i]);

         if (!work) {
                 mmap_read_unlock_non_owner(current->mm);
         } else {
                 work->mm = current->mm;
                 irq_work_queue(&work->irq_work);


And the only reason for this lockdep madness is because the
irq_work_queue() does:

static void do_up_read(struct irq_work *entry)
{
         struct stack_map_irq_work *work;

         if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
                 return;

         work = container_of(entry, struct stack_map_irq_work, irq_work);
         mmap_read_unlock_non_owner(work->mm);
}


This is all about deferring the unlock to outside an IRQ context. The
lockdep ownership is transfered from the IRQ to the work, which is
something that we don't usually do or model in lockdep.

Lockdep complains with the straightforward code because exiting an IRQ
with locks held is illegal.

The saner version is more like:

         mmap_read_trylock(current->mm)

         vma = find_vma(current->mm, ips[i]);

         if (!work) {
                 mmap_read_unlock(current->mm);
         } else {
                 work->mm = current->mm;
                 <tell lockdep we really do mean to return with
		 the lock held>
                 rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
                 irq_work_queue(&work->irq_work);


do_up_read():
        <tell lockdep the lock was already released from the map>
        mmap_read_unlock_non_owner(work->mm);

ie properly model in lockdep that ownership moves from the IRQ to the
work. At least we don't corrupt the core mm code with this insanity.

Thanks for the suggestion! I verified the above change indeed work.
Will send v4 soon.


Jason




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux