* Miaohe Lin <linmiaohe@xxxxxxxxxx> [240129 21:14]:
> On 2024/1/30 0:17, Liam R. Howlett wrote:
> > * Miaohe Lin <linmiaohe@xxxxxxxxxx> [240129 07:56]:
> >> On 2024/1/27 18:13, Miaohe Lin wrote:
> >>> On 2024/1/26 15:50, Muchun Song wrote:
> >>>>
> >>>>
> >>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> We've found what appears to be a lock issue that results in a blocked
> >>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
> >>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
> >>>>>
> >>>>> Based on some added pr_warn, we believe the following is happening:
> >>>>> When hugetlb_vmdelete_list is entered from the child process,
> >>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> >>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> >>>>> are true.
> >>>>>
> >>>>> While hugetlb_vmdelete_list is executing, the parent process does
> >>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> >>>>> lock for the same vma.
> >>>>>
> >>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> >>>>> the function, vma->vm_private_data is now populated, and hence
> >>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> >>>>> not hold.
> >>>>
> >>>> Thanks for your report. ->vm_private_data was introduced since the
> >>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
> >>>> that at that time (actually, it is a little complex in pmd sharing
> >>>> case). I saw Miaohe had reviewed many of those.
> >>>>
> >>>> CC Miaohe, maybe he has some ideas on this.
> >>>>
> >>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@xxxxxxxxxx/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
> >>>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>> dmesg:
> >>>>> WARNING: bad unlock balance detected!
> >>>>> 6.8.0-rc1+ #24 Not tainted
> >>>>> -------------------------------------
> >>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> >>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> >>>>> but there are no more locks to release!
> >>>
> >>> Thanks for your report. It seems there's a race:
> >>>
> >>> CPU 1                                                               CPU 2
> >>> fork                                                                hugetlbfs_fallocate
> >>>  dup_mmap                                                            hugetlbfs_punch_hole
> >>>   i_mmap_lock_write(mapping);
> >>>   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
> >>>   i_mmap_unlock_write(mapping);
> >>>   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!    i_mmap_lock_write(mapping);
> >>>                                                                      hugetlb_vmdelete_list
> >>>                                                                       vma_interval_tree_foreach
> >>>                                                                        hugetlb_vma_trylock_write -- Vma_lock is cleared.
> >>>   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
> >>>                                                                        hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
> >>>                                                                      i_mmap_unlock_write(mapping);
> >>>
> >>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
> >>> But I'm not really sure. I will take a more closed look at next week.
> >>
> >>
> >> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
> >> But I'm not sure whether there're side effects with this patch.
> >>
> >> linux-UJMmTI:/home/linmiaohe/mm # git diff
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 47ff3b35352e..2ef2711452e0 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>  		} else if (anon_vma_fork(tmp, mpnt))
> >>  			goto fail_nomem_anon_vma_fork;
> >>  		vm_flags_clear(tmp, VM_LOCKED_MASK);
> >> -		file = tmp->vm_file;
> >> -		if (file) {
> >> -			struct address_space *mapping = file->f_mapping;
> >> -
> >> -			get_file(file);
> >> -			i_mmap_lock_write(mapping);
> >> -			if (vma_is_shared_maywrite(tmp))
> >> -				mapping_allow_writable(mapping);
> >> -			flush_dcache_mmap_lock(mapping);
> >> -			/* insert tmp into the share list, just after mpnt */
> >> -			vma_interval_tree_insert_after(tmp, mpnt,
> >> -					&mapping->i_mmap);
> >> -			flush_dcache_mmap_unlock(mapping);
> >> -			i_mmap_unlock_write(mapping);
> >> -		}
> >>
> >>  		/*
> >>  		 * Copy/update hugetlb private vma information.
> >> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >>  		if (tmp->vm_ops && tmp->vm_ops->open)
> >>  			tmp->vm_ops->open(tmp);
> >>
> >> +		file = tmp->vm_file;
> >> +		if (file) {
> >> +			struct address_space *mapping = file->f_mapping;
> >> +
> >> +			get_file(file);
> >> +			i_mmap_lock_write(mapping);
> >> +			if (vma_is_shared_maywrite(tmp))
> >> +				mapping_allow_writable(mapping);
> >> +			flush_dcache_mmap_lock(mapping);
> >> +			/* insert tmp into the share list, just after mpnt. */
> >> +			vma_interval_tree_insert_after(tmp, mpnt,
> >> +					&mapping->i_mmap);
> >> +			flush_dcache_mmap_unlock(mapping);
> >> +			i_mmap_unlock_write(mapping);
> >> +		}
> >> +
> >>  		if (retval) {
> >>  			mpnt = vma_next(&vmi);
> >>  			goto loop_out;
> >>
> >>
> >
> > How is this possible? I thought, as specified in mm/rmap.c, that the
> > hugetlbfs path would be holding the mmap lock (which is also held in the
> > fork path)?
>
> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
> Or am I miss something?

You are correct.  It is also in mm/rmap.c:
 * hugetlbfs PageHuge() take locks in this order:
 *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
 *   vma_lock (hugetlb specific lock for pmd_sharing)
 *   mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
 *   page->flags PG_locked (lock_page)

Does it make sense for hugetlb_dup_vma_private() to assert
mapping->i_mmap_rwsem is locked?  When is that necessary?

I also think it might be safer to move the hugetlb_dup_vma_private()
call up instead of the insert into the interval tree down?
See the following comment from mmap.c:

			/*
			 * Put into interval tree now, so instantiated pages
			 * are visible to arm/parisc __flush_dcache_page
			 * throughout; but we cannot insert into address
			 * space until vma start or end is updated.
			 */

So there may be arch dependent reasons for this order.

Thanks,
Liam
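
For anyone following along, the unbalanced unlock described in this thread
can be replayed in a few lines of user-space C. This is only a rough sketch
under stated assumptions, not kernel code: struct fake_lock, struct fake_vma,
fake_trylock_write(), fake_unlock_write() and run_scenario() are all
hypothetical names, the rw_sema is reduced to a plain counter, and the two
CPUs are replayed sequentially in one thread. It only models the point that
the lock and unlock helpers each re-read the private pointer, so a pointer
installed in between makes the unlock act on a lock that was never taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types or APIs. */
struct fake_lock {
	int held;			/* outstanding write holds; < 0 means imbalance */
};

struct fake_vma {
	struct fake_lock *private_data;	/* models vma->vm_private_data */
};

/* Simplified model of the trylock: takes the lock only if one is attached. */
static bool fake_trylock_write(struct fake_vma *vma)
{
	assert(vma != NULL);
	if (vma->private_data)
		vma->private_data->held++;
	return true;			/* "nothing to lock" counts as success */
}

/* Simplified model of the unlock: re-reads the pointer at unlock time. */
static void fake_unlock_write(struct fake_vma *vma)
{
	assert(vma != NULL);
	if (vma->private_data)
		vma->private_data->held--;
}

/*
 * Replays the interleaving in a single thread and returns the final hold
 * count: 0 means balanced, -1 means an unlock without a matching lock.
 */
int run_scenario(bool open_runs_in_window)
{
	struct fake_lock lock = { .held = 0 };
	/* Child vma is already visible, but its lock pointer was cleared. */
	struct fake_vma child = { .private_data = NULL };

	fake_trylock_write(&child);		/* CPU 2: pointer is NULL, no lock taken */
	if (open_runs_in_window)
		child.private_data = &lock;	/* CPU 1: open installs a new lock */
	fake_unlock_write(&child);		/* CPU 2: unlocks whatever is there now */
	if (!open_runs_in_window)
		child.private_data = &lock;	/* CPU 1: open runs after CPU 2 is done */

	return lock.held;
}
```

With this model, run_scenario(true) ends with a hold count of -1, which is the
analogue of the "bad unlock balance" warning, while run_scenario(false) stays
at 0. Either reordering discussed above (inserting into the interval tree
later, or setting up the private data earlier) amounts to making sure the
pointer cannot change between the two helper calls.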