On 2024/1/30 12:08, Liam R. Howlett wrote:
> * Miaohe Lin <linmiaohe@xxxxxxxxxx> [240129 21:14]:
>> On 2024/1/30 0:17, Liam R. Howlett wrote:
>>> * Miaohe Lin <linmiaohe@xxxxxxxxxx> [240129 07:56]:
>>>> On 2024/1/27 18:13, Miaohe Lin wrote:
>>>>> On 2024/1/26 15:50, Muchun Song wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> We've found what appears to be a lock issue that results in a blocked
>>>>>>> process somewhere in hugetlbfs for shared maps; seemingly from an
>>>>>>> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>>>>>>>
>>>>>>> Based on some added pr_warn, we believe the following is happening:
>>>>>>> When hugetlb_vmdelete_list is entered from the child process,
>>>>>>> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
>>>>>>> not lock, since neither __vma_shareable_lock nor __vma_private_lock
>>>>>>> are true.
>>>>>>>
>>>>>>> While hugetlb_vmdelete_list is executing, the parent process does
>>>>>>> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
>>>>>>> lock for the same vma.
>>>>>>>
>>>>>>> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
>>>>>>> the function, vma->vm_private_data is now populated, and hence
>>>>>>> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
>>>>>>> not hold.
>>>>>>
>>>>>> Thanks for your report. ->vm_private_data was introduced since the
>>>>>> series [1]. So I suspect it was caused by this. But I haven't reviewed
>>>>>> that at that time (actually, it is a little complex in pmd sharing
>>>>>> case). I saw Miaohe had reviewed many of those.
>>>>>>
>>>>>> CC Miaohe, maybe he has some ideas on this.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@xxxxxxxxxx/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>> dmesg:
>>>>>>> WARNING: bad unlock balance detected!
>>>>>>> 6.8.0-rc1+ #24 Not tainted
>>>>>>> -------------------------------------
>>>>>>> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
>>>>>>> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
>>>>>>> but there are no more locks to release!
>>>>>
>>>>> Thanks for your report. It seems there's a race:
>>>>>
>>>>> CPU 1                                                               CPU 2
>>>>> fork                                                                hugetlbfs_fallocate
>>>>>  dup_mmap                                                            hugetlbfs_punch_hole
>>>>>   i_mmap_lock_write(mapping);
>>>>>   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
>>>>>   i_mmap_unlock_write(mapping);
>>>>>   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!     i_mmap_lock_write(mapping);
>>>>>                                                                       hugetlb_vmdelete_list
>>>>>                                                                        vma_interval_tree_foreach
>>>>>                                                                         hugetlb_vma_trylock_write -- Vma_lock is cleared.
>>>>>   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
>>>>>                                                                         hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
>>>>>                                                                       i_mmap_unlock_write(mapping);
>>>>>
>>>>> hugetlb_dup_vma_private and hugetlb_vm_op_open are called outside i_mmap_rwsem lock. So there will be another bugs behind it.
>>>>> But I'm not really sure. I will take a more closed look at next week.
>>>>
>>>>
>>>> This can be fixed by deferring vma_interval_tree_insert_after() until vma is fully initialized.
>>>> But I'm not sure whether there're side effects with this patch.
>>>>
>>>> linux-UJMmTI:/home/linmiaohe/mm # git diff
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 47ff3b35352e..2ef2711452e0 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -712,21 +712,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>  		} else if (anon_vma_fork(tmp, mpnt))
>>>>  			goto fail_nomem_anon_vma_fork;
>>>>  		vm_flags_clear(tmp, VM_LOCKED_MASK);
>>>> -		file = tmp->vm_file;
>>>> -		if (file) {
>>>> -			struct address_space *mapping = file->f_mapping;
>>>> -
>>>> -			get_file(file);
>>>> -			i_mmap_lock_write(mapping);
>>>> -			if (vma_is_shared_maywrite(tmp))
>>>> -				mapping_allow_writable(mapping);
>>>> -			flush_dcache_mmap_lock(mapping);
>>>> -			/* insert tmp into the share list, just after mpnt */
>>>> -			vma_interval_tree_insert_after(tmp, mpnt,
>>>> -					&mapping->i_mmap);
>>>> -			flush_dcache_mmap_unlock(mapping);
>>>> -			i_mmap_unlock_write(mapping);
>>>> -		}
>>>>
>>>>  		/*
>>>>  		 * Copy/update hugetlb private vma information.
>>>> @@ -747,6 +732,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>>  		if (tmp->vm_ops && tmp->vm_ops->open)
>>>>  			tmp->vm_ops->open(tmp);
>>>>
>>>> +		file = tmp->vm_file;
>>>> +		if (file) {
>>>> +			struct address_space *mapping = file->f_mapping;
>>>> +
>>>> +			get_file(file);
>>>> +			i_mmap_lock_write(mapping);
>>>> +			if (vma_is_shared_maywrite(tmp))
>>>> +				mapping_allow_writable(mapping);
>>>> +			flush_dcache_mmap_lock(mapping);
>>>> +			/* insert tmp into the share list, just after mpnt. */
>>>> +			vma_interval_tree_insert_after(tmp, mpnt,
>>>> +					&mapping->i_mmap);
>>>> +			flush_dcache_mmap_unlock(mapping);
>>>> +			i_mmap_unlock_write(mapping);
>>>> +		}
>>>> +
>>>>  		if (retval) {
>>>>  			mpnt = vma_next(&vmi);
>>>>  			goto loop_out;
>>>>
>>>>
>>>
>>> How is this possible? I thought, as specified in mm/rmap.c, that the
>>> hugetlbfs path would be holding the mmap lock (which is also held in the
>>> fork path)?
>>
>> The fork path holds the mmap lock from parent A and other childs(except first child B) while hugetlbfs path
>> holds the mmap lock from first child B. So the mmap lock won't help here because it comes from different mm.
>> Or am I miss something?
>
> You are correct. It is also in mm/rmap.c:
>  * hugetlbfs PageHuge() take locks in this order:
>  *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>  *     vma_lock (hugetlb specific lock for pmd_sharing)
>  *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
>  *         page->flags PG_locked (lock_page)
>
> Does it make sense for hugetlb_dup_vma_private() to assert
> mapping->i_mmap_rwsem is locked? When is that necessary?

I'm afraid not. AFAICS, vma_lock (vma->vm_private_data) is only modified when the vma is
created or destroyed, and vma_lock is not supposed to be in use at that time.

>
> I also think it might be safer to move the hugetlb_dup_vma_private()
> call up instead of the insert into the interval tree down?
> See the following comment from mmap.c:
>
>         /*
>          * Put into interval tree now, so instantiated pages
>          * are visible to arm/parisc __flush_dcache_page
>          * throughout; but we cannot insert into address
>          * space until vma start or end is updated.
>          */
>
> So there may be arch dependent reasons for this order.

Yes, it should be safer to move the hugetlb_dup_vma_private() call up. But we would also need
to move the tmp->vm_ops->open(tmp) call up, or the race still exists:

CPU 1                                                               CPU 2
fork                                                                hugetlbfs_fallocate
 dup_mmap                                                            hugetlbfs_punch_hole
  hugetlb_dup_vma_private -- Clear vma_lock. <-- it is moved up.
  i_mmap_lock_write(mapping);
  vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
  i_mmap_unlock_write(mapping);
                                                                      i_mmap_lock_write(mapping);
                                                                      hugetlb_vmdelete_list
                                                                       vma_interval_tree_foreach
                                                                        hugetlb_vma_trylock_write -- Vma_lock is already cleared.
  tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                                                        hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                                                      i_mmap_unlock_write(mapping);

My patch is not meant to be a complete solution. It is only meant to demonstrate the race and
fix it quickly. It would be great if you or someone else could provide a better and safer solution.

Thanks.

>
> Thanks,
> Liam
>
> .
>
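
For reference, the unlock imbalance described above can be modelled in user space. This is only a sketch under assumptions, not kernel code: every name prefixed with model_ is hypothetical, the "lock" is a plain counter rather than an rw_semaphore, and the two CPUs are replayed sequentially in the order shown in the diagrams.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hugetlb vma_lock; 'held' counts write holds. */
struct model_lock {
	int held;
};

/* Hypothetical stand-in for a vma; private_data plays the role of vm_private_data. */
struct model_vma {
	struct model_lock *private_data;
};

/* Takes the lock only if one is attached; "nothing to lock" counts as success. */
static int model_trylock_write(struct model_vma *vma)
{
	if (vma->private_data)
		vma->private_data->held++;
	return 1;
}

/* Re-reads private_data, so it can see a lock attached after the trylock. */
static void model_unlock_write(struct model_vma *vma)
{
	if (vma->private_data)
		vma->private_data->held--;
}

/* Models vm_ops->open() attaching a fresh lock to the child vma. */
static void model_open(struct model_vma *vma, struct model_lock *lock)
{
	lock->held = 0;
	vma->private_data = lock;
}

/* Racy order: the vma is already visible to the walker when open() runs. */
int model_racy_order(void)
{
	struct model_lock lock = { 0 };
	struct model_vma vma = { NULL };

	model_trylock_write(&vma);	/* CPU 2: no lock attached, nothing taken */
	model_open(&vma, &lock);	/* CPU 1: lock attached mid-walk */
	model_unlock_write(&vma);	/* CPU 2: releases a lock it never took */
	return lock.held;		/* negative: bad unlock balance */
}

/* Deferred order: the vma becomes visible only after open() has run. */
int model_deferred_order(void)
{
	struct model_lock lock = { 0 };
	struct model_vma vma = { NULL };

	model_open(&vma, &lock);	/* CPU 1: vma fully initialised first */
	model_trylock_write(&vma);	/* CPU 2: takes the attached lock */
	model_unlock_write(&vma);	/* CPU 2: releases the lock it took */
	return lock.held;		/* balanced */
}
```

In this model, model_racy_order() ends with a hold count of -1 and model_deferred_order() ends with 0. That matches the reasoning for deferring the interval tree insert until the vma is fully initialized, but it says nothing about the arch-dependent ordering concern raised above.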