> On Jan 26, 2024, at 04:28, Thorvald Natvig <thorvald@xxxxxxxxxx> wrote:
>
> We've found what appears to be a lock issue that results in a blocked
> process somewhere in hugetlbfs for shared maps; seemingly from an
> interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.
>
> Based on some added pr_warn, we believe the following is happening:
> When hugetlb_vmdelete_list is entered from the child process,
> vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
> not lock, since neither __vma_shareable_lock nor __vma_private_lock
> are true.
>
> While hugetlb_vmdelete_list is executing, the parent process does
> fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
> lock for the same vma.
>
> Thus, when the hugetlb_vmdelete_list in the child reaches the end of
> the function, vma->vm_private_data is now populated, and hence
> hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
> not hold.

Thanks for your report. ->vm_private_data was introduced by the
series [1], so I suspect that series caused this. I did not review it
at the time (it is a little complex in the pmd sharing case), but I saw
that Miaohe reviewed many of those patches. CC'ing Miaohe; maybe he has
some ideas on this.

[1] https://lore.kernel.org/all/20220914221810.95771-7-mike.kravetz@xxxxxxxxxx/T/#m2141e4bc30401a8ce490b1965b9bad74e7f791ff

Thanks.

>
> dmesg:
> WARNING: bad unlock balance detected!
> 6.8.0-rc1+ #24 Not tainted
> -------------------------------------
> lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
> [<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
> but there are no more locks to release!
>
> 3 locks held by lock/2613:
> #0: ffff9b4bc6225450 (sb_writers#16){.+.+}-{0:0}, at:
> madvise_vma_behavior+0x4cc/0xcf0
> #1: ffff9ba4dc34eca0 (&sb->s_type->i_mutex_key#23){+.+.}-{3:3}, at:
> hugetlbfs_fallocate+0x3fe/0x620
> #2: ffff9ba4dc34ef38 (&hugetlbfs_i_mmap_rwsem_key){+.+.}-{3:3}, at:
> hugetlbfs_fallocate+0x438/0x620
>
> CPU: 17 PID: 2613 Comm: lock Not tainted 6.8.0-rc1+ #24
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 12/02/2023
> Call Trace:
> <TASK>
> dump_stack_lvl+0x77/0xe0
> ? hugetlb_vma_unlock_write+0x48/0x60
> dump_stack+0x10/0x20
> print_unlock_imbalance_bug+0x127/0x150
> lock_release+0x21a/0x3f0
> ? hugetlb_vma_unlock_write+0x48/0x60
> up_write+0x1c/0x1d0
> hugetlb_vma_unlock_write+0x48/0x60
> hugetlb_vmdelete_list+0x93/0xd0
> hugetlbfs_fallocate+0x4e1/0x620
> vfs_fallocate+0x153/0x4b0
> madvise_vma_behavior+0x4cc/0xcf0
> ? mas_prev+0x68/0x70
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? find_vma_prev+0x78/0xc0
> ? __pfx_madvise_vma_behavior+0x10/0x10
> madvise_walk_vmas+0xc4/0x140
> do_madvise+0x3df/0x450
> __x64_sys_madvise+0x2c/0x40
> do_syscall_64+0x8e/0x160
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? do_syscall_64+0x9b/0x160
> ? do_syscall_64+0x9b/0x160
> ? do_syscall_64+0x9b/0x160
> entry_SYSCALL_64_after_hwframe+0x6e/0x76
> RIP: 0033:0x7f55e0b23bbb
>
> Repro:
>
> #include <signal.h>
> #include <stddef.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> #define PSIZE (2048UL * 1024UL)
>
> int main(int argc, char **argv) {
>   char *buffer = mmap(NULL, PSIZE, PROT_READ | PROT_WRITE,
>                       MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
>   if (buffer == MAP_FAILED) {
>     perror("mmap");
>     exit(1);
>   }
>
>   pid_t remover = fork();
>
>   if (remover == 0) {
>     while(1) {
>       if (madvise(buffer, PSIZE, MADV_REMOVE) == -1) {
>         perror("madvise");
>         exit(1);
>       }
>     }
>   }
>
>   int wstatus;
>
>   for(int l = 0; l < 10000; ++l) {
>     pid_t childpid = fork();
>     if (childpid == 0) {
>       exit(0);
>     } else {
>       waitpid(childpid, &wstatus, 0);
>     }
>   }
>
>   kill(remover, SIGKILL);
>   waitpid(remover, &wstatus, 0);
>   printf("Clean exit\n");
> }
>
> - Thorvald