We've found what appears to be a lock issue that results in a blocked
process somewhere in hugetlbfs for shared maps. It seems to come from an
interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.

Based on some added pr_warn calls, we believe the following is happening:

When hugetlb_vmdelete_list is entered from the child process,
vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
not lock, since neither __vma_shareable_lock nor __vma_private_lock is
true.

While hugetlb_vmdelete_list is executing, the parent process does
fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
lock for the same vma.

Thus, when hugetlb_vmdelete_list in the child reaches the end of the
function, vma->vm_private_data is now populated, and hence
hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does not
hold.

dmesg:

WARNING: bad unlock balance detected!
6.8.0-rc1+ #24 Not tainted
-------------------------------------
lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
[<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
but there are no more locks to release!

3 locks held by lock/2613:
 #0: ffff9b4bc6225450 (sb_writers#16){.+.+}-{0:0}, at: madvise_vma_behavior+0x4cc/0xcf0
 #1: ffff9ba4dc34eca0 (&sb->s_type->i_mutex_key#23){+.+.}-{3:3}, at: hugetlbfs_fallocate+0x3fe/0x620
 #2: ffff9ba4dc34ef38 (&hugetlbfs_i_mmap_rwsem_key){+.+.}-{3:3}, at: hugetlbfs_fallocate+0x438/0x620

CPU: 17 PID: 2613 Comm: lock Not tainted 6.8.0-rc1+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/02/2023
Call Trace:
 <TASK>
 dump_stack_lvl+0x77/0xe0
 ? hugetlb_vma_unlock_write+0x48/0x60
 dump_stack+0x10/0x20
 print_unlock_imbalance_bug+0x127/0x150
 lock_release+0x21a/0x3f0
 ? hugetlb_vma_unlock_write+0x48/0x60
 up_write+0x1c/0x1d0
 hugetlb_vma_unlock_write+0x48/0x60
 hugetlb_vmdelete_list+0x93/0xd0
 hugetlbfs_fallocate+0x4e1/0x620
 vfs_fallocate+0x153/0x4b0
 madvise_vma_behavior+0x4cc/0xcf0
 ? mas_prev+0x68/0x70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? find_vma_prev+0x78/0xc0
 ? __pfx_madvise_vma_behavior+0x10/0x10
 madvise_walk_vmas+0xc4/0x140
 do_madvise+0x3df/0x450
 __x64_sys_madvise+0x2c/0x40
 do_syscall_64+0x8e/0x160
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f55e0b23bbb

Repro:

#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define PSIZE (2048UL * 1024UL)

int main(int argc, char **argv) {
    /* One shared, anonymous 2 MiB hugetlb mapping. */
    char *buffer = mmap(NULL, PSIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* Child: loop on MADV_REMOVE, which reaches hugetlb_vmdelete_list. */
    pid_t remover = fork();
    if (remover == 0) {
        while (1) {
            if (madvise(buffer, PSIZE, MADV_REMOVE) == -1) {
                perror("madvise");
                exit(1);
            }
        }
    }

    /* Parent: fork repeatedly; each fork ends up in hugetlb_vm_op_open. */
    int wstatus;
    for (int l = 0; l < 10000; ++l) {
        pid_t childpid = fork();
        if (childpid == 0) {
            exit(0);
        } else {
            waitpid(childpid, &wstatus, 0);
        }
    }

    kill(remover, SIGKILL);
    waitpid(remover, &wstatus, 0);
    printf("Clean exit\n");
}

- Thorvald