hugetlbfs: WARNING: bad unlock balance detected during MADV_REMOVE

Thorvald Natvig <thorvald@xxxxxxxxxx> · Thu, 25 Jan 2024 12:28:40 -0800

We've found what appears to be a locking issue in hugetlbfs for shared
mappings that results in a blocked process. It seems to come from an
interaction between hugetlb_vm_op_open and hugetlb_vmdelete_list.

Based on some added pr_warn calls, we believe the following is happening:
When hugetlb_vmdelete_list is entered from the child process,
vma->vm_private_data is NULL, and hence hugetlb_vma_trylock_write does
not take a lock, since neither __vma_shareable_lock nor
__vma_private_lock is true.

While hugetlb_vmdelete_list is executing, the parent process does
fork(), which ends up in hugetlb_vm_op_open, which in turn allocates a
lock for the same vma.

Thus, when the hugetlb_vmdelete_list in the child reaches the end of
the function, vma->vm_private_data is now populated, and hence
hugetlb_vma_unlock_write tries to unlock the vma_lock, which it does
not hold.
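
As a rough illustration of the sequence above, here is a single-threaded
user-space model. It is not kernel code: struct model_vma and all model_*
functions are made up for this sketch, a plain counter stands in for the
rw_sema, and the fork() in the parent is reduced to one pointer assignment.
It only assumes what is described above, namely that both the trylock and
the unlock path decide whether to act by checking vm_private_data at the
time they are called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; not the kernel's vm_area_struct. */
struct model_vma {
    void *private_data; /* models vma->vm_private_data (the vma_lock) */
    int write_held;     /* +1 for each lock taken, -1 for each unlock */
};

/* Models the "is there a lock to use?" check made on every call. */
bool model_has_lock(const struct model_vma *vma)
{
    return vma->private_data != NULL;
}

/* Models the trylock path: with no lock object, nothing is taken,
 * but the caller is still told to proceed. */
bool model_trylock_write(struct model_vma *vma)
{
    if (model_has_lock(vma))
        vma->write_held++;
    return true;
}

/* Models the unlock path: it re-evaluates the same check instead of
 * remembering what the trylock actually did. */
void model_unlock_write(struct model_vma *vma)
{
    if (model_has_lock(vma))
        vma->write_held--;
}

/* Replays the suspected interleaving and returns the final balance.
 * A negative value means an unlock was issued for a lock never taken. */
int model_replay_race(void)
{
    static int lock_object; /* stands in for the allocated vma_lock */
    struct model_vma vma = { .private_data = NULL, .write_held = 0 };

    /* Child, start of the delete path: private_data is NULL. */
    model_trylock_write(&vma);
    assert(vma.write_held == 0); /* nothing was locked */

    /* Parent fork(): the open path allocates a lock for the same vma. */
    vma.private_data = &lock_object;

    /* Child, end of the delete path: the check now passes. */
    model_unlock_write(&vma);

    return vma.write_held;
}
```

In this model, model_replay_race() ends with a balance of -1, which
corresponds to the "no more locks to release" complaint from lockdep
below.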

dmesg:
WARNING: bad unlock balance detected!
6.8.0-rc1+ #24 Not tainted
-------------------------------------
lock/2613 is trying to release lock (&vma_lock->rw_sema) at:
[<ffffffffa94c6128>] hugetlb_vma_unlock_write+0x48/0x60
but there are no more locks to release!

3 locks held by lock/2613:
 #0: ffff9b4bc6225450 (sb_writers#16){.+.+}-{0:0}, at: madvise_vma_behavior+0x4cc/0xcf0
 #1: ffff9ba4dc34eca0 (&sb->s_type->i_mutex_key#23){+.+.}-{3:3}, at: hugetlbfs_fallocate+0x3fe/0x620
 #2: ffff9ba4dc34ef38 (&hugetlbfs_i_mmap_rwsem_key){+.+.}-{3:3}, at: hugetlbfs_fallocate+0x438/0x620

CPU: 17 PID: 2613 Comm: lock Not tainted 6.8.0-rc1+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/02/2023
Call Trace:
 <TASK>
 dump_stack_lvl+0x77/0xe0
 ? hugetlb_vma_unlock_write+0x48/0x60
 dump_stack+0x10/0x20
 print_unlock_imbalance_bug+0x127/0x150
 lock_release+0x21a/0x3f0
 ? hugetlb_vma_unlock_write+0x48/0x60
 up_write+0x1c/0x1d0
 hugetlb_vma_unlock_write+0x48/0x60
 hugetlb_vmdelete_list+0x93/0xd0
 hugetlbfs_fallocate+0x4e1/0x620
 vfs_fallocate+0x153/0x4b0
 madvise_vma_behavior+0x4cc/0xcf0
 ? mas_prev+0x68/0x70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? find_vma_prev+0x78/0xc0
 ? __pfx_madvise_vma_behavior+0x10/0x10
 madvise_walk_vmas+0xc4/0x140
 do_madvise+0x3df/0x450
 __x64_sys_madvise+0x2c/0x40
 do_syscall_64+0x8e/0x160
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 ? do_syscall_64+0x9b/0x160
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f55e0b23bbb

Repro:

#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define PSIZE (2048UL * 1024UL)

int main(int argc, char **argv) {
  /* One shared, anonymous 2 MiB hugetlb page. */
  char *buffer = mmap(NULL, PSIZE, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, -1, 0);
  if (buffer == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

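  /* The child loops on MADV_REMOVE while the parent keeps forking. */
  pid_t remover = fork();
  if (remover == -1) {
    perror("fork");
    exit(1);
  }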

  if (remover == 0) {
    while(1) {
      if (madvise(buffer, PSIZE, MADV_REMOVE) == -1) {
        perror("madvise");
        exit(1);
      }
    }
  }

  int wstatus;

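  /* Each fork() runs hugetlb_vm_op_open on the shared mapping. */
  for(int l = 0; l < 10000; ++l) {
    pid_t childpid = fork();
    if (childpid == -1) {
      /* Do not waitpid(-1): it would block on the remover forever. */
      perror("fork");
      kill(remover, SIGKILL);
      exit(1);
    } else if (childpid == 0) {
      exit(0);
    } else {
      waitpid(childpid, &wstatus, 0);
    }
  }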

  kill(remover, SIGKILL);
  waitpid(remover, &wstatus, 0);
  printf("Clean exit\n");
}

- Thorvald