+Linus and Ben

On Sun, Sep 05, 2021, syzbot wrote:
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> Modules linked in:
> CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597

...

> Call Trace:
>  kvmalloc include/linux/mm.h:806 [inline]
>  kvmalloc_array include/linux/mm.h:824 [inline]
>  kvcalloc include/linux/mm.h:829 [inline]
>  memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
>  kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
>  kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
>  kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
>  __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
>  kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
>  kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
>  kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236

KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit
7661809d493b ("mm: don't allow oversized kvmalloc() calls").  The
allocation size is absurd and doomed to fail in this particular
configuration (syzkaller is just throwing garbage at KVM), but for
humongous virtual machines it's feasible that KVM could run afoul of
the sanity check for an otherwise legitimate allocation.

The allocation in question is for KVM's "rmap", which is used to
translate a guest pfn to a host virtual address.  The size of the rmap
is an unsigned long per 4kb page in a memslot, i.e. on x86-64, 8 bytes
per 4096 bytes of guest memory in a memslot.  With INT_MAX=0x7fffffff,
KVM will trip the WARN and fail rmap allocations for memslots >= 1tb,
and Google already has VMs that create 1.5tb memslots (12tb of total
guest memory spread across 8 virtual NUMA nodes).

One caveat is that KVM's newfangled "TDP MMU" was designed specifically
to avoid the rmap allocation (among other things), precisely because of
its scalability issues, i.e. it's unlikely that KVM's so-called "legacy
MMU", which relies on the rmaps, would be used for such large VMs.
However, the legacy MMU is still the only option for shadowing nested
EPT/NPT, i.e. the rmap allocation would be problematic if/when nested
virtualization is enabled in large VMs.

KVM also has other allocations based on memslot size that are _not_
avoided by the TDP MMU and may eventually be problematic, though
presumably not for quite some time as they would require petabyte
memslots, e.g. a different metadata array requires 4 bytes per 2mb of
guest memory and so doesn't cross INT_MAX until a memslot reaches ~1pb.

I don't have any clever ideas to handle this from the KVM side, at
least not in the short term.  Long term, I think it would be doable to
reduce the rmap size for large memslots by 512x, but any change of that
nature would be very invasive to KVM and fairly risky.  It also
wouldn't prevent syzkaller from triggering this WARN at will.
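
For reference, a quick userspace sketch of the arithmetic above; the
helper names are invented for illustration and this is not KVM code,
but the constants (8 bytes of rmap per 4kb page, 4 bytes of metadata
per 2mb) are the ones discussed:

/*
 * Userspace sketch only (not KVM code): computes the two allocation
 * sizes discussed above and checks them against INT_MAX.
 */
#include <limits.h>
#include <stdio.h>

/* rmap: one 8-byte unsigned long per 4kb page in the memslot. */
static unsigned long long rmap_bytes(unsigned long long slot_bytes)
{
        return slot_bytes / 4096 * 8;
}

/* other metadata array: 4 bytes per 2mb of guest memory. */
static unsigned long long meta_bytes(unsigned long long slot_bytes)
{
        return slot_bytes / (2ULL << 20) * 4;
}

int main(void)
{
        unsigned long long one_tb = 1ULL << 40;
        unsigned long long one_pb = 1ULL << 50;

        /* 1tb memslot => 2gb of rmap, just over INT_MAX (0x7fffffff). */
        printf("rmap(1tb slot) = %llu bytes, over INT_MAX: %d\n",
               rmap_bytes(one_tb), rmap_bytes(one_tb) > INT_MAX);

        /* The 4-bytes-per-2mb array doesn't cross INT_MAX until ~1pb. */
        printf("meta(1pb slot) = %llu bytes, over INT_MAX: %d\n",
               meta_bytes(one_pb), meta_bytes(one_pb) > INT_MAX);
        return 0;
}

I.e. the rmap crosses INT_MAX right at a 1tb memslot, while the
4-bytes-per-2mb array doesn't become a problem until roughly 1pb.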