On Tue, Sep 7, 2021 at 10:30 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> +Linus and Ben
>
> On Sun, Sep 05, 2021, syzbot wrote:
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> > Modules linked in:
> > CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597
> > ...
>
> > Call Trace:
> >  kvmalloc include/linux/mm.h:806 [inline]
> >  kvmalloc_array include/linux/mm.h:824 [inline]
> >  kvcalloc include/linux/mm.h:829 [inline]
> >  memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
> >  kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
> >  kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
> >  kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
> >  __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
> >  kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
> >  kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
> >  kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236
>
> KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit 7661809d493b
> ("mm: don't allow oversized kvmalloc() calls"). The allocation size is absurd and
> doomed to fail in this particular configuration (syzkaller is just throwing garbage
> at KVM), but for humongous virtual machines it's feasible that KVM could run afoul
> of the sanity check for an otherwise legitimate allocation.
>
> The allocation in question is for KVM's "rmap" to translate a guest pfn to a host
> virtual address. The size of the rmap in question is an unsigned long per 4kb page
> in a memslot, i.e. on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot.
> With INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for
> memslots >= 1tb, and Google already has VMs that create 1.5tb memslots (12tb of
> total guest memory spread across 8 virtual NUMA nodes).
>
> One caveat is that KVM's newfangled "TDP MMU" was designed specifically to avoid
> the rmap allocation (among other things), precisely because of its scalability
> issues. I.e. it's unlikely KVM's so called "legacy MMU" that relies on the rmaps
> would be used for such large VMs. However, KVM's legacy MMU is still the only option
> for shadowing nested EPT/NPT, i.e. the rmap allocation would be problematic if/when
> nested virtualization is enabled in large VMs.
>
> KVM also has other allocations based on memslot size that are _not_ avoided by KVM's
> TDP MMU and may eventually be problematic, though presumably not for quite some time
> as it would require petabyte? memslots. E.g. a different metadata array requires
> 4 bytes per 2mb of guest memory.

KVM's dirty bitmap requires 1 bit per 4K, so we'd hit the INT_MAX limit
sooner than that, at 64TB memslots. Still, that can be avoided with Peter
Xu's dirty ring, and we're a ways away from 64TB memslots.

> I don't have any clever ideas to handle this from the KVM side, at least not in the
> short term. Long term, I think it would be doable to reduce the rmap size for large
> memslots by 512x, but any change of that nature would be very invasive to KVM and
> be fairly risky. It also wouldn't prevent syskaller from triggering this WARN at will.
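For reference, here's the envelope math for where each per-memslot
allocation crosses INT_MAX (a throwaway userspace program I wrote,
nothing from the KVM tree; the helper name and labels are mine):

  #include <stdio.h>

  /*
   * Each allocation is "meta" bytes of metadata per "unit" bytes of
   * guest memory, so it crosses INT_MAX once the memslot exceeds
   * INT_MAX / meta * unit bytes.
   */
  static void limit(const char *what, unsigned long long meta,
                    unsigned long long unit)
  {
          unsigned long long max = 0x7fffffffULL / meta * unit;

          printf("%-30s ~%.0f TiB\n", what, max / (double)(1ULL << 40));
  }

  int main(void)
  {
          limit("rmap: 8 bytes per 4KiB", 8, 4096);
          /* 1 bit per 4KiB == 1 byte per 32KiB */
          limit("dirty bitmap: 1 bit per 4KiB", 1, 8 * 4096);
          limit("metadata: 4 bytes per 2MiB", 4, 2 * 1024 * 1024);
          return 0;
  }

That lands at ~1 TiB for the rmap, ~64 TiB for the dirty bitmap, and
~1024 TiB (a petabyte) for the 4-bytes-per-2MiB array, matching the
numbers above.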
Not the most elegant solution, but KVM could, and perhaps should, impose a
maximum memslot size. KVM operations that act on an entire memslot (e.g.
dirty logging) can take a very long time with terabyte memslots. Forcing
userspace to manage memory in units of a more reasonable size would be a
good limitation to impose sooner rather than later, while there are few
users (if any outside Google) of these massive memslots.
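Concretely, that could boil down to one extra sanity check at memslot
creation. An untested sketch (KVM_MAX_MEMSLOT_PAGES is a made-up name,
and the value is purely illustrative; picking the real one is the hard
part):

  /*
   * Illustrative cap: 1 << 27 4KiB pages == 512GB per memslot, which
   * keeps the 8-bytes-per-page rmap (1GB here) comfortably below the
   * INT_MAX kvmalloc() sanity check.
   */
  #define KVM_MAX_MEMSLOT_PAGES	(1ULL << 27)

  /* In __kvm_set_memory_region(), next to the existing checks on
   * mem->memory_size: */
  	if (mem->memory_size >> PAGE_SHIFT > KVM_MAX_MEMSLOT_PAGES)
  		return -EINVAL;

Userspace that wants a bigger region would have to split it across
multiple adjacent memslots. The catch is choosing the limit: anything
low enough to keep the rmap under INT_MAX (i.e. < 1TB) would break the
1.5TB memslots you mention Google already creates, so existing users
would need to adapt first.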