On Tue, Sep 7, 2021 at 10:30 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> +Linus and Ben
>
> On Sun, Sep 05, 2021, syzbot wrote:
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 8419 at mm/util.c:597 kvmalloc_node+0x111/0x120 mm/util.c:597
> > Modules linked in:
> > CPU: 0 PID: 8419 Comm: syz-executor520 Not tainted 5.14.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:kvmalloc_node+0x111/0x120 mm/util.c:597
> > ...
>
> > Call Trace:
> >  kvmalloc include/linux/mm.h:806 [inline]
> >  kvmalloc_array include/linux/mm.h:824 [inline]
> >  kvcalloc include/linux/mm.h:829 [inline]
> >  memslot_rmap_alloc+0xf6/0x310 arch/x86/kvm/x86.c:11320
> >  kvm_alloc_memslot_metadata arch/x86/kvm/x86.c:11388 [inline]
> >  kvm_arch_prepare_memory_region+0x48d/0x610 arch/x86/kvm/x86.c:11462
> >  kvm_set_memslot+0xfe/0x1700 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1505
> >  __kvm_set_memory_region+0x761/0x10e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1668
> >  kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1689 [inline]
> >  kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1701 [inline]
> >  kvm_vm_ioctl+0x4c6/0x2330 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4236
>
> KVM is tripping the WARN_ON_ONCE(size > INT_MAX) added in commit 7661809d493b
> ("mm: don't allow oversized kvmalloc() calls"). The allocation size is absurd and
> doomed to fail in this particular configuration (syzkaller is just throwing garbage
> at KVM), but for humongous virtual machines it's feasible that KVM could run afoul
> of the sanity check for an otherwise legitimate allocation.
>
> The allocation in question is for KVM's "rmap" to translate a guest pfn to a host
> virtual address. The size of the rmap in question is an unsigned long per 4kb page
> in a memslot, i.e. on x86-64, 8 bytes per 4096 bytes of guest memory in a memslot.
> With INT_MAX=0x7fffffff, KVM will trip the WARN and fail rmap allocations for
> memslots >= 1tb, and Google already has VMs that create 1.5tb memslots (12tb of
> total guest memory spread across 8 virtual NUMA nodes).
>
> One caveat is that KVM's newfangled "TDP MMU" was designed specifically to avoid
> the rmap allocation (among other things), precisely because of its scalability
> issues. I.e. it's unlikely KVM's so called "legacy MMU" that relies on the rmaps
> would be used for such large VMs. However, KVM's legacy MMU is still the only option
> for shadowing nested EPT/NPT, i.e. the rmap allocation would be problematic if/when
> nested virtualization is enabled in large VMs.
>
> KVM also has other allocations based on memslot size that are _not_ avoided by KVM's
> TDP MMU and may eventually be problematic, though presumably not for quite some time
> as it would require petabyte? memslots. E.g. a different metadata array requires
> 4 bytes per 2mb of guest memory.

KVM's dirty bitmap requires 1 bit per 4K, so we'd hit the INT_MAX limit
sooner than that, at 64TB memslots. Still, that can be avoided with Peter
Xu's dirty ring, and we're a ways away from 64TB memslots.

> I don't have any clever ideas to handle this from the KVM side, at least not in the
> short term. Long term, I think it would be doable to reduce the rmap size for large
> memslots by 512x, but any change of that nature would be very invasive to KVM and
> be fairly risky. It also wouldn't prevent syskaller from triggering this WARN at will.
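For reference, here's the envelope math for where each per-memslot
allocation crosses INT_MAX (a throwaway userspace program I wrote,
nothing from the KVM tree; the helper name and labels are mine):

  #include <stdio.h>

  /*
   * Each allocation is "meta" bytes of metadata per "unit" bytes of
   * guest memory, so it crosses INT_MAX once the memslot exceeds
   * INT_MAX / meta * unit bytes.
   */
  static void limit(const char *what, unsigned long long meta,
                    unsigned long long unit)
  {
          unsigned long long max = 0x7fffffffULL / meta * unit;

          printf("%-30s ~%.0f TiB\n", what, max / (double)(1ULL << 40));
  }

  int main(void)
  {
          limit("rmap: 8 bytes per 4KiB", 8, 4096);
          /* 1 bit per 4KiB == 1 byte per 32KiB */
          limit("dirty bitmap: 1 bit per 4KiB", 1, 8 * 4096);
          limit("metadata: 4 bytes per 2MiB", 4, 2 * 1024 * 1024);
          return 0;
  }

That lands at ~1 TiB for the rmap, ~64 TiB for the dirty bitmap, and
~1024 TiB (a petabyte) for the 4-bytes-per-2MiB array, matching the
numbers above.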
Not the most elegant solution, but KVM could, and perhaps should, impose a
maximum memslot size. KVM operations that act on an entire memslot (e.g.
dirty logging) can take a very long time with terabyte memslots. Forcing
userspace to manage memory in units of a more reasonable size would be a
good limitation to impose sooner rather than later, while there are few
users (if any outside Google) of these massive memslots.
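Concretely, that could boil down to one extra sanity check at memslot
creation. An untested sketch (KVM_MAX_MEMSLOT_PAGES is a made-up name,
and the value is purely illustrative; picking the real one is the hard
part):

  /*
   * Illustrative cap: 1 << 27 4KiB pages == 512GB per memslot, which
   * keeps the 8-bytes-per-page rmap (1GB here) comfortably below the
   * INT_MAX kvmalloc() sanity check.
   */
  #define KVM_MAX_MEMSLOT_PAGES	(1ULL << 27)

  /* In __kvm_set_memory_region(), next to the existing checks on
   * mem->memory_size: */
  	if (mem->memory_size >> PAGE_SHIFT > KVM_MAX_MEMSLOT_PAGES)
  		return -EINVAL;

Userspace that wants a bigger region would have to split it across
multiple adjacent memslots. The catch is choosing the limit: anything
low enough to keep the rmap under INT_MAX (i.e. < 1TB) would break the
1.5TB memslots you mention Google already creates, so existing users
would need to adapt first.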