[LSF/MM TOPIC] Guest memory without struct page

Joao Martins <joao.m.martins@xxxxxxxxxx> · Fri, 14 Feb 2020 21:32:11 +0000

All system RAM is tracked by a metadata structure called 'struct page', which
amounts to 64 bytes for each page of a given granularity. On x86 (or any
system where PAGE_SIZE is 4K) this data structure represents roughly 1.5%
overhead of total capacity.

For hypervisors -- especially those without vhost/PV-devices, and just VFs --
persistent/volatile memory is largely assigned to userspace without the kernel
taking part in any of its I/O paths, except for VFIO. 1.5% may not seem like
much, but it is still a total of 16G per TB just for struct page, which is a lot
considering the hypervisor won't need it; that memory should instead be used to
create more guests (=Happy Users).
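
As a rough sanity check of the figures above, here is a minimal Python sketch of
the arithmetic. The helper name is made up for illustration; the 64-byte struct
page size and the 4K PAGE_SIZE are the values quoted in the text, and the
capacity is taken as a binary terabyte (1 TiB).

```python
STRUCT_PAGE_BYTES = 64  # per-page metadata size quoted in the text

TIB = 1 << 40
GIB = 1 << 30


def struct_page_overhead(capacity_bytes, page_size):
    """Bytes of struct page metadata needed to track capacity_bytes of RAM."""
    return (capacity_bytes // page_size) * STRUCT_PAGE_BYTES


overhead = struct_page_overhead(TIB, 4096)
print(overhead // GIB)       # 16 (GiB of struct page per TiB of RAM)
print(100 * overhead / TIB)  # 1.5625 (percent of total capacity)
```

The exact ratio is 64/4096, i.e. about 1.56%, which is where the "1.5%" and
"16G per TB" numbers come from.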

The RFC patches submitted here [0] approach this through device-dax, given the
interface it already provides for VMMs and given that struct page is also a
source of overhead for non-volatile memory assigned to guests. Essentially, the
series extends device-dax to create a PFNMAP vma with special pages (while
adding support for huge special pages). Host memory would be limited through
some form of mem=X, efi_fake_mem=Y@X:0x40000 or memmap=Y@X-1+0xefffffff, i.e.
dedicating Y amount of memory to guests.
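
For context on the VMM-facing interface mentioned above, this is a minimal
sketch of how userspace might map a device-dax region as guest memory. It only
illustrates the existing open()+mmap(MAP_SHARED) usage and says nothing about
how the RFC implements the mapping. The function name, the example device path
and the 2M default alignment are assumptions for illustration; the real
alignment depends on how the device was configured.

```python
import mmap
import os


def map_guest_memory(path, length, align=2 * 1024 * 1024):
    """Map `length` bytes of `path` shared and read/write.

    `align` is an assumed device alignment; device-dax mappings generally
    need to be a multiple of it.
    """
    if length <= 0 or length % align:
        raise ValueError("length must be a positive multiple of the alignment")
    fd = os.open(path, os.O_RDWR)
    try:
        # device-dax is accessed through mmap rather than read()/write().
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mmap object keeps its own duplicate of the fd


# Hypothetical usage, assuming a device named /dev/dax0.0 exists:
#   guest_ram = map_guest_memory("/dev/dax0.0", 1 << 30)
```

A VMM would then hand the resulting mapping to KVM as a guest memory region.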

Should vhost-{net,scsi,etc} be used, we copy from/to guest memory (which works
today for vhost-net, and could easily be adjusted for vhost-scsi), or perhaps
explore dynamically creating/freeing struct pages for temporary GUP pinning.

The goal of this topic would be to brainstorm the idea/proposal and also
discuss alternatives/pitfalls/limitations/other-usecases(*).

Regards,
  Joao

(*) To some extent there might be a similarity to '"Secret" memory userspace
APIs' subitem of this previously submitted topic[1] given that the guest memory
in the described topic isn't part of the direct map.

[0]
https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@xxxxxxxxxx/
[1] https://lore.kernel.org/linux-mm/20200206165900.GD17499@xxxxxxxxxxxxx/