All system RAM is tracked by a metadata structure called 'struct page', which takes 64 bytes and represents one page at a given granularity. On x86 (or any system where PAGE_SIZE is 4K) this data structure amounts to roughly 1.5% of total capacity (64/4096). For hypervisors, especially those using only VFs and no vhost/PV devices, persistent/volatile memory is largely assigned to userspace without the kernel taking part in any of its I/O paths, except for VFIO.

1.5% may not seem like much, but it is still 16G per TB just for struct page. That is a lot considering the hypervisor won't need it; the memory should instead be used to create more guests (=Happy Users).

The RFC patches submitted here [0] approach this through device-dax, given the interface it already provides for VMMs, and given that struct page is also a source of overhead for non-volatile memory assigned to guests. Essentially, the series extends device-dax to create a PFNMAP vma with special pages (while adding support for huge special pages). Host memory would be limited through some form of mem=X, efi_fake_mem=Y@X:0x40000 or memmap=Y@X-1+0xefffffff, i.e. dedicating Y amount to guest memory.

Should vhost-{net,scsi,etc} be used, we copy from/to guest memory (which works today for vhost-net, and is easily adjusted for vhost-scsi), or perhaps explore dynamically creating/freeing struct pages on temporary GUP pinning.

This topic would be to brainstorm the idea/proposal and also discuss alternatives/pitfalls/limitations/other-usecases(*).

Regards,
Joao

(*) To some extent there might be a similarity to the '"Secret" memory userspace APIs' subitem of this previously submitted topic [1], given that the guest memory in the described topic isn't part of the direct map.

[0] https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@xxxxxxxxxx/
[1] https://lore.kernel.org/linux-mm/20200206165900.GD17499@xxxxxxxxxxxxx/
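
As a sanity check on the overhead figures quoted in this proposal, here is a small sketch of the arithmetic. It assumes the 64-byte struct page and 4K PAGE_SIZE stated above; the function names are made up for illustration and do not correspond to any kernel interface. The 64K case is included only as a comparison point for systems with a larger base page size.

```python
# Back-of-the-envelope struct page overhead, assuming one 64-byte
# struct page per base page (figures taken from the proposal text).
# Function names are hypothetical, for illustration only.

STRUCT_PAGE_BYTES = 64
GIB = 1 << 30
TIB = 1 << 40


def struct_page_overhead(total_bytes, page_size=4096):
    """Bytes of struct page metadata needed to track total_bytes of RAM."""
    n_pages = total_bytes // page_size
    return n_pages * STRUCT_PAGE_BYTES


def overhead_fraction(page_size=4096):
    """Fraction of capacity consumed by struct page metadata."""
    return STRUCT_PAGE_BYTES / page_size


if __name__ == "__main__":
    for page_size in (4096, 65536):
        used = struct_page_overhead(TIB, page_size)
        print(
            f"PAGE_SIZE={page_size // 1024}K: "
            f"{used / GIB:g} GiB per TiB "
            f"({overhead_fraction(page_size) * 100:.4f}%)"
        )
```

With a 4K page size this gives 16 GiB per TiB, i.e. 64/4096 = 1.5625%, which is the "1.5%" and "16G per TB" mentioned above.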