On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> On Wed, 6 Jul 2022, Chao Peng wrote:
> > This is the v7 of this series which tries to implement the fd-based KVM
> > guest private memory.
>
> Here at last are my reluctant thoughts on this patchset.
>
> fd-based approach for supporting KVM guest private memory: fine.
>
> Use or abuse of memfd and shmem.c: mistaken.
>
> memfd_create() was an excellent way to put together the initial prototype.
>
> But since then, TDX in particular has forced an effort into preventing
> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>
> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>
> What use do you have for a filesystem here? Almost none.
> IIUC, what you want is an fd through which QEMU can allocate kernel
> memory, selectively free that memory, and communicate fd+offset+length
> to KVM. And perhaps an interface to initialize a little of that memory
> from a template (presumably copied from a real file on disk somewhere).
>
> You don't need shmem.c or a filesystem for that!
>
> If your memory could be swapped, that would be enough of a good reason
> to make use of shmem.c: but it cannot be swapped; and although there
> are some references in the mailthreads to it perhaps being swappable
> in future, I get the impression that will not happen soon if ever.
>
> If your memory could be migrated, that would be some reason to use
> filesystem page cache (because page migration happens to understand
> that type of memory): but it cannot be migrated.

Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping
is theoretically possible, but I'm not aware of any plans for it as of now.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

> Some of these impressions may come from earlier iterations of the
> patchset (v7 looks better in several ways than v5). I am probably
> underestimating the extent to which you have taken on board other
> usages beyond TDX and SEV private memory, and rightly want to serve
> them all with similar interfaces: perhaps there is enough justification
> for shmem there, but I don't see it. There was mention of userfaultfd
> in one link: does that provide the justification for using shmem?
>
> I'm afraid of the special demands you may make of memory allocation
> later on - surprised that huge pages are not mentioned already;
> gigantic contiguous extents? secretmem removed from direct map?

The design allows for extension to hugetlbfs if needed. A combination of
MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
implications for shmem: it is going to be a separate struct
memfile_backing_store.

I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE
to be movable if the platform supports it, and secretmem is not migratable
by design (without direct mapping fragmentation).

> Here's what I would prefer, and imagine much easier for you to maintain;
> but I'm no system designer, and may be misunderstanding throughout.
>
> QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> the fallocate syscall interface itself) to allocate and free the memory,
> ioctl for initializing some of it too.
> KVM in control of whether that
> fd can be read or written or mmap'ed or whatever, no need to prevent it
> in shmem.c, no need for flags, seals, notifications to and fro because
> KVM is already in control and knows the history. If shmem actually has
> value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> mmap, and i915/gem make use of it underneath. If shmem has nothing to
> add, just allocate and free kernel memory directly, recorded in your
> own xarray.

I guess a shim layer on top of shmem *can* work. I don't immediately see
why it would not. But I'm not sure it is the right direction. We risk
creating yet another parallel VM with its own rules/locking/accounting
that is opaque to core-mm.

Note that on machines that run TDX guests, such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.
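For readers new to the thread, the userspace lifecycle being debated looks,
in the memfd-based form of the series, roughly like the sketch below. It is
a sketch only: MFD_INACCESSIBLE is defined by this patchset, not by upstream
headers, and the step of handing fd+offset+length to KVM is left as a
comment because the exact ioctl/struct has changed between versions.

	/*
	 * Sketch, not code from the series. Error handling omitted.
	 * MFD_INACCESSIBLE is not in upstream uapi headers; the value
	 * below is the one used by the series' memfd patch.
	 */
	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	#ifndef MFD_INACCESSIBLE
	#define MFD_INACCESSIBLE	0x0008U
	#endif

	#define GUEST_MEM_SIZE	(1UL << 30)

	int main(void)
	{
		/* fd whose contents userspace cannot read, write or mmap */
		int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

		/* allocate backing memory for the whole guest range */
		fallocate(fd, 0, 0, GUEST_MEM_SIZE);

		/* selectively free a 2M range again (e.g. on conversion to shared) */
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 2UL << 20);

		/*
		 * fd+offset+length then get communicated to KVM when the memslot
		 * is registered (an extended KVM_SET_USER_MEMORY_REGION in this
		 * series); omitted here since the struct differs between versions.
		 */

		close(fd);
		return 0;
	}

In the /dev/kvm_something alternative quoted above, the same lifecycle
(allocate, selectively free, initialize some of it, hand the range to KVM)
would instead be reached through ioctls on a fd that KVM itself provides.

--
Kiryl Shutsemau / Kirill A. Shutemov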