Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

Hugh Dickins <hughd@xxxxxxxxxx> · Thu, 18 Aug 2022 20:00:41 -0700 (PDT)

On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
> theoretically possible, but I'm not aware of any plans as of now.
> 
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

I always forget, migration means different things to different audiences.
As an mm person, I was meaning page migration, whereas a virtualization
person thinks VM live migration (which that reference appears to be about),
a scheduler person task migration, an ornithologist bird migration, etc.

But you're an mm person too: you may have cited that reference in the
knowledge that TDX 1.5 Live Migration will entail page migration of the
kind I'm thinking of.  (Anyway, it's not important to clarify that here.)

> 
> > Some of these impressions may come from earlier iterations of the
> > patchset (v7 looks better in several ways than v5).  I am probably
> > underestimating the extent to which you have taken on board other
> > usages beyond TDX and SEV private memory, and rightly want to serve
> > them all with similar interfaces: perhaps there is enough justification
> > for shmem there, but I don't see it.  There was mention of userfaultfd
> > in one link: does that provide the justification for using shmem?
> > 
> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be separate struct memfile_backing_store.

Last year's MFD_HUGEPAGE proposal would have allowed you to do it with
memfd via tmpfs without needing to involve hugetlbfs; but you may prefer
the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.

But I've yet to see why you want to involve this or that filesystem
(with all its filesystem-icity suppressed) at all.  The backing store
is host memory, and tmpfs and hugetlbfs just impose their own
idiosyncrasies on how that memory is allocated; but I think you would
do better to choose your own idiosyncrasies in allocation directly -
you don't need a different "backing store" to choose between 4k or 2M
or 1G or whatever allocations.

tmpfs and hugetlbfs and page cache are designed around sharing memory:
TDX is designed around absolutely not sharing memory; and the further
uses which Sean foresees appear not to need it as page cache either.

Except perhaps for page migration reasons.  It's somewhat incidental,  
but of course page migration knows how to migrate page cache, so
masquerading as page cache will give a short cut to page migration,
when page migration becomes at all possible.

> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).
> 
> > Here's what I would prefer, and imagine much easier for you to maintain;
> > but I'm no system designer, and may be misunderstanding throughout.
> > 
> > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > the fallocate syscall interface itself) to allocate and free the memory,
> > ioctl for initializing some of it too.  KVM in control of whether that
> > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > in shmem.c, no need for flags, seals, notifications to and fro because
> > KVM is already in control and knows the history.  If shmem actually has
> > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > add, just allocate and free kernel memory directly, recorded in your
> > own xarray.
> 
> I guess shim layer on top of shmem *can* work. I don't see immediately why
> it would not. But I'm not sure it is right direction. We risk creating yet
> another parallel VM with own rules/locking/accounting that opaque to
> core-mm.

You are already proposing a new set of rules, foreign to how tmpfs works
for others.  You're right that KVM allocating large amounts of memory,
opaque to core-mm, carries risk: and you'd be right to say that shmem.c
provides some clues (security_vm_enough_memory checks, memcg charging,
user_shm_lock accounting) on what to remember.

But I'm not up to the job of being the one to police you there,
and you don't want to be waiting on me either.

To take a rather silly example: Ted just added chattr support to tmpfs,
and it fits in well.  But I don't now want to have to decide whether
"chattr +i" FS_IMMUTABLE_FL is or is not compatible with
MEMFILE_F_USER_INACCESSIBLE.  They are from different worlds,
and I'd prefer KVM to carry the weight of imposing INACCESSIBLE:
which seems easily done if it manages the fd, without making the
memory allocated to that fd accessible to those who hold the fd.

> 
> Note that on machines that run TDX guests such memory would likely be the
> bulk of memory use. Treating it as a fringe case may bite us one day.

Yes, I suspected that machines running TDX guests might well consume
most of the memory that way, but glad(?) to hear it confirmed.

I am not suggesting that this memory be treated as a fringe case, rather
the reverse: a different case, not something to hide away inside shmem.c.

Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint
of this usage?  Whether or not it's also included in "Shmem:", I expect
that its different characteristics will deserve its own display.

Hugh