Hi,

On Tue, Nov 29, 2022 at 6:10 PM Alexandru Elisei <alexandru.elisei@xxxxxxx> wrote:
>
> Hi,
>
> On Mon, Nov 28, 2022 at 08:49:29AM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > First I want to mention that I really appreciate your feedback, which
> > has already been quite helpful. I would like you to please consider
> > this to be an RFC, and let's use these patches as a basis for
> > discussion and how they can be improved when I respin them, even if
> > that means waiting until the kvm fd-based proposal is finalized.
>
> For that it's probably best if you add RFC to the subject prefix. That's
> very helpful to let the reviewers know what to focus on, more on the
> approach than on the finer details.

I will respin this as an RFC, and I will include the patches I have that
support the restricted memory proposal [*] for pKVM as it stands now. I
hope that will help show where I think this is heading.

[*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/

<snip>

> > > > With simpler I didn't mean fewer lines of code, rather that it's
> > > > easier to reason about, more shared code. With this series, hugetlb
> > >
> > > How is all of the code that has been added easier to reason about than one
> > > single mmap call?
>
> Would be nice if this would be answered.

As I said in a reply to a different comment: for me personally, as a
first-time kvmtool contributor, it was easier to reason about guest
memory when its canonical reference is a file descriptor that never
changes, rather than a userspace address that can change as the mapping
is aligned and trimmed (see the short sketch at the end of this mail).
I use the word "simpler" subjectively, that is, in my opinion.

<snip>

> >
> > Because we are sure that it will be fd-based, and because I thought
> > that getting a head start to set the scene would be helpful. The part
> > that is uncertain is the kvm capabilities, flags, and names of the new
> > memory region extensions, none of which I address in these patches.
>
> I see, that makes sense. My feedback so far is that you haven't provided a
> good reason why this change to anonymous memory makes sense right now.

I appreciate your feedback, and I hope we can continue this discussion
when I respin this as an RFC.

Cheers,
/fuad

> Thanks,
> Alex
>
> >
> > Cheers,
> > /fuad
> >
> > > Thanks,
> > > Alex
> > >
> > > >
> > > > Cheers,
> > > > /fuad
> > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > > could have a look on how I would go ahead building on these patches
> > > > > > for full support of private memory backed by an fd.
> > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > >
> > > > > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > > > > as yet another use case that would benefit from guest memory being
> > > > > > fd-based, should kvmtool decide to support it in the future.
> > > > > >
> > > > > > Cheers,
> > > > > > /fuad
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/
> > > > > > [2] https://github.com/qemu/qemu
> > > > > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > > > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > > > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Alex
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
> > > > > > > >
> > > > > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/
> > > > > > > > ---
> > > > > > > >  include/kvm/kvm.h  |  1 +
> > > > > > > >  include/kvm/util.h |  3 +++
> > > > > > > >  kvm.c              |  4 ++++
> > > > > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > > > > index 3872dc6..d0d519b 100644
> > > > > > > > --- a/include/kvm/kvm.h
> > > > > > > > +++ b/include/kvm/kvm.h
> > > > > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > > > > >  	struct kvm_config	cfg;
> > > > > > > >  	int			sys_fd;		/* For system ioctls(), i.e. /dev/kvm */
> > > > > > > >  	int			vm_fd;		/* For VM ioctls() */
> > > > > > > > +	int			ram_fd;		/* For guest memory. */
> > > > > > > >  	timer_t			timerid;	/* Posix timer for interrupts */
> > > > > > > >
> > > > > > > >  	int			nrcpus;		/* Number of cpus to run */
> > > > > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > > > > index 61a205b..369603b 100644
> > > > > > > > --- a/include/kvm/util.h
> > > > > > > > +++ b/include/kvm/util.h
> > > > > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  struct kvm;
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > > > > +				   u64 size, u64 align);
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > > > > >
> > > > > > > >  #endif /* KVM__UTIL_H */
> > > > > > > > diff --git a/kvm.c b/kvm.c
> > > > > > > > index 78bc0d8..ed29d68 100644
> > > > > > > > --- a/kvm.c
> > > > > > > > +++ b/kvm.c
> > > > > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > > > > >  	mutex_init(&kvm->mem_banks_lock);
> > > > > > > >  	kvm->sys_fd = -1;
> > > > > > > >  	kvm->vm_fd = -1;
> > > > > > > > +	kvm->ram_fd = -1;
> > > > > > > >
> > > > > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > > > > >  	kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > > > > >
> > > > > > > >  	kvm__arch_delete_ram(kvm);
> > > > > > > >
> > > > > > > > +	if (kvm->ram_fd >= 0)
> > > > > > > > +		close(kvm->ram_fd);
> > > > > > > > +
> > > > > > > >  	list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > > > > >  		list_del(&bank->list);
> > > > > > > >  		free(bank);
> > > > > > > > diff --git a/util/util.c b/util/util.c
> > > > > > > > index d3483d8..278bcc2 100644
> > > > > > > > --- a/util/util.c
> > > > > > > > +++ b/util/util.c
> > > > > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > > > > >  	return sfs.f_bsize;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > > > > >  {
> > > > > > > >  	const char *name = "kvmtool";
> > > > > > > >  	unsigned int flags = 0;
> > > > > > > >  	int fd;
> > > > > > > > -	void *addr;
> > > > > > > > -	int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -	if ((1ULL << htsize) != blk_size)
> > > > > > > > -		die("Hugepage size must be a power of 2.\n");
> > > > > > > > +	if (hugetlb) {
> > > > > > > > +		int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -	flags |= MFD_HUGETLB;
> > > > > > > > -	flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +		if ((1ULL << htsize) != blk_size)
> > > > > > > > +			die("Hugepage size must be a power of 2.\n");
> > > > > > > > +
> > > > > > > > +		flags |= MFD_HUGETLB;
> > > > > > > > +		flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +	}
> > > > > > > >
> > > > > > > >  	fd = memfd_create(name, flags);
> > > > > > > >  	if (fd < 0)
> > > > > > > > -		die("Can't memfd_create for hugetlbfs map\n");
> > > > > > > > +		die("Can't memfd_create for memory map\n");
> > > > > > > > +
> > > > > > > >  	if (ftruncate(fd, size) < 0)
> > > > > > > >  		die("Can't ftruncate for mem mapping size %lld\n",
> > > > > > > >  			(unsigned long long)size);
> > > > > > > > -	addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > > > > -	close(fd);
> > > > > > > > -	return addr;
> > > > > > > > +	return fd;
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >  {
> > > > > > > >  	u64 blk_size = 0;
> > > > > > > > +	int fd;
> > > > > > > >
> > > > > > > >  	/*
> > > > > > > >  	 * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >  		}
> > > > > > > >
> > > > > > > >  		kvm->ram_pagesize = blk_size;
> > > > > > > > -		return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > > > > >  	} else {
> > > > > > > >  		kvm->ram_pagesize = getpagesize();
> > > > > > > > -		return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > > > > >  	}
> > > > > > > > +
> > > > > > > > +	fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > > > > +	if (fd < 0)
> > > > > > > > +		return MAP_FAILED;
> > > > > > > > +
> > > > > > > > +	kvm->ram_fd = fd;
> > > > > > > > +	return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > > > > >  }
> > > > > > > > --
> > > > > > > > 2.38.1.431.g37b22c650d-goog
> > > > > > > >
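
P.S. To illustrate what I mean above by the fd being the canonical
reference to guest memory, here is a minimal standalone sketch (not
kvmtool code; the names and the size are only illustrative). The memfd
is created once and stays the same, while a mapping of it can be torn
down and recreated, possibly at a different address, without affecting
the backing memory:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t size = 1ULL << 21;	/* 2 MiB of "guest RAM" */
	void *addr;

	/* The fd is the stable, canonical handle to the memory. */
	int fd = memfd_create("guest-ram", 0);
	if (fd < 0 || ftruncate(fd, size) < 0)
		return 1;

	/* A mapping of it can come and go, and may land at a new address... */
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;
	munmap(addr, size);

	/* ...while the same fd keeps describing the same guest memory. */
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;
	printf("guest ram fd %d mapped at %p\n", fd, addr);

	munmap(addr, size);
	close(fd);
	return 0;
}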