Hi,

On Tue, Nov 29, 2022 at 6:10 PM Alexandru Elisei <alexandru.elisei@xxxxxxx> wrote:
>
> Hi,
>
> On Mon, Nov 28, 2022 at 08:49:29AM +0000, Fuad Tabba wrote:
> > Hi,
> >
> > First I want to mention that I really appreciate your feedback, which
> > has already been quite helpful. I would like you to please consider
> > this to be an RFC, and let's use these patches as a basis for
> > discussion and how they can be improved when I respin them, even if
> > that means waiting until the kvm fd-based proposal is finalized.
>
> For that it's probably best if you add RFC to the subject prefix. That's
> very helpful to let the reviewers know what to focus on, more on the
> approach than on the finer details.

I will respin this as an RFC, and I will include the patches I have that
support the restricted memory proposal [*] for pKVM as it stands now. I
hope that will help show where I think this is heading.

[*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/

<snip>

> > > > With simpler I didn't mean fewer lines of code, rather that it's
> > > > easier to reason about, more shared code. With this series, hugetlb
> > >
> > > How is all of the code that has been added easier to reason about than one
> > > single mmap call?
>
> Would be nice if this would be answered.

As I said in a reply to a different comment: for me personally, as a
first-time kvmtool contributor, it was easier to reason about guest
memory when its canonical reference is a file descriptor that never
changes, rather than a userspace address that can change as the mapping
is aligned and trimmed (see the short sketch at the end of this mail).
I use the word "simpler" subjectively, that is, in my opinion.

<snip>

> >
> > Because we are sure that it will be fd-based, and because I thought
> > that getting a head start to set the scene would be helpful. The part
> > that is uncertain is the kvm capabilities, flags, and names of the new
> > memory region extensions, none of which I address in these patches.
>
> I see, that makes sense. My feedback so far is that you haven't provided a
> good reason why this change to anonymous memory makes sense right now.

I appreciate your feedback, and I hope we can continue this discussion
when I respin this as an RFC.

Cheers,
/fuad

> Thanks,
> Alex
>
> >
> > Cheers,
> > /fuad
> >
> > > Thanks,
> > > Alex
> > >
> > > >
> > > > Cheers,
> > > > /fuad
> > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Alex
> > > > >
> > > > > > could have a look on how I would go ahead building on these patches
> > > > > > for full support of private memory backed by an fd.
> > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > >
> > > > > > I don’t plan on supporting IPC memory sharing. I just mentioned that
> > > > > > as yet another use case that would benefit from guest memory being
> > > > > > fd-based, should kvmtool decide to support it in the future.
> > > > > >
> > > > > > Cheers,
> > > > > > /fuad
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/
> > > > > > [2] https://github.com/qemu/qemu
> > > > > > [3] https://chromium.googlesource.com/chromiumos/platform/crosvm/
> > > > > > [4] https://github.com/chao-p/qemu/tree/privmem-v9
> > > > > > [5] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/fdmem-v9-core
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regarding IPC memory sharing, is mmap'ing an memfd file enough to enable
> > > > > > > that? If more work is needed for it, then wouldn't it make more sense to do
> > > > > > > all the changes at once? This change might look sensible right now, but it
> > > > > > > might turn out that it was the wrong way to go about it when someone
> > > > > > > actually starts implementing memory sharing.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Alex
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
> > > > > > > >
> > > > > > > > [*] https://lore.kernel.org/all/20221025151344.3784230-1-chao.p.peng@xxxxxxxxxxxxxxx/
> > > > > > > > ---
> > > > > > > >  include/kvm/kvm.h  |  1 +
> > > > > > > >  include/kvm/util.h |  3 +++
> > > > > > > >  kvm.c              |  4 ++++
> > > > > > > >  util/util.c        | 33 ++++++++++++++++++++-------------
> > > > > > > >  4 files changed, 28 insertions(+), 13 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
> > > > > > > > index 3872dc6..d0d519b 100644
> > > > > > > > --- a/include/kvm/kvm.h
> > > > > > > > +++ b/include/kvm/kvm.h
> > > > > > > > @@ -87,6 +87,7 @@ struct kvm {
> > > > > > > >  	struct kvm_config	cfg;
> > > > > > > >  	int			sys_fd;		/* For system ioctls(), i.e. /dev/kvm */
> > > > > > > >  	int			vm_fd;		/* For VM ioctls() */
> > > > > > > > +	int			ram_fd;		/* For guest memory. */
> > > > > > > >  	timer_t			timerid;	/* Posix timer for interrupts */
> > > > > > > >
> > > > > > > >  	int			nrcpus;		/* Number of cpus to run */
> > > > > > > > diff --git a/include/kvm/util.h b/include/kvm/util.h
> > > > > > > > index 61a205b..369603b 100644
> > > > > > > > --- a/include/kvm/util.h
> > > > > > > > +++ b/include/kvm/util.h
> > > > > > > > @@ -140,6 +140,9 @@ static inline int pow2_size(unsigned long x)
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  struct kvm;
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size);
> > > > > > > > +void *mmap_anon_or_hugetlbfs_align(struct kvm *kvm, const char *htlbfs_path,
> > > > > > > > +				   u64 size, u64 align);
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
> > > > > > > >
> > > > > > > >  #endif /* KVM__UTIL_H */
> > > > > > > > diff --git a/kvm.c b/kvm.c
> > > > > > > > index 78bc0d8..ed29d68 100644
> > > > > > > > --- a/kvm.c
> > > > > > > > +++ b/kvm.c
> > > > > > > > @@ -160,6 +160,7 @@ struct kvm *kvm__new(void)
> > > > > > > >  	mutex_init(&kvm->mem_banks_lock);
> > > > > > > >  	kvm->sys_fd = -1;
> > > > > > > >  	kvm->vm_fd = -1;
> > > > > > > > +	kvm->ram_fd = -1;
> > > > > > > >
> > > > > > > >  #ifdef KVM_BRLOCK_DEBUG
> > > > > > > >  	kvm->brlock_sem = (pthread_rwlock_t) PTHREAD_RWLOCK_INITIALIZER;
> > > > > > > > @@ -174,6 +175,9 @@ int kvm__exit(struct kvm *kvm)
> > > > > > > >
> > > > > > > >  	kvm__arch_delete_ram(kvm);
> > > > > > > >
> > > > > > > > +	if (kvm->ram_fd >= 0)
> > > > > > > > +		close(kvm->ram_fd);
> > > > > > > > +
> > > > > > > >  	list_for_each_entry_safe(bank, tmp, &kvm->mem_banks, list) {
> > > > > > > >  		list_del(&bank->list);
> > > > > > > >  		free(bank);
> > > > > > > > diff --git a/util/util.c b/util/util.c
> > > > > > > > index d3483d8..278bcc2 100644
> > > > > > > > --- a/util/util.c
> > > > > > > > +++ b/util/util.c
> > > > > > > > @@ -102,36 +102,38 @@ static u64 get_hugepage_blk_size(const char *htlbfs_path)
> > > > > > > >  	return sfs.f_bsize;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -static void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size, u64 blk_size)
> > > > > > > > +int memfd_alloc(u64 size, bool hugetlb, u64 blk_size)
> > > > > > > >  {
> > > > > > > >  	const char *name = "kvmtool";
> > > > > > > >  	unsigned int flags = 0;
> > > > > > > >  	int fd;
> > > > > > > > -	void *addr;
> > > > > > > > -	int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -	if ((1ULL << htsize) != blk_size)
> > > > > > > > -		die("Hugepage size must be a power of 2.\n");
> > > > > > > > +	if (hugetlb) {
> > > > > > > > +		int htsize = __builtin_ctzl(blk_size);
> > > > > > > >
> > > > > > > > -	flags |= MFD_HUGETLB;
> > > > > > > > -	flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +		if ((1ULL << htsize) != blk_size)
> > > > > > > > +			die("Hugepage size must be a power of 2.\n");
> > > > > > > > +
> > > > > > > > +		flags |= MFD_HUGETLB;
> > > > > > > > +		flags |= htsize << MFD_HUGE_SHIFT;
> > > > > > > > +	}
> > > > > > > >
> > > > > > > >  	fd = memfd_create(name, flags);
> > > > > > > >  	if (fd < 0)
> > > > > > > > -		die("Can't memfd_create for hugetlbfs map\n");
> > > > > > > > +		die("Can't memfd_create for memory map\n");
> > > > > > > > +
> > > > > > > >  	if (ftruncate(fd, size) < 0)
> > > > > > > >  		die("Can't ftruncate for mem mapping size %lld\n",
> > > > > > > >  			(unsigned long long)size);
> > > > > > > > -	addr = mmap(NULL, size, PROT_RW, MAP_PRIVATE, fd, 0);
> > > > > > > > -	close(fd);
> > > > > > > > -	return addr;
> > > > > > > > +	return fd;
> > > > > > > >  }
> > > > > > > >
> > > > > > > >  /* This function wraps the decision between hugetlbfs map (if requested) or normal mmap */
> > > > > > > >  void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >  {
> > > > > > > >  	u64 blk_size = 0;
> > > > > > > > +	int fd;
> > > > > > > >
> > > > > > > >  	/*
> > > > > > > >  	 * We don't /need/ to map guest RAM from hugetlbfs, but we do so
> > > > > > > > @@ -146,9 +148,14 @@ void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size)
> > > > > > > >  		}
> > > > > > > >
> > > > > > > >  		kvm->ram_pagesize = blk_size;
> > > > > > > > -		return mmap_hugetlbfs(kvm, htlbfs_path, size, blk_size);
> > > > > > > >  	} else {
> > > > > > > >  		kvm->ram_pagesize = getpagesize();
> > > > > > > > -		return mmap(NULL, size, PROT_RW, MAP_ANON_NORESERVE, -1, 0);
> > > > > > > >  	}
> > > > > > > > +
> > > > > > > > +	fd = memfd_alloc(size, htlbfs_path, blk_size);
> > > > > > > > +	if (fd < 0)
> > > > > > > > +		return MAP_FAILED;
> > > > > > > > +
> > > > > > > > +	kvm->ram_fd = fd;
> > > > > > > > +	return mmap(NULL, size, PROT_RW, MAP_PRIVATE, kvm->ram_fd, 0);
> > > > > > > >  }
> > > > > > > > --
> > > > > > > > 2.38.1.431.g37b22c650d-goog
> > > > > > > >
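
P.S. To illustrate what I mean above by the fd being the canonical
reference to guest memory, here is a minimal standalone sketch (not
kvmtool code; the names and the size are only illustrative). The memfd
is created once and stays the same, while a mapping of it can be torn
down and recreated, possibly at a different address, without affecting
the backing memory:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t size = 1ULL << 21;	/* 2 MiB of "guest RAM" */
	void *addr;

	/* The fd is the stable, canonical handle to the memory. */
	int fd = memfd_create("guest-ram", 0);
	if (fd < 0 || ftruncate(fd, size) < 0)
		return 1;

	/* A mapping of it can come and go, and may land at a new address... */
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;
	munmap(addr, size);

	/* ...while the same fd keeps describing the same guest memory. */
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 1;
	printf("guest ram fd %d mapped at %p\n", fd, addr);

	munmap(addr, size);
	close(fd);
	return 0;
}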