On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
> >
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > ============
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > >
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > >
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> >
> > mce_register_decode_chain() can be used. MCE physical address(p->mce_addr)
> > includes host key id in addition to real physical address. By searching used
> > hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> > yes, send SIGBUS.
> >
> > kvm_machine_check() can be enhanced for KVM specific use. This is before
> > memory_failure() is called, though.
> >
> > any other ideas?
>
> That's too KVM-centric. It will not work for other possible user of
> restricted memfd.
>
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
>
> The problem is that core-mm code that handles memory failure knows nothing
> about restricted memfd. It only sees that the page belongs to a normal
> memfd.
>
> AFAICS, there's no way to get it intercepted from the shim level. shmem
> code has to be patches. shmem_error_remove_page() has to call into
> restricted memfd code.
>
> Hugh, are you okay with this? Or maybe you have a better idea?

Okay, here is what I've come up with. It doesn't touch shmem code, but
hooks up directly into memory-failure.c. It is still ugly, but should be
tolerable.

restrictedmem_error_page() loops over all restrictedmem inodes. It is
slow, but memory failure is not hot path (I hope).

Only build-tested. Chao, could you hook up ->error for KVM and get it
tested?

diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
index 9c37c3ea3180..c2700c5daa43 100644
--- a/include/linux/restrictedmem.h
+++ b/include/linux/restrictedmem.h
@@ -12,6 +12,8 @@ struct restrictedmem_notifier_ops {
                                  pgoff_t start, pgoff_t end);
         void (*invalidate_end)(struct restrictedmem_notifier *notifier,
                                pgoff_t start, pgoff_t end);
+        void (*error)(struct restrictedmem_notifier *notifier,
+                      pgoff_t start, pgoff_t end);
 };
 
 struct restrictedmem_notifier {
@@ -34,6 +36,8 @@ static inline bool file_is_restrictedmem(struct file *file)
         return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
 }
 
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
 #else
 
 static inline void restrictedmem_register_notifier(struct file *file,
@@ -57,6 +61,11 @@ static inline bool file_is_restrictedmem(struct file *file)
         return false;
 }
 
+static inline void restrictedmem_error_page(struct page *page,
+                                            struct address_space *mapping)
+{
+}
+
 #endif /* CONFIG_RESTRICTEDMEM */
 
 #endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e7ac570dda75..ee85e46c6992 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -939,6 +940,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
                 goto out;
         }
 
+        restrictedmem_error_page(p, mapping);
+
         /*
          * The shmem page is kept in page cache instead of truncating
          * so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..0dcdff0d8055 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -29,6 +29,18 @@ static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
         mutex_unlock(&data->lock);
 }
 
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+                                         pgoff_t start, pgoff_t end)
+{
+        struct restrictedmem_notifier *notifier;
+
+        mutex_lock(&data->lock);
+        list_for_each_entry(notifier, &data->notifiers, list) {
+                notifier->ops->error(notifier, start, end);
+        }
+        mutex_unlock(&data->lock);
+}
+
 static int restrictedmem_release(struct inode *inode, struct file *file)
 {
         struct restrictedmem_data *data = inode->i_mapping->private_data;
@@ -248,3 +260,30 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
         return 0;
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+        struct super_block *sb = restrictedmem_mnt->mnt_sb;
+        struct inode *inode, *next;
+
+        if (!shmem_mapping(mapping))
+                return;
+
+        spin_lock(&sb->s_inode_list_lock);
+        list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+                struct restrictedmem_data *data = inode->i_mapping->private_data;
+                struct file *memfd = data->memfd;
+
+                if (memfd->f_mapping == mapping) {
+                        pgoff_t start, end;
+
+                        spin_unlock(&sb->s_inode_list_lock);
+
+                        start = page->index;
+                        end = start + thp_nr_pages(page);
+                        restrictedmem_notifier_error(data, start, end);
+                        return;
+                }
+        }
+        spin_unlock(&sb->s_inode_list_lock);
+}

-- 
Kiryl Shutsemau / Kirill A. Shutemov
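
P.S. To illustrate what the requested KVM-side hookup could look like, below
is only a rough sketch of an ->error callback, not code from the series: the
handler name is made up, slot->kvm, slot->restricted_offset and the embedded
'notifier' member are assumed to exist per the shape of the rest of the
patchset, and the zap simply reuses the existing x86 kvm_zap_gfn_range()
helper so that the next guest access refaults instead of hitting the poisoned
page again.

/*
 * Hypothetical sketch only: slot->kvm, slot->restricted_offset and the
 * embedded 'notifier' member are assumptions; kvm_zap_gfn_range() is the
 * existing x86 helper.
 */
static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
                                    pgoff_t start, pgoff_t end)
{
        struct kvm_memory_slot *slot = container_of(notifier,
                                                    struct kvm_memory_slot,
                                                    notifier);
        struct kvm *kvm = slot->kvm;
        pgoff_t base = slot->restricted_offset >> PAGE_SHIFT;
        gfn_t gfn_start, gfn_end;

        /* Intersect the poisoned file range with the range backing this slot. */
        if (start < base)
                start = base;
        if (end > base + slot->npages)
                end = base + slot->npages;
        if (start >= end)
                return;

        gfn_start = slot->base_gfn + (start - base);
        gfn_end = slot->base_gfn + (end - base);

        /*
         * Zap the private mappings so the next guest access refaults; the
         * fault path can then report the poisoned page (e.g. SIGBUS to the
         * vCPU thread) instead of letting the machine check recur.
         */
        kvm_zap_gfn_range(kvm, gfn_start, gfn_end);
}

The new .error member would then be filled in next to wherever the series
already registers its invalidate_start/invalidate_end callbacks.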