On 2/13/20 2:13 PM, Christian Borntraeger wrote: > > > On 13.02.20 20:56, Sean Christopherson wrote: >> On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote: >>> CC Marc Zyngier for KVM on ARM. Marc, see below. Will there be any >>> use for this on KVM/ARM in the future? >>> >>> CC Sean Christopherson/Tom Lendacky. Any obvious use case for Intel/AMD >>> to have a callback before a page is used for I/O? >From an SEV-SNP perspective, I don't think so. The SEV-SNP architecture uses page states and having the hypervisor change the state from beneath the guest might trigger the guest into thinking it's being attacked vs just allowing the I/O to fail. Is this a concern with flooding the console with I/O error messages? >> >> Yes? >> >>> Andrew (or other mm people) any chance to get an ACK for this change? >>> I could then carry that via s390 or KVM tree. Or if you want to carry >>> that yourself I can send an updated version (we need to kind of >>> synchronize that Linus will pull the KVM changes after the mm changes). >>> >>> Andrea asked if others would benefit from this, so here are some more >>> information about this (and I can also put this into the patch >>> description). So we have talked to the POWER folks. They do not use >>> the standard normal memory management, instead they have a hard split >>> between secure and normal memory. The secure memory is the handled by >>> the hypervisor as device memory and the ultravisor and the hypervisor >>> move this forth and back when needed. >>> >>> On s390 there is no *separate* pool of physical pages that are secure. >>> Instead, *any* physical page can be marked as secure or not, by >>> setting a bit in a per-page data structure that hardware uses to stop >>> unauthorized access. (That bit is under control of the ultravisor.) >>> >>> Note that one side effect of this strategy is that the decision >>> *which* secure pages to encrypt and then swap out is actually done by >>> the hypervisor, not the ultravisor. In our case, the hypervisor is >>> Linux/KVM, so we're using the regular Linux memory management scheme >>> (active/inactive LRU lists etc.) to make this decision. The advantage >>> is that the Ultravisor code does not need to itself implement any >>> memory management code, making it a lot simpler. >> >> Disclaimer: I'm not familiar with s390 guest page faults or UV. I tried >> to give myself a crash course, apologies if I'm way out in left field... >> >> AIUI, pages will first be added to a secure guest by converting a normal, >> non-secure page to secure and stuffing it into the guest page tables. To >> swap a page from a secure guest, arch_make_page_accessible() will be called >> to encrypt the page in place so that it can be accessed by the untrusted >> kernel/VMM and written out to disk. And to fault the page back in, on s390 >> a secure guest access to a non-secure page will generate a page fault with >> a dedicated type. That fault routes directly to >> do_non_secure_storage_access(), which converts the page to secure and thus >> makes it re-accessible to the guest. >> >> That all sounds sane and usable for Intel. >> >> My big question is the follow/get flows, more on that below. >> >>> However, in the end this is why we need the hook into Linux memory >>> management: once Linux has decided to swap a page out, we need to get >>> a chance to tell the Ultravisor to "export" the page (i.e., encrypt >>> its contents and mark it no longer secure). >>> >>> As outlined below this should be a no-op for anybody not opting in. >>> >>> Christian >>> >>> On 07.02.20 12:39, Christian Borntraeger wrote: >>>> From: Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx> >>>> >>>> With the introduction of protected KVM guests on s390 there is now a >>>> concept of inaccessible pages. These pages need to be made accessible >>>> before the host can access them. >>>> >>>> While cpu accesses will trigger a fault that can be resolved, I/O >>>> accesses will just fail. We need to add a callback into architecture >>>> code for places that will do I/O, namely when writeback is started or >>>> when a page reference is taken. >>>> >>>> Signed-off-by: Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx> >>>> Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx> >>>> --- >>>> include/linux/gfp.h | 6 ++++++ >>>> mm/gup.c | 2 ++ >>>> mm/page-writeback.c | 1 + >>>> 3 files changed, 9 insertions(+) >>>> >>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h >>>> index e5b817cb86e7..be2754841369 100644 >>>> --- a/include/linux/gfp.h >>>> +++ b/include/linux/gfp.h >>>> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { } >>>> #ifndef HAVE_ARCH_ALLOC_PAGE >>>> static inline void arch_alloc_page(struct page *page, int order) { } >>>> #endif >>>> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE >>>> +static inline int arch_make_page_accessible(struct page *page) >>>> +{ >>>> + return 0; >>>> +} >>>> +#endif >>>> >>>> struct page * >>>> __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, >>>> diff --git a/mm/gup.c b/mm/gup.c >>>> index 7646bf993b25..a01262cd2821 100644 >>>> --- a/mm/gup.c >>>> +++ b/mm/gup.c >>>> @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, >>>> page = ERR_PTR(-ENOMEM); >>>> goto out; >>>> } >>>> + arch_make_page_accessible(page); >> >> As Will pointed out, the return value definitely needs to be checked, there >> will undoubtedly be scenarios where the page cannot be made accessible. > > Actually onm s390 this should always succeed unless we have a bug. > > But we can certainly provide a variant of that patch that does check the return > value. > Proper error handling for gup and WARN_ON for pae-writeback. >> >> What is the use case for calling arch_make_page_accessible() in the follow() >> and gup() paths? Live migration is the only thing that comes to mind, and >> for live migration I would expect you would want to keep the secure guest >> running when copying pages to the target, i.e. use pre-copy. That would >> conflict with converting the page in place. Rather, migration would use a >> separate dedicated path to copy the encrypted contents of the secure page to >> a completely different page, and send *that* across the wire so that the >> guest can continue accessing the original page. >> Am I missing a need to do this for the swap/reclaim case? Or is there a >> completely different use case I'm overlooking? > > This is actually to protect the host against a malicious user space. For > example a bad QEMU could simply start direct I/O on such protected memory. > We do not want userspace to be able to trigger I/O errors and thus we > implemented the logic to "whenever somebody accesses that page (gup) or > doing I/O, make sure that this page can be accessed. When the guest tries > to access that page we will wait in the page fault handler for writeback to > have finished and for the page_ref to be the expected value. So in this case, when the guest tries to access the page, the page may now be corrupted because I/O was allowed to be done to it? Or will the I/O have been blocked in some way, but without generating the I/O error? Thanks, Tom > > > >> >> Tangentially related, hooks here could be quite useful for sanity checking >> the kernel/KVM and/or debugging kernel/KVM bugs. Would it make sense to >> pass a param to arch_make_page_accessible() to provide some information as >> to why the page needs to be made accessible? > > Some kind of enum that can be used optionally to optimize things? > >> >>>> } >>>> if (flags & FOLL_TOUCH) { >>>> if ((flags & FOLL_WRITE) && >>>> @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, >>>> >>>> VM_BUG_ON_PAGE(compound_head(page) != head, page); >>>> >>>> + arch_make_page_accessible(page); >>>> SetPageReferenced(page); >>>> pages[*nr] = page; >>>> (*nr)++; >>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c >>>> index 2caf780a42e7..0f0bd14571b1 100644 >>>> --- a/mm/page-writeback.c >>>> +++ b/mm/page-writeback.c >>>> @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write) >>>> inc_lruvec_page_state(page, NR_WRITEBACK); >>>> inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); >>>> } >>>> + arch_make_page_accessible(page); >>>> unlock_page_memcg(page); >>> >>> As outlined by Ulrich, we can move the callback after the unlock. >>> >>>> return ret; >>>> >>>> >>> >