Re: [PATCH v5 2/3] x86/sgx: Fine grained SGX MCA behavior for virtualization

Dave Hansen <dave.hansen@xxxxxxxxx> · Thu, 21 Jul 2022 09:54:39 -0700

On 6/22/22 02:37, Zhiquan Li wrote:
> When VM guest access a SGX EPC page with memory failure, current
> behavior will kill the guest, expected only kill the SGX application
> inside it.

Can we please clean this up?  This is generally readable, but _hard_ to
read.  Perhaps:

	Today, if a guest accesses an SGX EPC page with memory failure,
	the kernel will behavior will kill the entire guest.  This blast
	radius is too large.  It would be idea to kill only the SGX
	application inside the guest.

> To fix it we send SIGBUS with code BUS_MCEERR_AR and some extra

	    ^ No "we's".

> information for hypervisor to inject #MC information to guest, which is
> helpful in SGX case.

To fix this, send a SIGBUS to host userspace (like QEMU) which can
follow up by injecting a #MC to the guest.

> The rest of things are guest side. Currently the hypervisor like Qemu
> already has mature facility to convert HVA to GPA and inject #MC to
> the guest OS.
> 
> Unlike host enclaves, virtual EPC instance cannot be shared by multiple
> VMs.  It is because how enclaves are created is totally up to the guest.
> Sharing virtual EPC instance will be very likely to unexpectedly break
> enclaves in all VMs.

I'm not sure why this is here or why it is important to this patch.

> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork().  However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.

> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index ab4ec54bbdd9..4507c2302348 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -715,6 +715,8 @@ int arch_memory_failure(unsigned long pfn, int flags)
>  	struct sgx_epc_page *page = sgx_paddr_to_page(pfn << PAGE_SHIFT);
>  	struct sgx_epc_section *section;
>  	struct sgx_numa_node *node;
> +	unsigned long vaddr;
> +	int ret;
>  
>  	/*
>  	 * mm/memory-failure.c calls this routine for all errors
> @@ -731,8 +733,26 @@ int arch_memory_failure(unsigned long pfn, int flags)
>  	 * error. The signal may help the task understand why the
>  	 * enclave is broken.
>  	 */
> -	if (flags & MF_ACTION_REQUIRED)
> -		force_sig(SIGBUS);
> +	if (flags & MF_ACTION_REQUIRED) {
> +		/*
> +		 * Provide extra info to the task so that it can make further
> +		 * decision but not simply kill it. This is quite useful for
> +		 * virtualization case.
> +		 */
> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
> +			/*
> +			 * The "owner" field is repurposed as the virtual address
> +			 * of virtual EPC page.
> +			 */
> +			vaddr = (unsigned long)page->owner & PAGE_MASK;

I really don't like repurposing page->owner like this.  It requires
casting on *both* sides of a type that we have full control over.

	struct sgx_epc_page {
	        unsigned int section;
	        u16 flags;
	        u16 poison;
		union {
		        struct sgx_encl_page *encl_owner;
			// Use when SGX_EPC_PAGE_KVM_GUEST
			// set in ->flags:
		        void __user *vepc_vaddr;
		};
	        struct list_head list;
	};

There is zero reason to play casting games instead of doing that ^

> +			ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)vaddr,
> +					PAGE_SHIFT);
> +			if (ret < 0)
> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
> +					current->comm, current->pid, ret);
> +		} else
> +			force_sig(SIGBUS);
> +	}