Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address

Dave Hansen <dave.hansen@xxxxxxxxx> · Thu, 8 Jul 2021 09:16:21 -0700

Oh, here's the THP code.  The subject just changed.

On 7/7/21 11:35 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally, a write from the host goes through the
> RMP check. When the host writes to pages, hardware checks the following
> conditions at the end of page walk:
> 
> 1. Assigned bit in the RMP table is zero (i.e page is shared).
> 2. If the page table entry that gives the sPA indicates that the target
>    page size is a large page, then all RMP entries for the 4KB
>    constituting pages of the target must have the assigned bit 0.
> 3. Immutable bit in the RMP table is not zero.
> 
> The hardware will raise page fault if one of the above conditions is not
> met. Try resolving the fault instead of taking fault again and again. If
> the host attempts to write to the guest private memory then send the
> SIGBUG signal to kill the process. If the page level between the host and

"SIGBUG"?

> RMP entry does not match, then split the address to keep the RMP and host
> page levels in sync.

> ---
>  arch/x86/mm/fault.c | 69 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h  |  6 +++-
>  mm/memory.c         | 13 +++++++++
>  3 files changed, 87 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 195149eae9b6..cdf48019c1a7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1281,6 +1281,58 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  }
>  NOKPROBE_SYMBOL(do_kern_addr_fault);
>  
> +#define RMP_FAULT_RETRY		0
> +#define RMP_FAULT_KILL		1
> +#define RMP_FAULT_PAGE_SPLIT	2
> +
> +static inline size_t pages_per_hpage(int level)
> +{
> +	return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +static int handle_user_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
> +{
> +	unsigned long pfn, mask;
> +	int rmp_level, level;
> +	struct rmpentry *e;
> +	pte_t *pte;
> +
> +	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
> +		return RMP_FAULT_KILL;

Shouldn't this be a WARN_ON_ONCE()?  How can we get RMP faults without
SEV-SNP?

> +	/* Get the native page level */
> +	pte = lookup_address_in_mm(current->mm, address, &level);
> +	if (unlikely(!pte))
> +		return RMP_FAULT_KILL;

What would this mean?  There was an RMP fault on a non-present page?
How could that happen?  What if there was a race between an unmapping
event and the RMP fault delivery?

> +	pfn = pte_pfn(*pte);
> +	if (level > PG_LEVEL_4K) {
> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> +		pfn |= (address >> PAGE_SHIFT) & mask;
> +	}

This looks inherently racy.  What happens if there are two parallel RMP
faults on the same 2M page.  One of them splits the page tables, the
other gets a fault for an already-split page table.

Is that handled here somehow?

> +	/* Get the page level from the RMP entry. */
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
> +	if (!e)
> +		return RMP_FAULT_KILL;

The snp_lookup_page_in_rmptable() failure cases looks WARN-worthly.
Either you're doing a lookup for something not *IN* the RMP table, or
you don't support SEV-SNP, in which case you shouldn't be in this code
in the first place.

> +	/*
> +	 * Check if the RMP violation is due to the guest private page access.
> +	 * We can not resolve this RMP fault, ask to kill the guest.
> +	 */
> +	if (rmpentry_assigned(e))
> +		return RMP_FAULT_KILL;

No "We's", please.  Speak in imperative voice.

> +	/*
> +	 * The backing page level is higher than the RMP page level, request
> +	 * to split the page.
> +	 */
> +	if (level > rmp_level)
> +		return RMP_FAULT_PAGE_SPLIT;

This can theoretically trigger on a hugetlbfs page.  Right?

I thought I asked about this before... more below...

> +	return RMP_FAULT_RETRY;
> +}
> +
>  /*
>   * Handle faults in the user portion of the address space.  Nothing in here
>   * should check X86_PF_USER without a specific justification: for almost
> @@ -1298,6 +1350,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
>  	vm_fault_t fault;
> +	int ret;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
>  
>  	tsk = current;
> @@ -1378,6 +1431,22 @@ void 
(struct pt_regs *regs,
>  	if (error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +	/*
> +	 * If its an RMP violation, try resolving it.
> +	 */
> +	if (error_code & X86_PF_RMP) {
> +		ret = handle_user_rmp_page_fault(error_code, address);
> +		if (ret == RMP_FAULT_PAGE_SPLIT) {
> +			flags |= FAULT_FLAG_PAGE_SPLIT;
> +		} else if (ret == RMP_FAULT_KILL) {
> +			fault |= VM_FAULT_SIGBUS;
> +			do_sigbus(regs, error_code, address, fault);
> +			return;
> +		} else {
> +			return;
> +		}
> +	}

Why not just have handle_user_rmp_page_fault() return a VM_FAULT_* code
directly?

I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus()
call later on in the function do its work.

>  	 * Faults in the vsyscall page might need emulation.  The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 322ec61d0da7..211dfe5d3b1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -450,6 +450,8 @@ extern pgprot_t protection_map[16];
>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the
> + *  region to smaller page size and retry.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -481,6 +483,7 @@ enum fault_flag {
>  	FAULT_FLAG_REMOTE =		1 << 7,
>  	FAULT_FLAG_INSTRUCTION =	1 << 8,
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
> +	FAULT_FLAG_PAGE_SPLIT =		1 << 10,
>  };
>  
>  /*
> @@ -520,7 +523,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
>  	{ FAULT_FLAG_USER,		"USER" }, \
>  	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
>  	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
> -	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
> +	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
> +	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
>  
>  /*
>   * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/mm/memory.c b/mm/memory.c
> index 730daa00952b..aef261d94e33 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4407,6 +4407,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> +	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> +		return VM_FAULT_SIGBUS;
> +
> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +	return 0;
> +}

What will this do when you hand it a hugetlbfs page?