Re: [RFC PATCH 0/4] SGX shmem backing store issue

Dave Hansen <dave.hansen@xxxxxxxxx> · Thu, 28 Apr 2022 14:12:34 -0700

On 4/28/22 13:11, Reinette Chatre wrote:
> ELDU returned 1073741837 (0x4000000d)
> WARNING: CPU: 72 PID: 24407 at arch/x86/kernel/cpu/sgx/encl.c:81 sgx_encl_eldu+0x3cf/0x400
> ...
> Call Trace:
> <TASK>
> ? xa_load+0x6e/0xa0
> __sgx_encl_load_page+0x3d/0x80
> sgx_encl_load_page_in_vma+0x4a/0x60
> sgx_vma_fault+0x7f/0x3b0

First of all, thanks for all the work to narrow this down.

It sounds like there are probably at least two failure modes at play here:

	1. shmem_read_mapping_page_gfp() is called to retrieve an
	   existing page, but an empty one is allocated instead.  ELDU
	   fails on the empty page.  This one should be fixed by patch
	   4/4.
	2. shmem_read_mapping_page_gfp() actually finds a page, but it
	   still fails ELDU.

Is that right?

If so, I'd probably delve deeper into what the page and the PCMD look
like.  I usually go after these kinds of things with tracing.  I'd
probably dump some representation of the PCMD and page contents with
trace_printk().  Dump them at __sgx_encl_ewb() time, then also
dump them where the warning is being hit.  Pair the warning with a
tracing_off().

// A crude checksum: add up the page as an array of u64 words.
u64 sum_page(u64 *page)
{
	u64 ret = 0;
	int i;

	for (i = 0; i < PAGE_SIZE/sizeof(u64); i++)
		ret += page[i];

	return ret;
}

Then, logically something like this:

	trace_printk("bad ELDU on shm page: %lx sum: %llx pcmd: %x...\n",
		page_to_pfn(shm_page), sum_page(page_kmap),
		&pcmd, ...);

Both at EWB time and ELDU time.  Let's see if the pages that are coming
out of shmem are the same as the ones that were put in.
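
To make the comparison concrete, here is a rough userspace sketch of the same idea, not kernel code: record a checksum when the page goes out, recompute it when the page comes back, and flag a mismatch.  Everything prefixed with sketch_ is made up for illustration, and the 4 KiB page size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4 KiB pages */
#define SKETCH_WORDS (SKETCH_PAGE_SIZE / sizeof(uint64_t))

/* Same crude additive checksum as sum_page(), on a userspace buffer. */
static uint64_t sketch_sum_page(const uint64_t *page)
{
	uint64_t ret = 0;
	size_t i;

	for (i = 0; i < SKETCH_WORDS; i++)
		ret += page[i];

	return ret;
}

/* Hypothetical record of what would be traced for one page at EWB time. */
struct sketch_ewb_record {
	unsigned long pfn;	/* key to search the trace for later */
	uint64_t sum;		/* checksum of the page contents at EWB */
};

static struct sketch_ewb_record sketch_record_ewb(unsigned long pfn,
						  const uint64_t *page)
{
	struct sketch_ewb_record rec = { .pfn = pfn, .sum = sketch_sum_page(page) };

	return rec;
}

/* Returns 1 if the page seen at ELDU time sums to what EWB recorded. */
static int sketch_eldu_matches(const struct sketch_ewb_record *rec,
			       const uint64_t *page)
{
	return rec->sum == sketch_sum_page(page);
}
```

An all-zero page sums to 0, so if failure mode 1 hands back a zero-filled page, it should stand out in the trace as a zero sum at ELDU time against a nonzero one at EWB time.  The sum is order-insensitive, so it would miss words being shuffled within a page, but that is probably fine for a first pass.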

When you hit the warning, tracing should turn itself off.  Then, you can
just grep through the trace for that same pfn.