Re: [RFC PATCH 0/4] SGX shmem backing store issue

Reinette Chatre <reinette.chatre@xxxxxxxxx> · Mon, 9 May 2022 10:09:24 -0700

Hi Jarkko,

On 5/7/2022 10:48 AM, Jarkko Sakkinen wrote:
> On Sat, May 07, 2022 at 08:46:50PM +0300, Jarkko Sakkinen wrote:
>> On Mon, May 02, 2022 at 02:33:06PM -0700, Dave Hansen wrote:
>>> On 5/2/22 10:11, Reinette Chatre wrote:
>>>> My goal was to prevent the page fault handler from doing a "is the PCMD
>>>> page empty" if the reclaimer has a reference to it. Even if the PCMD page
>>>> is empty when the page fault handler checks it the page is expected to
>>>> get data right when reclaimer can get the mutex back since the reclaimer
>>>> already has a reference to the page.
>>>
>>> I think shmem_truncate_range() might be the wrong operation.  It
>>> destroys data and, in the end, we don't want to destroy data.
>>> Filesystems and the page cache already have nice ways to keep from
>>> destroying data, we just need to use them.
>>>
>>> First, I think set_page_dirty(pcmd_page) before the EWB is a good start.
>>>  That's what filesystems do before important data that needs to be saved
>>> goes into pages.
>>>
>>> Second, I think we need behavior like POSIX_FADV_DONTNEED, not
>>> FALLOC_FL_PUNCH_HOLE.  The DONTNEED operations end up in
>>> mapping_evict_folio(), which has both page refcount *and* dirty page
>>> checks.  That means that either elevating a refcount or set_page_dirty()
>>> will thwart DONTNEED-like behavior.
>>>
>>> There are two basic things we need to do:
>>>
>>> 1. Prevent page from being truncated around EWB time
>>> 2. Prevent unreferenced page with all zeros from staying in shmem
>>>    forever or taking up swap space
>>>
>>> On the EWB (write to PCMD side) I think something like this works:
>>>
>>> 	sgx_encl_get_backing()
>>> 		get_page(pcmd_page)
>>>
>>> 	...
>>> 	lock_page(pcmd_page);
>>> 	// check for truncation since sgx_encl_get_backing()
>>> 	if (pcmd_page->mapping != shmem)
>>> 		goto retry;
>>> 	 // double check this is OK under lock_page():
>>> 	set_page_dirty(pcmd_page);
>>> 	__sgx_encl_ewb();
>>> 	unlock_page(pcmd_page);
>>>
>>> That's basically what filesystems do.  Get the page from the page cache,
>>> lock it, then make sure it is consistent.  If not, retry.
>>>
>>> On the "free" / read in (ELDU) side:
>>>
>>> 	// get pcmd_page ref
>>> 	lock_page(pcmd_page);
>>> 	// truncation is not a concern because that's only done
>>> 	// on the read-in side, here, where we hold encl->lock
>>>
>>> 	memset();
>>> 	if (!memchr_inv())
>>> 		// clear the way for DONTNEED:
>>> 		ClearPageDirty(pcmd_page);
>>> 	unlock_page(pcmd_page);
>>> 	// drop pcmd_page ref
>>> 	...
>>> 	POSIX_FADV_DONTNEED
>>>
>>> There's one downside to this: it's _possible_ that an transient
>>> get_page() would block POSIX_FADV_DONTNEED.  Then the zeroed page would
>>> stick around forever, or at least until the next ELDU operation did
>>> another memchr_inv().
>>>
>>> I went looking around for some of those and could not find any that I
>>> *know* apply to shmem.
>>>
>>> This doesn't feel like a great solution; it's more complicated than I
>>> would like.  Any other ideas?
>>
>> If we could do both truncation and swapping in one side, i.e. in ksgxd,
>> that would simplify this process a lot. Then the whole synchronization
>> problem would not exist.
>>
>> E.g. perhaps #PF handler could just zero PCMD and collect zeroed pages
>> indices to a list and ksgxd would truncate them.
> 
> I.e. instead of immediate response, go for lazy response that is taken 
> care by ksgxd.

Could you please elaborate how you envision this solution? From what I understand
there would be a per-enclave list that contains information about empty
PCMD pages intended to be truncated. The page fault handler adds pages
to this list and the reclaimer needs to remove pages from this list when
it writes to those pages and then do the actual truncation - but it is not
clear how the reclaimer will know when it can safely remove a page from the
list since it obtains PCMD page references in batches.

Did you get a chance to consider the fix proposed in
https://lore.kernel.org/linux-sgx/d4b52482-2dd0-d5f1-bda9-e1d97883298d@xxxxxxxxx/ 

I understand that the email thread may have become hard to follow and I plan
to submit a new series today.

Reinette