On Sun, Mar 17, 2019 at 11:14:45PM +0200, Jarkko Sakkinen wrote:
> Intel Software Guard eXtensions (SGX) is a set of CPU instructions that
> can be used by applications to set aside private regions of code and
> data. The code outside the enclave is disallowed to access the memory
> inside the enclave by the CPU access control.
>
> This commit adds the Linux SGX Enclave Driver that provides an ioctl API
> to manage enclaves. The address range for an enclave, commonly referred
> as ELRANGE in the documentation (e.g. Intel SDM), is reserved with
> mmap() against /dev/sgx. After that a set ioctls is used to build
> the enclave to the ELRANGE.

...

> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> new file mode 100644
> index 000000000000..bd8bcd748976
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/encl.c

...

> +/**
> + * sgx_encl_next_mm() - Iterate to the next mm
> + * @encl: an enclave
> + * @mm: an mm list entry
> + * @iter: iterator status
> + *
> + * Return: the enclave mm or NULL
> + */
> +struct sgx_encl_mm *sgx_encl_next_mm(struct sgx_encl *encl,
> +				     struct sgx_encl_mm *mm, int *iter)
> +{
> +	struct list_head *entry;
> +
> +	WARN(!encl, "%s: encl is NULL", __func__);
> +	WARN(!iter, "%s: iter is NULL", __func__);
> +
> +	spin_lock(&encl->mm_lock);
> +
> +	entry = mm ? mm->list.next : encl->mm_list.next;
> +	WARN(!entry, "%s: entry is NULL", __func__);
> +
> +	if (entry == &encl->mm_list) {
> +		mm = NULL;
> +		*iter = SGX_ENCL_MM_ITER_DONE;
> +		goto out;
> +	}
> +
> +	mm = list_entry(entry, struct sgx_encl_mm, list);
> +
> +	if (!kref_get_unless_zero(&mm->refcount)) {
> +		*iter = SGX_ENCL_MM_ITER_RESTART;
> +		mm = NULL;
> +		goto out;
> +	}
> +
> +	if (!atomic_add_unless(&mm->mm->mm_count, 1, 0)) {

This is a use-after-free scenario if mm_count == 0.  Once the count goes
to zero, __mmdrop() begins, at which point this code is racing against
free_mm().  What you want here (or rather, in flows where mm != current->mm
and you want to access PTEs) is mmget_not_zero(), i.e. "unless zero" on
mm_users.  mm_count prevents the mm_struct itself from being freed, but
doesn't protect the page tables.  mm_users protects the page tables, i.e.
lets us safely call sgx_encl_test_and_clear_young() in the reclaimer.

To ensure liveness of the mm itself, register an mmu_notifier for each
mm_struct (I think in sgx_vma_open()).  The notifier's .release callback
would then delete the mm from the enclave's list and drop its reference
(exit_mmap() holds a reference to mm_count, so it's safe to do mmdrop()
in the .release callback).  E.g.:

static void sgx_vma_open(struct vm_area_struct *vma)
{
	...

	rcu_read_lock();
	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
		if (vma->vm_mm == tmp->mm) {
			encl_mm = tmp;
			break;
		}
	}
	rcu_read_unlock();

	if (!encl_mm) {
		encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
		if (!encl_mm)
			goto error;

		encl_mm->encl = encl;
		encl_mm->mm = vma->vm_mm;

		/* the notifier lives in the per-mm entry, see .release below */
		if (mmu_notifier_register(&encl_mm->mmu_notifier, vma->vm_mm)) {
			kfree(encl_mm);
			goto error;
		}

		spin_lock(&encl->mm_lock);
		/* readers walk the list under RCU, so add with _rcu */
		list_add_rcu(&encl_mm->list, &encl->mm_list);
		spin_unlock(&encl->mm_lock);
	}

	...

error:
	<not sure what should go here if we don't kill the enclave>
}

static void sgx_encl_mmu_release(struct mmu_notifier *mn,
				 struct mm_struct *mm)
{
	struct sgx_encl_mm *encl_mm =
		container_of(mn, struct sgx_encl_mm, mmu_notifier);

	spin_lock(&encl_mm->encl->mm_lock);
	list_del_rcu(&encl_mm->list);
	spin_unlock(&encl_mm->encl->mm_lock);

	synchronize_rcu();

	mmdrop(mm);
}

Alternatively, sgx_encl_mmu_release() could mark the encl_mm as dead
instead of removing it from the list, but I don't think that'd mesh well
with an RCU list, i.e. we'd need a regular lock-protected list and a
custom walker.

The only downside with the RCU approach that I can think of is that the
encl_mm would stay on the enclave's list until the enclave or the mm
itself died.  That could result in unnecessary IPIs during reclaim (or
invalidation), but that seems like a minor corner case that could be
avoided in userspace, e.g. don't mmap() an enclave unless you actually
plan on running it.

> +		kref_put(&mm->refcount, sgx_encl_release_mm);
> +		mm = NULL;
> +		*iter = SGX_ENCL_MM_ITER_RESTART;
> +		goto out;
> +	}
> +
> +	*iter = SGX_ENCL_MM_ITER_NEXT;
> +
> +out:
> +	spin_unlock(&encl->mm_lock);
> +	return mm;
> +}
>
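For reference, a minimal sketch of how the mmget_not_zero() suggestion
could slot into the quoted sgx_encl_next_mm() hunk, keeping the patch's
structure and only swapping the refcounting.  This is untested and assumes
the eventual consumer of the iterator pairs it with mmput() once it is
done walking the PTEs:

	/*
	 * Pin mm_users rather than mm_count: a non-zero mm_users keeps
	 * the page tables alive, which is what the reclaimer needs in
	 * order to call sgx_encl_test_and_clear_young() safely.  If the
	 * mm is already exiting, skip it and make the caller restart
	 * the walk.
	 */
	if (!mmget_not_zero(mm->mm)) {
		kref_put(&mm->refcount, sgx_encl_release_mm);
		mm = NULL;
		*iter = SGX_ENCL_MM_ITER_RESTART;
		goto out;
	}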