Re: [PATCH v4] x86/sgx: Fix the call order of synchronize_srcu() in sgx_release()

Dave Hansen <dave.hansen@xxxxxxxxx> · Wed, 20 Jan 2021 17:19:44 -0800

On 1/20/21 4:29 PM, Jarkko Sakkinen wrote:
> On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
>> On Fri, Jan 15, 2021, jarkko@xxxxxxxxxx wrote:
>>> From: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
>>>
>>> The most trivial example of a race condition can be demonstrated with this
>>> example where mm_list contains just one entry:
>>>
>>> CPU A                   CPU B
>>> sgx_release()
>>>                         sgx_mmu_notifier_release()
>>>                         list_del_rcu()
>>> sgx_encl_release()
>>>                         synchronize_srcu()
>>> cleanup_srcu_struct()
>>>
>>> To fix this, call synchronize_srcu() before checking whether mm_list is
>>> empty in sgx_release().
>> Why haven't you included the splat that Haitao provided?  That would go a long
>> way to helping answer Boris' question about exactly what is broken...
> I've lost the klog.

Haitao said he thought it was this:

> void cleanup_srcu_struct(struct srcu_struct *ssp)
> {
...
>         if (WARN_ON(srcu_readers_active(ssp)))
>                 return; /* Just leak it! */

Which would indicate that an 'encl' kref reached 0 while some other
thread was inside a

        idx = srcu_read_lock(&encl->srcu);
	...
	srcu_read_unlock(&encl->srcu, idx);

critical section.  A quick audit didn't turn up any obvious suspects,
though.

If that *is* it, it might be nice to try to catch the culprit at
srcu_read_{un}lock() time.  If there's ever a 0 refcount at those sites,
it would be nice to spew a warning:

        idx = srcu_read_lock(&encl->srcu);
	WARN_ON(!atomic_read(&encl->refcount->refcount);
	...
	WARN_ON(!atomic_read(&encl->refcount->refcount);
	srcu_read_unlock(&encl->srcu, idx);

at each site.