On 1/20/21 4:29 PM, Jarkko Sakkinen wrote: > On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote: >> On Fri, Jan 15, 2021, jarkko@xxxxxxxxxx wrote: >>> From: Jarkko Sakkinen <jarkko@xxxxxxxxxx> >>> >>> The most trivial example of a race condition can be demonstrated with this >>> example where mm_list contains just one entry: >>> >>> CPU A CPU B >>> sgx_release() >>> sgx_mmu_notifier_release() >>> list_del_rcu() >>> sgx_encl_release() >>> synchronize_srcu() >>> cleanup_srcu_struct() >>> >>> To fix this, call synchronize_srcu() before checking whether mm_list is >>> empty in sgx_release(). >> Why haven't you included the splat that Haitao provided? That would go a long >> way to helping answer Boris' question about exactly what is broken... > I've lost the klog. Haitao said he thought it was this: > void cleanup_srcu_struct(struct srcu_struct *ssp) > { ... > if (WARN_ON(srcu_readers_active(ssp))) > return; /* Just leak it! */ Which would indicate that an 'encl' kref reached 0 while some other thread was inside a idx = srcu_read_lock(&encl->srcu); ... srcu_read_unlock(&encl->srcu, idx); critical section. A quick audit didn't turn up any obvious suspects, though. If that *is* it, it might be nice to try to catch the culprit at srcu_read_{un}lock() time. If there's ever a 0 refcount at those sites, it would be nice to spew a warning: idx = srcu_read_lock(&encl->srcu); WARN_ON(!atomic_read(&encl->refcount->refcount); ... WARN_ON(!atomic_read(&encl->refcount->refcount); srcu_read_unlock(&encl->srcu, idx); at each site.