Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 8 Oct 2015 10:45:47 +0100

On 28/09/15 15:14, Daniel Vetter wrote:
On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
If the client revokes the virtual address it asked to be mapped into GPU
space via userptr (by using anything along the lines of mmap, mprotect,
madvise, munmap, ftruncate etc) the mmu notifier sends a range
invalidate command to userptr. Upon receiving the invalidation signal
for the revoked range, we try to release the struct pages we pinned into
the GTT. However, this can fail if any of the GPU's VMAs are pinned for
use by the hardware (i.e. despite the user's intention we cannot
relinquish the client's address range and keep up to date with whatever is
placed in there). Currently we emit a few WARNs so that we would notice
if this ever occurred in the wild; it has. Sadly this means we need to
replace those WARNs with the proper SIGBUS to the offending clients
instead.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Michał Winiarski <michal.winiarski@xxxxxxxxx>
---
  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
  1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index f75d90118888..efb404b9fe69 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
  		was_interruptible = dev_priv->mm.interruptible;
  		dev_priv->mm.interruptible = false;

-		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
-			int ret = i915_vma_unbind(vma);
-			WARN_ON(ret && ret != -EIO);
+		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
+			i915_vma_unbind(vma);
+		if (i915_gem_object_put_pages(obj)) {
+			struct task_struct *p;
+
+			DRM_ERROR("Unable to revoke ownership by userptr of"
+				  " invalidated address range, sending SIGBUS"
+				  " to attached clients.\n");
+
+			rcu_read_lock();
+			for_each_process(p) {
+				siginfo_t info;
+
+				/* This doesn't capture everyone who has
+				 * the pages pinned behind a VMA as we
+				 * do not have that tracking information
+				 * available, it does however kill the
+				 * original process (and siblings) who
+				 * created the userptr and presumably tried
+				 * to reuse the address space despite having
+				 * pinned it (possibly indirectly) to the hw.
+				 * Arguably, we don't even want to kill the
+				 * other processes as they are not at fault,
+				 * likely to be a display server, and hopefully
+				 * will release the pages in due course once
+				 * the client is dead.
+				 */
+				if (p->mm != obj->userptr.mm->mm)
+					continue;
+
+				info.si_signo = SIGBUS;
+				info.si_errno = 0;
+				info.si_code = BUS_ADRERR;
+				info.si_addr = (void __user *)obj->userptr.ptr;
+				force_sig_info(SIGBUS, &info, p);
+			}
+			rcu_read_unlock();

Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
any new users will hopefully get it, and abusing SIGBUS without the thread
actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
the most we can do here, I think; userspace being able to do this
is just a fundamental property of userptr.

It is not the bo that is at fault but the *client's* *address* *space*
that is being changed. It is equivalent to mmap on a truncated file i.e.
if the client tries to use its mmapping after it has truncated the file
it is scolded via SIGBUS.
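
For reference, the truncated-file analogy can be reproduced from userspace. The following is only a minimal sketch, assuming Linux and a writable /tmp; the helper name probe_truncated_mapping() and the temporary file name are made up for illustration and are not part of the patch. It maps one page of a file, truncates the file to zero, touches the mapping and records what the SIGBUS handler was told.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;
static volatile sig_atomic_t seen_code;
static void *volatile seen_addr;

/* Record what the kernel reported, then jump back out of the faulting access. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	seen_code = info->si_code;
	seen_addr = info->si_addr;
	siglongjmp(probe_env, 1);
}

/*
 * Hypothetical helper: returns 1 if touching the mapping after truncation
 * raised SIGBUS with BUS_ADRERR at the touched address, 0 if the access
 * did not fault, and -1 if the setup failed.
 */
int probe_truncated_mapping(void)
{
	char path[] = "/tmp/sigbus-demo-XXXXXX";
	struct sigaction sa, old;
	long page = sysconf(_SC_PAGESIZE);
	volatile char *map;
	int fd, result = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ftruncate(fd, page) < 0)
		goto out_close;
	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;
	map[0] = 1;	/* file still backs the page: no fault */

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) < 0)
		goto out_unmap;

	if (ftruncate(fd, 0) == 0) {
		if (sigsetjmp(probe_env, 1) == 0) {
			map[0] = 2;	/* page is now beyond EOF */
			result = 0;
		} else {
			result = seen_code == BUS_ADRERR &&
				 seen_addr == (void *)map;
		}
	}
	sigaction(SIGBUS, &old, NULL);
out_unmap:
	munmap((void *)map, (size_t)page);
out_close:
	close(fd);
	return result;
}
```

Note that in this case the signal is raised synchronously, in the thread that performed the access, which is the property the rest of the thread argues about.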

But existing SIGBUS is thread-bound and comes with the fault address
attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
catcher userspace has for other reasons might be supremely surprised by
this and do stupid things. Hence I don't think throwing SIGBUS here is
correct behaviour. And there doesn't seem to be anything else suitable
really.
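
To make the concern about existing SIGBUS catchers concrete, here is a rough sketch of the kind of check such a handler might perform. It is an assumption about how userspace could be written, not code from any real client; classify_sigbus() and the enum are hypothetical names. The handler trusts si_code and si_addr to decide whether the fault belongs to a mapping it manages.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

enum sigbus_kind {
	SIGBUS_EXPECTED_FAULT,	/* fault code, address inside our mapping */
	SIGBUS_FOREIGN_ADDRESS,	/* fault code, address we do not manage */
	SIGBUS_NOT_A_FAULT,	/* e.g. sent with kill(), si_code == SI_USER */
};

/*
 * Hypothetical classifier for a SIGBUS handler that manages one mapping
 * [start, start + len). It can only look at what siginfo_t carries.
 */
enum sigbus_kind classify_sigbus(const siginfo_t *info,
				 const void *start, size_t len)
{
	uintptr_t addr, lo;

	if (info->si_code != BUS_ADRERR &&
	    info->si_code != BUS_ADRALN &&
	    info->si_code != BUS_OBJERR)
		return SIGBUS_NOT_A_FAULT;

	addr = (uintptr_t)info->si_addr;
	lo = (uintptr_t)start;
	if (addr >= lo && addr - lo < len)
		return SIGBUS_EXPECTED_FAULT;

	return SIGBUS_FOREIGN_ADDRESS;
}
```

Since the patch fills in BUS_ADRERR and the userptr address, a handler like this has no way to tell the driver-sent signal from a fault taken by the current thread; it would classify it by address alone.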

The offending address is provided with the signal as far as I can see.

I think it is fine to do this, even required, since the alternative is
for the GPU to keep using random memory indefinitely while userspace never
gets to know.

And I don't see any reason to keep a process running that made such an
elementary and serious mistake.

Is the only concern that the process can catch it and not exit?

I am just not sure about the locking requirement for for_each_process 
since existing call sites give conflicting examples. I don't see how 
turning the preemption off can be safe without the tasklist lock but 
perhaps I am wrong, don't know.

Regards,

Tvrtko