Re: [PATCH] drm/i915: Restrict pagefault disabling to just around copy_from_user()

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 18 Oct 2016 09:54:23 +0100

On 18/10/2016 09:38, Chris Wilson wrote:
On Tue, Oct 18, 2016 at 09:22:56AM +0100, Tvrtko Ursulin wrote:
On 18/10/2016 09:17, Chris Wilson wrote:
On Tue, Oct 18, 2016 at 09:01:30AM +0100, Tvrtko Ursulin wrote:
On 17/10/2016 15:10, Chris Wilson wrote:
When handling execbuf relocations, we play a delicate dance with
pagefault. We first try to access the user pages underneath our
struct_mutex. However, if those pages were inside a GEM object, we may
trigger a pagefault and deadlock as i915_gem_fault() tries to
recursively acquire struct_mutex. Instead, we choose to disable
pagefaulting around the copy_from_user whilst inside the struct_mutex
and handle the EFAULT by falling back to a copy outside the
struct_mutex.

We however presumed that disabling pagefaults would be expensive. It is
just an operation on the local current task. Cheap enough that we can
restrict the disable/enable to the critical section around the copy, and
so avoid having to handle the atomic sections within the relocation
handling itself.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
---
  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 64 +++++++++++++-----------------
  1 file changed, 28 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1d02e74ce62d..22342ad0e07f 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -551,20 +551,6 @@ repeat:
  	return 0;
  }
-static bool object_is_idle(struct drm_i915_gem_object *obj)
-{
-	unsigned long active = i915_gem_object_get_active(obj);
-	int idx;
-
-	for_each_active(active, idx) {
-		if (!i915_gem_active_is_idle(&obj->last_read[idx],
-					     &obj->base.dev->struct_mutex))
-			return false;
-	}
-
-	return true;
-}
-
  static int
  i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
  				   struct eb_vmas *eb,
@@ -648,10 +634,6 @@ i915_gem_execbuffer_relocate_entry(struct drm_i915_gem_object *obj,
  		return -EINVAL;
  	}
-	/* We can't wait for rendering with pagefaults disabled */
-	if (pagefault_disabled() && !object_is_idle(obj))
-		return -EFAULT;
-
  	ret = relocate_entry(obj, reloc, cache, target_offset);
  	if (ret)
  		return ret;
@@ -678,12 +660,23 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
  	remain = entry->relocation_count;
  	while (remain) {
  		struct drm_i915_gem_relocation_entry *r = stack_reloc;
-		int count = remain;
-		if (count > ARRAY_SIZE(stack_reloc))
-			count = ARRAY_SIZE(stack_reloc);
+		unsigned long unwritten;
+		unsigned int count;
+
+		count = min_t(unsigned int, remain, ARRAY_SIZE(stack_reloc));
  		remain -= count;
-		if (__copy_from_user_inatomic(r, user_relocs, count*sizeof(r[0]))) {
+		/* This is the fast path and we cannot handle a pagefault
+		 * whilst holding the struct mutex lest the user pass in the
+		 * relocations contained within a mmaped bo. For in such a case
+		 * we, the page fault handler would call i915_gem_fault() and
+		 * we would try to acquire the struct mutex again. Obviously
+		 * this is bad and so lockdep complains vehemently.
+		 */
+		pagefault_disable();
+		unwritten = __copy_from_user_inatomic(r, user_relocs, count*sizeof(r[0]));
+		pagefault_enable();
+		if (unwritten) {
  			ret = -EFAULT;
  			goto out;
  		}
@@ -695,11 +688,19 @@ i915_gem_execbuffer_relocate_vma(struct i915_vma *vma,
  			if (ret)
  				goto out;
-			if (r->presumed_offset != offset &&
-			    __put_user(r->presumed_offset,
-				       &user_relocs->presumed_offset)) {
-				ret = -EFAULT;
-				goto out;
+			if (r->presumed_offset != offset) {
+				/* Copying back to the user is allowed to fail.
+				 * The information passed back is a hint as
+				 * to the final location. If the copy_to_user
+				 * fails after a successful copy_from_user,
+				 * it must be a readonly location and so
+				 * we presume the user knows what they are
+				 * doing!
+				 */
+				pagefault_disable();
+				__put_user(r->presumed_offset,
+					   &user_relocs->presumed_offset);
+				pagefault_enable();
Why is a good idea to ignore potential errors here?
Wrong question: why did we think it a good idea to ignore success here?
How were we ignoring errors, I don't follow?

(a) it is safe to do so, and I can legitimately setup userspace to use
this
How what where? :)
igt, whereelse, has a test case to check we can execute from read only
memory.

With relocations?

  
(b) reporting an error after we have committed the change is broken
anyway.
You mean in the half way through cases?
We've already made the user/gpu visible change. Failing here is broken.
(User doesn't notice the change, presumed_offset doesn't match actual,
kernel is oblivious to mismatch, GPU hangs.)

How is that? I see a error path propagating all the way up to failing 
the execbuf.

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx