Re: [PATCH] drm/ttm: remove fence_lock

Hey,

On 21-03-14 14:04, Thomas Hellstrom wrote:
On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
Hey,

On 21-03-14 09:27, Thomas Hellstrom wrote:
On 03/21/2014 12:55 AM, Dave Airlie wrote:
On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse@xxxxxxxxx>
wrote:
On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
On 18-10-12 13:55, Thomas Hellstrom wrote:
On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
On 18-10-12 13:02, Thomas Hellstrom wrote:
On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
Hey,

On 18-10-12 09:59, Thomas Hellstrom wrote:
On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
Hi, Maarten,

As you know, I have been having my doubts about this change.
To me it seems insane to be forced to read the fence pointer under a
reservation lock, simply because when you take the reserve lock, another
process may hold it, and there is a substantial chance that that process
will also be waiting for idle while holding the reserve lock.
I think it makes perfect sense; the only time you want to read the
fence is when you want to change the members protected by the
reservation.
No, that's not true. A typical case (or the only case)
is where you want to do a map with no_wait semantics. You will want
to be able to access a buffer's results even if the eviction code
is about to schedule an unbind from the GPU and has the buffer
reserved.
Well, either block on reserve or return -EBUSY if reserved; presumably
the former...

ttm_bo_vm_fault does the latter already, anyway.
ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem; it's
really a waiting reserve, but for different reasons. Typically a
user-space app will keep asynchronous maps to TTM during a buffer's
lifetime, and the buffer lifetime may be long, as user-space apps keep
caches.
That's not the same as accessing a buffer after the GPU is done with it.
Ah indeed.
You don't need to hold the reservation while performing the wait itself,
though. You could check whether ttm_bo_wait(no_wait_gpu = true) returns
-EBUSY, and if so take a reference to the sync_obj member and then wait
after unreserving. You won't reset the sync_obj member to NULL in that
case, but that should be harmless. This will allow you to keep
reservations fast and short. Maybe a helper would be appropriate for
this, since radeon and nouveau both seem to do this.
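
Roughly, such a helper might look like this (just a sketch; the exact
ttm_bo_wait() signature and the sync_obj_* driver callbacks are recalled
from memory and may differ between trees):

static int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo,
				  bool interruptible)
{
	struct ttm_bo_driver *driver = bo->bdev->driver;
	void *sync_obj = NULL;
	int ret;

	/* Check for idle without blocking on the GPU. */
	spin_lock(&bo->bdev->fence_lock);
	ret = ttm_bo_wait(bo, false, interruptible, true);
	if (ret == -EBUSY && bo->sync_obj)
		sync_obj = driver->sync_obj_ref(bo->sync_obj);
	spin_unlock(&bo->bdev->fence_lock);

	ttm_bo_unreserve(bo);
	if (!sync_obj)
		return ret;

	/* Wait for the GPU with the reservation already dropped. */
	ret = driver->sync_obj_wait(sync_obj, false, interruptible);
	driver->sync_obj_unref(&sync_obj);
	return ret;
}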

The problem is that as long as other users are waiting for idle with the
reservation held, for example to switch GPU engines or to delete a GPU
binding, waiting for reserve will in many cases mean waiting for the GPU.
This example sounds inefficient. I know nouveau can do this, but it
essentially just stalls the GPU entirely. I guess you mean functions
like nouveau_gem_object_close?
It wouldn't surprise me if performance in nouveau could be improved
simply by fixing those cases up instead, since they stall the
application completely for other uses too.

Please see the Nouveau cpu_prep patch as well.

While there are a number of cases that can be fixed up, also in Radeon,
there's no way around the fact that reservation is a heavyweight lock
that will be held while waiting for idle, particularly on simpler
hardware without support for fence ordering with barriers and/or
"semaphores" and without accelerated eviction.

As such, it is unsuitable for protecting read access to the fence
pointer. If the intention is to keep a single fence pointer, I think the
best solution is to allow reading the fence pointer outside reservation,
but to make sure this can be done atomically. If the intention is to
protect an array or list of fence pointers, I think a spinlock is the
better solution.

/Thomas
Just wanted to point out that, like Thomas, I am concerned about having
to have the object reserved when waiting on its associated fence. I fear
it will hurt us somehow.

I will try to spend a couple of days stress-testing your branch on
radeon, trying to see if it hurts performance in any way with current
use cases.

I've been trying to figure out what to do with Maarten's patches going
forward, since they are essential for all kinds of SoC people.

However, I'm still not 100% convinced that the problem is anything more
than a micro-optimisation, and possibly a premature one at that. This
needs someone to start suggesting benchmarks we can run and a driver set
to run them on; otherwise I'm starting to tend towards the view that we
are talking about an optimisation we can fix later.

The other option is to somehow merge this stuff behind an option that
allows us to test it using a command-line option, but I don't think that
is sane either.

So Maarten, Jerome, Thomas, please start discussing actual real-world
tests you think merging this code will affect, and then I can make a
better-informed decision.

Dave.
Dave,

This is IMHO a fundamental design discussion, not a micro-optimization.

I'm pretty sure all sorts of horrendous things could be done to the DRM
design without affecting real world application performance.

In TTM, data protection is primarily done with spinlocks. This serves
two purposes:

a) You can't unnecessarily hold a data-protection lock while waiting for
the GPU, which is typically a very stupid thing to do (perhaps not so in
this particular case), but when the sleep-while-atomic or lock-inversion
kernel warning turns up, that should at least ring a bell. Trying to
implement dma-buf fencing using the TTM fencing callbacks will, AFAICT,
cause a locking inversion.

b) Spinlocks essentially go away on UP systems. The whole reservation
path was essentially lock-free on UP systems pre-dma-buf integration,
and used very few atomic locking operations even on SMP systems. It was
all prompted by a test some years ago (by Eric Anholt, IIRC) showing
that locking operations turned up quite high in Intel driver profiling.

If we start protecting data with reservations, while we also export the
reservation locks so that people find them a convenient "serialize work
on this buffer" lock, all kinds of interesting things might happen, and
we might end up in a situation similar to protecting everything with
struct_mutex.

This is why I dislike this change, in particular when there is a very
simple remedy.

But if I can get an ACK to convert the reservation object fence
pointers, and the associated operations on them, to be RCU-safe when I
have some time left, I'd be prepared to accept this patch series in its
current state.
RCU could be a useful way to deal with this, but in my case I've shown
there are very few places where it's needed; core TTM does not need it
at all. This is why I wanted to leave it to the individual drivers to
implement.

I think kfree_rcu for the free in the driver itself should be enough,
and obtaining a reference in the driver would require something like
this, similar to what's in rcuref.txt:

rcu_read_lock();
f = rcu_dereference(bo->fence);
if (f && !kref_get_unless_zero(&f->kref))
	f = NULL;
rcu_read_unlock();

if (f) {
	/* do stuff here */
	...

	/* free fence */
	kref_put(&f->kref, fence_put_with_kfree_rcu);
}
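
The release callback named above could be as simple as the sketch below.
(Note the snippet above says f->kref for brevity; in the struct fence
from the diff below, the member is called refcount.)

static void fence_put_with_kfree_rcu(struct kref *kref)
{
	struct fence *fence = container_of(kref, struct fence, refcount);

	/* Defer the actual kfree() until a grace period has passed. */
	kfree_rcu(fence, rcu);
}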

Am I wrong, or is something like this enough to make sure the fence is
still alive?
No, you're correct.

And a bo check for idle would amount to this (since we don't care
whether the fence refcount is zero):

rcu_read_lock();
f = rcu_dereference(bo->fence);
signaled = !f || f->signaled;
rcu_read_unlock();

/Thomas
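
Wrapped up as a helper, that would be something like the sketch below
(bo->fence is schematic here, as above; with the struct fence from the
fence series the check would be fence_is_signaled(f) rather than an
f->signaled member):

static inline bool bo_is_idle(struct ttm_buffer_object *bo)
{
	struct fence *f;
	bool signaled;

	rcu_read_lock();
	f = rcu_dereference(bo->fence);
	/*
	 * No reference is needed: freeing the fence with kfree_rcu()
	 * keeps the memory valid for the duration of the read side.
	 */
	signaled = !f || fence_is_signaled(f);
	rcu_read_unlock();

	return signaled;
}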

This is very easy to implement when there is only one fence slot,
bo->fence being equal to reservation_object->fence_excl. It appears to
break down slightly when I implement it on top of shared fences.

My diff is against git://git.linaro.org/people/sumit.semwal/linux-3.x.git for-next-fences.
shared_max_fence is held in reservation_object to prevent the
reallocation in reservation_object_reserve_shared_fence every time the
same reservation_object gets more than 4 shared fences; if it happens
once, it's likely to happen again on the same object. Intended usage
from a driver is sketched below.
Anyhow, does the below look sane to you? This is compile-tested only; I
haven't checked whether it boots.
---
commit c73b87d33fd08f7c1b0a381b08b3626128f8f3b8
Author: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxx>
Date:   Tue Mar 25 15:10:21 2014 +0100

    add rcu to fence HACK WIP DO NOT USE KILLS YOUR POT PLANTS PETS AND FAMILY

diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 96338bf7f457..0a88c10b3db9 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,7 +134,10 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 {
 	struct dma_buf *dmabuf;
 	struct reservation_object *resv;
+	struct reservation_object_shared *shared;
+	struct fence *fence_excl;
 	unsigned long events;
+	unsigned shared_count;
 
 	dmabuf = file->private_data;
 	if (!dmabuf || !dmabuf->resv)
@@ -148,14 +151,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
-	ww_mutex_lock(&resv->lock, NULL);
+	rcu_read_lock();
 
-	if (resv->fence_excl && (!(events & POLLOUT) ||
-				 resv->fence_shared_count == 0)) {
+	shared = rcu_dereference(resv->shared);
+	fence_excl = rcu_dereference(resv->fence_excl);
+	shared_count = shared ? ACCESS_ONCE(shared->count) : 0;
+
+	if (fence_excl && (!(events & POLLOUT) ||
+				 (!shared || shared_count == 0))) {
 		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_excl;
 		unsigned long pevents = POLLIN;
 
-		if (resv->fence_shared_count == 0)
+		if (!shared || shared_count == 0)
 			pevents |= POLLOUT;
 
 		spin_lock_irq(&dmabuf->poll.lock);
@@ -167,19 +174,26 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 		spin_unlock_irq(&dmabuf->poll.lock);
 
 		if (events & pevents) {
-			if (!fence_add_callback(resv->fence_excl,
-						&dcb->cb, dma_buf_poll_cb))
+			if (!kref_get_unless_zero(&fence_excl->refcount)) {
+				/* force a recheck */
+				events &= ~pevents;
+				dma_buf_poll_cb(NULL, &dcb->cb);
+			} else if (!fence_add_callback(fence_excl, &dcb->cb,
+						       dma_buf_poll_cb)) {
 				events &= ~pevents;
-			else
+				fence_put(fence_excl);
+			} else {
 				/*
 				 * No callback queued, wake up any additional
 				 * waiters.
 				 */
+				fence_put(fence_excl);
 				dma_buf_poll_cb(NULL, &dcb->cb);
+			}
 		}
 	}
 
-	if ((events & POLLOUT) && resv->fence_shared_count > 0) {
+	if ((events & POLLOUT) && shared && shared_count > 0) {
 		struct dma_buf_poll_cb_t *dcb = &dmabuf->cb_shared;
 		int i;
@@ -194,20 +208,34 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 		if (!(events & POLLOUT))
 			goto out;
 
-		for (i = 0; i < resv->fence_shared_count; ++i)
-			if (!fence_add_callback(resv->fence_shared[i],
-						&dcb->cb, dma_buf_poll_cb)) {
+		for (i = 0; i < shared_count; ++i) {
+			struct fence *fence = ACCESS_ONCE(shared->fence[i]);
+			if (!kref_get_unless_zero(&fence->refcount)) {
+				/*
+				 * fence refcount dropped to zero, this means
+				 * that the shared object has been freed from under us.
+				 * call dma_buf_poll_cb and force a recheck!
+				 */
 				events &= ~POLLOUT;
+				dma_buf_poll_cb(NULL, &dcb->cb);
 				break;
 			}
+			if (!fence_add_callback(fence, &dcb->cb,
+						dma_buf_poll_cb)) {
+				fence_put(fence);
+				events &= ~POLLOUT;
+				break;
+			}
+			fence_put(fence);
+		}
 
 		/* No callback queued, wake up any additional waiters. */
-		if (i == resv->fence_shared_count)
+		if (i == shared_count)
 			dma_buf_poll_cb(NULL, &dcb->cb);
 	}
 
 out:
-	ww_mutex_unlock(&resv->lock);
+	rcu_read_unlock();
 	return events;
 }
diff --git a/drivers/base/fence.c b/drivers/base/fence.c
index 8fff13fb86cf..be03a9e8ad8b 100644
--- a/drivers/base/fence.c
+++ b/drivers/base/fence.c
@@ -170,7 +170,7 @@ void release_fence(struct kref *kref)
 	if (fence->ops->release)
 		fence->ops->release(fence);
 	else
-		kfree(fence);
+		kfree_rcu(fence, rcu);
 }
 EXPORT_SYMBOL(release_fence);
diff --git a/include/linux/fence.h b/include/linux/fence.h
index 65f2a01ee7e4..d19620405c34 100644
--- a/include/linux/fence.h
+++ b/include/linux/fence.h
@@ -40,6 +40,7 @@ struct fence_cb;
  * struct fence - software synchronization primitive
  * @refcount: refcount for this fence
  * @ops: fence_ops associated with this fence
+ * @rcu: used for releasing fence with kfree_rcu
  * @cb_list: list of all callbacks to call
  * @lock: spin_lock_irqsave used for locking
  * @context: execution context this fence belongs to, returned by
@@ -73,6 +74,7 @@ struct fence_cb;
 struct fence {
 	struct kref refcount;
 	const struct fence_ops *ops;
+	struct rcu_head rcu;
 	struct list_head cb_list;
 	spinlock_t *lock;
 	unsigned context, seqno;
diff --git a/include/linux/reservation.h b/include/linux/reservation.h
index f3f57460a205..91576fabafdb 100644
--- a/include/linux/reservation.h
+++ b/include/linux/reservation.h
@@ -42,6 +42,7 @@
 #include <linux/ww_mutex.h>
 #include <linux/fence.h>
 #include <linux/slab.h>
+#include <linux/rcupdate.h>
 
 extern struct ww_class reservation_ww_class;
 
@@ -49,8 +50,12 @@ struct reservation_object {
 	struct ww_mutex lock;
 
 	struct fence *fence_excl;
-	struct fence **fence_shared;
-	u32 fence_shared_count, fence_shared_max;
+	u32 shared_max_fence;
+	struct reservation_object_shared {
+		struct rcu_head rcu;
+		u32 count;
+		struct fence *fence[];
+	} *shared;
 };
 
 static inline void
@@ -58,8 +63,8 @@ reservation_object_init(struct reservation_object *obj)
 {
 	ww_mutex_init(&obj->lock, &reservation_ww_class);
 
-	obj->fence_shared_count = obj->fence_shared_max = 0;
-	obj->fence_shared = NULL;
+	obj->shared = NULL;
+	obj->shared_max_fence = 4;
 	obj->fence_excl = NULL;
 }
@@ -70,11 +75,97 @@ reservation_object_fini(struct reservation_object *obj)
 {
 	int i;
 
 	if (obj->fence_excl)
 		fence_put(obj->fence_excl);
-	for (i = 0; i < obj->fence_shared_count; ++i)
-		fence_put(obj->fence_shared[i]);
-	kfree(obj->fence_shared);
+	if (obj->shared) {
+		for (i = 0; i < obj->shared->count; ++i)
+			fence_put(obj->shared->fence[i]);
+
+		/*
+		 * This object should be dead and all references must have
+		 * been released to it, so no need to free with rcu.
+		 */
+		kfree(obj->shared);
+	}
 
 	ww_mutex_destroy(&obj->lock);
 }
+/*
+ * Reserve space to add a shared fence to a reservation_object,
+ * must be called with obj->lock held.
+ */
+static inline int
+reservation_object_reserve_shared_fence(struct reservation_object *obj)
+{
+	struct reservation_object_shared *shared, *old;
+	u32 max = obj->shared_max_fence;
+
+	if (obj->shared) {
+		if (obj->shared->count < max)
+			return 0;
+		max *= 2;
+	}
+
+	shared = kmalloc(offsetof(typeof(*shared), fence[max]), GFP_KERNEL);
+	if (!shared)
+		return -ENOMEM;
+	old = obj->shared;
+
+	if (old) {
+		shared->count = old->count;
+		memcpy(shared->fence, old->fence, old->count * sizeof(*old->fence));
+	} else {
+		shared->count = 0;
+	}
+	rcu_assign_pointer(obj->shared, shared);
+	obj->shared_max_fence = max;
+	if (old) kfree_rcu(old, rcu);
+	return 0;
+}
+
+/*
+ * Add a fence to a shared slot, obj->lock must be held, and
+ * reservation_object_reserve_shared_fence has been called.
+ */
+
+static inline void
+reservation_object_add_shared_fence(struct reservation_object *obj,
+				    struct fence *fence)
+{
+	unsigned i;
+
+	BUG_ON(obj->shared->count == obj->shared_max_fence);
+	fence_get(fence);
+
+	for (i = 0; i < obj->shared->count; ++i)
+		if (obj->shared->fence[i]->context == fence->context) {
+			struct fence *old = obj->shared->fence[i];
+			rcu_assign_pointer(obj->shared->fence[i], fence);
+			fence_put(old);
+			return;
+		}
+
+	obj->shared->fence[obj->shared->count] = fence;
+	smp_wmb();
+	obj->shared->count++;
+}
+
+/*
+ * May be called after adding an exclusive to wipe all shared fences.
+ */
+
+static inline void
+reservation_object_clear_shared(struct reservation_object *obj)
+{
+	struct reservation_object_shared *old = obj->shared;
+	unsigned i;
+
+	if (!old)
+		return;
+
+	rcu_assign_pointer(obj->shared, NULL);
+	for (i = 0; i < old->count; ++i)
+		fence_put(old->fence[i]);
+	kfree_rcu(old, rcu);
+}
+
 #endif /* _LINUX_RESERVATION_H */

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel



