On 11.06.21 at 11:33, Daniel Vetter wrote:
On Fri, Jun 11, 2021 at 09:42:07AM +0200, Christian König wrote:
On 11.06.21 at 09:20, Daniel Vetter wrote:
On Fri, Jun 11, 2021 at 8:55 AM Christian König
<christian.koenig@xxxxxxx> wrote:
On 10.06.21 at 22:42, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 10:10 PM Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
On Thu, Jun 10, 2021 at 8:35 AM Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
On Thu, Jun 10, 2021 at 6:30 AM Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
On Thu, Jun 10, 2021 at 11:39 AM Christian König
<christian.koenig@xxxxxxx> wrote:
On 10.06.21 at 11:29, Tvrtko Ursulin wrote:
On 09/06/2021 22:29, Jason Ekstrand wrote:
We've tried to keep it somewhat contained by doing most of the hard work
to prevent access of recycled objects via dma_fence_get_rcu_safe().
However, a quick grep of kernel sources says that, of the 30 instances
of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe().
It's likely there are bear traps in DRM and related subsystems just waiting
for someone to accidentally step in them.
...because dma_fence_get_rcu_safe appears to be about whether the
*pointer* to the fence itself is rcu protected, not about the fence
object itself.
Yes, exactly that.
The fact that both of you think this either means that I've completely
missed what's going on with RCUs here (possible but, in this case, I
think unlikely) or RCUs on dma fences should scare us all.
Taking a step back for a second and ignoring SLAB_TYPESAFE_BY_RCU as
such, I'd like to ask a slightly different question: What are the
rules about what is allowed to be done under the RCU read lock and
what guarantees does a driver need to provide?
I think so far that we've all agreed on the following:
1. Freeing an unsignaled fence is ok as long as it doesn't have any
pending callbacks. (Callbacks should hold a reference anyway).
2. The pointer race solved by dma_fence_get_rcu_safe is real and
requires the loop to sort out.
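
For readers following along, the loop in point 2 can be sketched as a small userspace model with C11 atomics. This is only an illustration of the retry shape, not the kernel implementation: every model_* name is hypothetical, and a plain atomic pointer stands in for the RCU-protected slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct dma_fence; only the refcount matters here. */
struct model_fence {
	atomic_int refcount;
};

/* Like kref_get_unless_zero(): take a reference only if still alive. */
static bool model_fence_tryget(struct model_fence *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check runs again. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void model_fence_put(struct model_fence *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);	/* a real put would free the fence at zero */
}

/* Retry until our reference belongs to the fence the slot still points at. */
static struct model_fence *model_fence_get_safe(struct model_fence *_Atomic *slot)
{
	for (;;) {
		struct model_fence *f = atomic_load(slot);

		if (!f)
			return NULL;
		if (!model_fence_tryget(f))
			continue;	/* fence is dying: reload the slot */
		if (f == atomic_load(slot))
			return f;	/* slot unchanged: reference is valid */
		model_fence_put(f);	/* slot moved on: drop and retry */
	}
}
```

The second load of the slot is what catches the pointer race: if the slot changed between the first load and the successful tryget, the reference may belong to a fence that no longer represents this slot, so it is dropped and the loop starts over.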
But let's say I have a dma_fence pointer that I got from, say, calling
dma_resv_excl_fence() under rcu_read_lock(). What am I allowed to do
with it under the RCU lock? What assumptions can I make? Is this
code, for instance, ok?
rcu_read_lock();
fence = dma_resv_excl_fence(obj);
idle = !fence || test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
rcu_read_unlock();
This code very much looks correct under the following assumptions:
1. A valid fence pointer stays alive under the RCU read lock
2. SIGNALED_BIT is set-once (it's never unset after being set).
However, if it were, we wouldn't have dma_resv_test_signaled(), now
would we? :-)
The moment you introduce ANY dma_fence recycling that recycles a
dma_fence within a single RCU grace period, all your assumptions break
down. SLAB_TYPESAFE_BY_RCU is just one way that i915 does this. We
also have a little i915_request recycler to try and help with memory
pressure scenarios in certain critical sections that also doesn't
respect RCU grace periods. And, as mentioned multiple times, our
recycling leaks into every other driver because, thanks to i915's
choice, the above 4-line code snippet isn't valid ANYWHERE in the
kernel.
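
To make the hazard concrete, here is a deliberately tiny single-threaded userspace model. It assumes a hypothetical one-slot allocator that hands memory back out immediately, with nothing like a grace period in between; none of the model_* names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a one-slot allocator that reuses memory immediately. */
struct model_fence {
	unsigned long seqno;
	bool signaled;
	bool in_use;
};

static struct model_fence model_slot;

static struct model_fence *model_fence_alloc(unsigned long seqno)
{
	assert(!model_slot.in_use);
	model_slot.seqno = seqno;
	model_slot.signaled = false;	/* recycling resets the "set-once" bit */
	model_slot.in_use = true;
	return &model_slot;
}

static void model_fence_signal(struct model_fence *f)
{
	f->signaled = true;
}

static void model_fence_free(struct model_fence *f)
{
	f->in_use = false;	/* memory is reusable right away */
}

/* Returns what a reader holding a stale pointer observes after recycling. */
static bool model_stale_reader_sees_signaled(void)
{
	struct model_fence *stale = model_fence_alloc(1);	/* reader samples the pointer */

	model_fence_signal(stale);	/* fence 1 completes ... */
	model_fence_free(stale);	/* ... and is released */
	(void)model_fence_alloc(2);	/* same memory, new unsignaled fence */

	return stale->signaled;	/* reads fence 2's state, not fence 1's */
}
```

In this model the reader sampled a pointer to fence 1, which did signal, yet it reads "not signaled" because the memory now describes fence 2. The bit was never cleared on fence 1 itself; the object underneath the pointer changed identity.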
So the question I'm raising isn't so much about the rules today.
Today, we live in the wild wild west where everything is YOLO. But
where do we want to go? Do we like this wild west world? Do we want
more consistency under the RCU read lock? If so, what do we want the
rules to be?
One option would be to accept the wild-west world we live in and say
"The RCU read lock gains you nothing. If you want to touch the guts
of a dma_fence, take a reference". But, at that point, we're eating
two atomics for every time someone wants to look at a dma_fence. Do
we want that?
Alternatively, and this is what I think Daniel and I were trying to
propose here, is that we place some constraints on dma_fence
recycling. Specifically that, under the RCU read lock, the fence
doesn't suddenly become a new fence. All of the immutability and
once-mutability guarantees of various bits of dma_fence hold as long
as you have the RCU read lock.
Yeah this is suboptimal. Too many potential bugs, not enough benefits.
This entire __rcu business started so that there would be a lockless
way to get at fences, or at least the exclusive one. That did not
really pan out. I think we have a few options:
- drop the idea of rcu/lockless dma-fence access outright. A quick
sequence of grabbing the lock, acquiring the dma_fence and then
dropping your lock again is probably plenty good. There's a lot of
call_rcu and other stuff we could probably delete. I have no idea what
the perf impact across all the drivers would be.
The question is maybe not the perf impact, but rather whether that is
possible overall.
IIRC we now have some cases in TTM where RCU is mandatory and we simply
don't have any other choice than using it.
Adding Thomas Hellstrom.
Where is that stuff? If we end up with all the dma_resv locking
complexity just for an oddball, then I think that would be a rather big
bummer.
This is during buffer destruction. See the call to dma_resv_copy_fences().
Ok yeah that's tricky.
The way we solved this in i915 is with a trylock and punting to a worker
queue if the trylock fails. And the worker queue would also be flushed
from the shrinker (once we get there at least).
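
A rough userspace sketch of that trylock-and-punt shape follows. It is an assumption-laden model rather than i915 code: an atomic_flag stands in for the reservation lock, a singly linked list stands in for the worker queue, and all model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_bo {
	atomic_flag resv_lock;	/* stand-in for the reservation lock */
	bool cleaned;
	struct model_bo *next_deferred;
};

static struct model_bo *model_deferred_list;	/* stand-in for a worker queue */

static bool model_trylock(struct model_bo *bo)
{
	return !atomic_flag_test_and_set(&bo->resv_lock);
}

static void model_unlock(struct model_bo *bo)
{
	atomic_flag_clear(&bo->resv_lock);
}

/* Returns true if cleanup ran inline, false if it was punted. */
static bool model_bo_destroy(struct model_bo *bo)
{
	assert(!bo->cleaned);
	if (model_trylock(bo)) {
		bo->cleaned = true;	/* lock held: clean up right here */
		model_unlock(bo);
		return true;
	}
	bo->next_deferred = model_deferred_list;	/* contended: defer */
	model_deferred_list = bo;
	return false;
}

/* What the worker (or a shrinker flush) would do; returns objects cleaned. */
static int model_flush_deferred(void)
{
	struct model_bo *bo = model_deferred_list, *still_busy = NULL;
	int cleaned = 0;

	while (bo) {
		struct model_bo *next = bo->next_deferred;

		if (model_trylock(bo)) {
			bo->cleaned = true;
			model_unlock(bo);
			cleaned++;
		} else {	/* still contended: keep it queued */
			bo->next_deferred = still_busy;
			still_busy = bo;
		}
		bo = next;
	}
	model_deferred_list = still_busy;
	return cleaned;
}
```

A real worker would normally be allowed to sleep on the lock; the model requeues instead only because it is single-threaded.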
That's what we already had done here as well, but the worker is exactly
what we wanted to avoid by this.
So this looks fixable.
I'm not sure of that. We had really good reasons to remove the worker.
But that is basically just using a dma_resv function which accesses the
object without taking a lock.
The other one I've found is the ghost object, but that one is locked
fully.
- try to make all drivers follow some stricter rules. The trouble is
that at least with radeon dma_fence callbacks aren't even very
reliable (that's why it has its own dma_fence_wait implementation), so
things are wobbly anyway.
- live with the current situation, but radically delete all unsafe
interfaces. I.e. nothing is allowed to directly deref an rcu fence
pointer, everything goes through dma_fence_get_rcu_safe. The
kref_get_unless_zero would become an internal implementation detail.
Our "fast" and "lockless" dma_resv fence access stays a pile of
seqlock, retry loop and a conditional atomic inc + atomic dec. The
only thing that's slightly faster would be dma_resv_test_signaled().
- I guess minimally we should rename dma_fence_get_rcu to
dma_fence_tryget. It has nothing to do with rcu really, and the use is
very, very limited.
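
For reference, the "seqlock, retry loop" part of the third option looks roughly like the following in a userspace model. This is a sketch under simplifying assumptions: sequentially consistent C11 atomics replace the kernel's seqcount primitives, the protected data is reduced to two counters, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_resv {
	atomic_uint seq;	/* odd while a writer is mid-update */
	atomic_int fence_count;
	atomic_ulong last_seqno;
};

struct model_snapshot {
	int fence_count;
	unsigned long last_seqno;
};

static unsigned int model_read_begin(struct model_resv *r)
{
	unsigned int s;

	while ((s = atomic_load(&r->seq)) & 1)
		;	/* writer in progress: spin until the count is even */
	return s;
}

static bool model_read_retry(struct model_resv *r, unsigned int start)
{
	return atomic_load(&r->seq) != start;
}

/* Writer side; real code would serialize writers with a lock. */
static void model_resv_add_fence(struct model_resv *r, unsigned long seqno)
{
	unsigned int old = atomic_fetch_add(&r->seq, 1);	/* becomes odd */

	assert((old & 1) == 0);
	atomic_fetch_add(&r->fence_count, 1);
	atomic_store(&r->last_seqno, seqno);
	atomic_fetch_add(&r->seq, 1);	/* even again: update published */
}

/* Lockless reader: copy the fields, retry if a writer interfered. */
static struct model_snapshot model_resv_snapshot(struct model_resv *r)
{
	struct model_snapshot snap;
	unsigned int s;

	do {
		s = model_read_begin(r);
		snap.fence_count = atomic_load(&r->fence_count);
		snap.last_seqno = atomic_load(&r->last_seqno);
	} while (model_read_retry(r, s));

	return snap;
}
```

The retry loop only guarantees that the copied values are mutually consistent. Keeping a fence alive past the loop still needs the conditional reference from the tryget step, which is where the extra atomic inc and dec come from.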
I think what we should do is to use RCU internally in the dma_resv
object but disallow drivers/frameworks to mess with that directly.
In other words drivers should use one of the following:
1. dma_resv_wait_timeout()
2. dma_resv_test_signaled()
3. dma_resv_copy_fences()
4. dma_resv_get_fences()
5. dma_resv_for_each_fence() <- to be implemented
6. dma_resv_for_each_fence_unlocked() <- to be implemented
Inside those functions we then make sure that we only use safe ways of
accessing the RCU protected data structures.
This way we only need to make sure that those accessor functions are
sane and don't need to audit every driver individually.
Yeah better encapsulation for dma_resv sounds like a good thing, at least
for all the other issues we've been discussing recently. I guess your
list is also missing the various "add/replace some more fences"
functions, but we have them already.
I can tackle implementing dma_resv_for_each_fence()/_unlocked().
Already got a large bunch of that coded out anyway.
When/where do we need to iterate over fences unlocked? Given how much
pain it is to get a consistent snapshot of the fences or fence state
(I've read the dma-buf poll implementation, and it looks a bit buggy
in that regard, but I'm not sure; just as an example), an unlocked
iterator sounds very dangerous to me.
This is to make implementation of the other functions easier. Currently they
basically each roll their own loop implementation which at least for
dma_resv_test_signaled() looks a bit questionable to me.
In addition to those we have one more case in i915 and the unlocked
polling implementation which I agree is a bit questionable as well.
Yeah, the more I look at any of these lockless loop things the more I'm
worried. 90% sure the one in dma_buf_poll is broken too.
My idea is to have the problematic logic in the iterator and only give back
fences which hold a reference and are 100% sure to be the right ones.
Probably best if I show some code around to explain what I mean.
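
As an editorial illustration only (this is not the code referred to above, and not a proposed kernel API), the idea of an iterator that hides the retry logic and hands back only referenced fences can be modelled in userspace like this. All model_* names and the fixed-size array are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_FENCES 4

struct model_fence {
	atomic_int refcount;
};

struct model_resv {
	atomic_uint seq;	/* odd while a writer is mid-update */
	atomic_uint count;
	struct model_fence *_Atomic fences[MODEL_MAX_FENCES];
};

struct model_iter {
	struct model_resv *resv;
	struct model_fence *fence;	/* current fence, reference held */
	unsigned int seq;
	unsigned int index;
	bool restarted;	/* set when the walk had to start over */
};

static bool model_fence_tryget(struct model_fence *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0)
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	return false;
}

static void model_fence_put(struct model_fence *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);	/* a real put would free the fence at zero */
}

/* Writer side, assumed to be serialized by a lock in real code. */
static void model_resv_add(struct model_resv *resv, struct model_fence *f)
{
	unsigned int n = atomic_load(&resv->count);

	assert(n < MODEL_MAX_FENCES);
	atomic_fetch_add(&resv->seq, 1);	/* odd: update in progress */
	atomic_store(&resv->fences[n], f);
	atomic_store(&resv->count, n + 1);
	atomic_fetch_add(&resv->seq, 1);	/* even: update published */
}

static void model_iter_begin(struct model_iter *it, struct model_resv *resv)
{
	it->resv = resv;
	it->fence = NULL;
	it->seq = atomic_load(&resv->seq);
	it->index = 0;
	it->restarted = false;
}

/* Returns the next fence with a reference held, or NULL at the end. */
static struct model_fence *model_iter_next(struct model_iter *it)
{
	if (it->fence) {	/* drop the reference from the previous step */
		model_fence_put(it->fence);
		it->fence = NULL;
	}
	for (;;) {
		unsigned int seq = atomic_load(&it->resv->seq);
		struct model_fence *f = NULL;

		if (seq & 1)
			continue;	/* writer in progress: resample */
		if (seq != it->seq) {	/* object changed: start over */
			it->seq = seq;
			it->index = 0;
			it->restarted = true;
		}
		if (it->index < atomic_load(&it->resv->count)) {
			f = atomic_load(&it->resv->fences[it->index]);
			if (!f || !model_fence_tryget(f))
				continue;	/* slot in flux: resample */
		}
		if (atomic_load(&it->resv->seq) != it->seq) {
			if (f)
				model_fence_put(f);
			continue;	/* raced with a writer: retry */
		}
		if (f)
			it->index++;
		it->fence = f;
		return f;
	}
}
```

A caller would loop on model_iter_next() until it returns NULL and check the restarted flag to discard partial results gathered before a restart; the seqcount check, the tryget and the reference drop all stay inside the iterator.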
My gut feeling is that we should just try and convert them all over to
taking the dma_resv_lock. And if there is really a contention issue with
that, then either try to shrink it, or make it an rwlock or similar. But
the more of these implementations I read, the more bugs I see and the more
questions I have.
How about we abstract all that funny rcu dance inside the iterator instead?
I mean when we just have one walker function which is well documented
and understood then the rest becomes relatively easy.
Christian.
Maybe at the end a few will be left over, and then we can look at these
individually in detail. Like the ttm_bo_individualize_resv situation.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx