Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

Luben Tuikov <luben.tuikov@xxxxxxx> · Tue, 18 Jul 2023 01:45:44 -0400

On 2023-07-17 22:35, Asahi Lina wrote:
> On 18/07/2023 00.55, Christian König wrote:
>> Am 15.07.23 um 16:14 schrieb alyssa@xxxxxxxxxxxxx:
>>> 15 July 2023 at 00:03, "Luben Tuikov" <luben.tuikov@xxxxxxx> wrote:
>>>> On 2023-07-14 05:57, Christian König wrote:
>>>>
>>>>> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>>>>>
>>>>>> On 14/07/2023 17.43, Christian König wrote:
>>>>>>
>>>>>    Am 14.07.23 um 10:21 schrieb Asahi Lina:
>>>>>    A signaled scheduler fence can outlive its scheduler, since fences are
>>>>>    independencly reference counted. Therefore, we can't reference the
>>>>>    scheduler in the get_timeline_name() implementation.
>>>>>
>>>>>    Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>>    dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>>    Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
>>>>>    ---
>>>>>       drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>>       drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>>       include/drm/gpu_scheduler.h              | 5 +++++
>>>>>       3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>>    diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    index b2bbc8a68b30..17f35b0b005a 100644
>>>>>    --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>    @@ -389,7 +389,12 @@ static bool
>>>>>    drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>>                  /*
>>>>>                * Fence is from the same scheduler, only need to wait for
>>>>>    -         * it to be scheduled
>>>>>    +         * it to be scheduled.
>>>>>    +         *
>>>>>    +         * Note: s_fence->sched could have been freed and reallocated
>>>>>    +         * as another scheduler. This false positive case is okay,
>>>>>    as if
>>>>>    +         * the old scheduler was freed all of its jobs must have
>>>>>    +         * signaled their completion fences.
>>>>>
>>>>>    This is outright nonsense. As long as an entity for a scheduler exists
>>>>>    it is not allowed to free up this scheduler.
>>>>>
>>>>>    So this function can't be called like this.
>>>>>
>>>>>> As I already explained, the fences can outlive their scheduler. That
>>>>>>    means *this* entity certainly exists for *this* scheduler, but the
>>>>>>    *dependency* fence might have come from a past scheduler that was
>>>>>>    already destroyed along with all of its entities, and its address reused.
>>>>>>
>>>>>    
>>>>>    Well this is function is not about fences, this function is a callback
>>>>>    for the entity.
>>>>>    
>>>>>
>>>>>> Christian, I'm really getting tired of your tone. I don't appreciate
>>>>>>    being told my comments are "outright nonsense" when you don't even
>>>>>>    take the time to understand what the issue is and what I'm trying to
>>>>>>    do/document. If you aren't interested in working with me, I'm just
>>>>>>    going to give up on drm_sched, wait until Rust gets workqueue support,
>>>>>>    and reimplement it in Rust. You can keep your broken fence lifetime
>>>>>>    semantics and I'll do my own thing.
>>>>>>
>>>>>    
>>>>>    I'm certainly trying to help here, but you seem to have unrealistic
>>>>>    expectations.
>>>>>    
>>>>>    I perfectly understand what you are trying to do, but you don't seem to
>>>>>    understand that this functionality here isn't made for your use case.
>>>>>    
>>>>>    We can adjust the functionality to better match your requirements, but
>>>>>    you can't say it is broken because it doesn't work when you use it not
>>>>>    in the way it is intended to be used.
>>>>>
>>>> I believe "adjusting" functionality to fit some external requirements,
>>>> may have unintended consequences, requiring yet more and more "adjustments".
>>>> (Or may allow (new) drivers to do wild things which may lead to wild results. :-) )
>>>>
>>>> We need to be extra careful and wary of this.
>>> Either drm/scheduler is common code that we should use for our driver, in which case we need to "adjust" it to fit the requirements of a safe Rust abstraction usable for AGX.
>>
>> Well this is the fundamental disagreement we have. As far as I can see
>> you don't need to adjust anything in the common drm/scheduler code.
>>
>> That code works with quite a bunch of different drivers, including the
>> Intel XE which has similar requirements to your work here.
>>
>> We can talk about gradually improving the common code, but as Luben
>> already wrote as well this needs to be done very carefully.
>>
>>>    Or, drm/scheduler is not common code intended for drivers with our requirements, and then we need to be able to write our own scheduler.
>>>
>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>
>>> I will ask a simple yes/no question: Should we use drm/sched?
>>
>> Well, yes.
>>
>>>
>>> If yes, it will need patches like these,
>>
>> No, you don't.
>>
>> First of all you need to try to adjust your driver to match the
>> requirements of drm/scheduler and *not* the other way around.
>>
>>>    and AMD needs to be ok with that and stop NAK'ing them on sight becuase they don't match the existing requirements.
>>>
>>> If no, we will write our own scheduler in Rust, and AMD needs to be ok with that and not NAK it on sight because it's not drm/sched.
>>>
>>> Which is it?
>>>
>>> Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that and AMD comes back and NAKs it -- as said in this thread would "probably" happen -- then it is impossible for us to upstream a driver regardless of whether we use drm/sched.
>>>
>>> Lina has been polite and accommodating while AMD calls her code "outright nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 where no matter what she does it's NAK'd.
>>
>> Well as far as I can see I'm totally polite as well.
>>
>> Pointing out that approaches doesn't seem to make sense and NAKing
>> patches is a perfectly normal part of the review process.
>>
>> What you need to to is to take a step back and ask yourself why this
>> here is facing so much rejection from our side. I have to admit that I
>> don't seem to be good at explaining that, cause we are obviously talking
>> past each other, but you don't seem to try hard to understand what I'm
>> pointing out either.
>>
>>> That's not ok.
>>
>> As far as I can see it is.
>>
>> As maintainer of a commonly used component my first duty is to preserve
>> the status quo and prevent modifications which are not well thought
>> through. And to be honest those changes here strongly looks like Lina is
>> just adjusting the code to match her requirements without looking left
>> and right first.
>>
>> Regards,
>> Christian.
>>
>>
> 
> I give up. You are ignoring everything we say, and rejecting everything 
> we suggest. We've already explained why drm_sched doesn't work for us. 
> I'm tired of repeating the same explanation over and over again only to 
> be ignored and told I'm wrong.
> 
> I'll start working on a new, much simpler Rust-native scheduler based on 
> the workqueue Rust abstractions which are in review.
> 
> ~~ Lina
> 

Perhaps it is worth having a reset and just trying to clarify requirements
one at a time, even if that involves describing a change on a single line
in a single file.

The maintainer discourse is quite common. Its ultimate goal is to keep
things working. If we let some dependencies loose, or change some requirements,
it's conceivable that this may lead to further problems with new development
of current drivers, as well as new drivers. This will lead to more one-off fixes,
and more "adjustments" to the point where the core requirement is lost,
and the code has lost its purpose and meaning.

Maintainers usually see 10 moves ahead, while driver developers as such in their
role, are expressly concerned with their immediate need to get this and that
driver and its feature working.

We should perhaps concentrate on the core of the requirements--the very
root of what is deemed to need to be changed, to understand why, and
if there's a better way to achieve this, and in general a good reason
for the way to proceed forward, whichever way is taken.

Let's take it one thing at a time, slow and perhaps in a ELI5 manner to make
sure no one misses the other person's point.

Perhaps short and lucid, but complete emails would be best, with code quoted,
and scenarios explained. I know this takes a long time--it's not unusual for me
to take hours to write a single email and I'm exhausted after that.

We all mean well here. Hope we can make something good.
-- 
Regards,
Luben