Hi Andrey,

Sorry for the confusion, I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.

On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
> No problem if so, but before I do,
>
>
> JingWen - why do you think this patch is needed as a standalone now? It has no use without the
> entire feature together with it. Are there some changes you want to do on top of that code?
>
>
> Andrey
>
>
> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>
>> [Public]
>>
>>
>> If it applies cleanly, feel free to drop it in. I'll drop those patches for drm-next since they are already in drm-misc.
>>
>> Alex
>>
>> ------------------------------------------------------------------------
>> *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>> *Sent:* Thursday, February 24, 2022 11:24 AM
>> *To:* Chen, JingWen <JingWen.Chen2@xxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
>> *Cc:* Liu, Monk <Monk.Liu@xxxxxxx>; Chen, Horace <Horace.Chen@xxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; daniel@xxxxxxxx <daniel@xxxxxxxx>
>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready
>> No, because the whole patch set, including this patch, landed in
>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>> rebase, I guess.
>>
>> Andrey
>>
>> On 2022-02-24 01:47, JingWen Chen wrote:
>> > Hi Andrey,
>> >
>> > Will you port this patch into amd-staging-drm-next?
>> >
>> > On 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>> >> All comments are fixed and the code is pushed. Thanks to everyone
>> >> who helped with reviewing.
>> >>
>> >> Andrey
>> >>
>> >> On 2022-02-09 02:53, Christian König wrote:
>> >>> On 09.02.22 at 01:23, Andrey Grodzovsky wrote:
>> >>>> Before we initialize schedulers we must know which reset
>> >>>> domain we are in - for a single device there is a single
>> >>>> domain per device and so a single wq per device. For XGMI
>> >>>> the reset domain spans the entire XGMI hive and so the
>> >>>> reset wq is per hive.
>> >>>>
>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>> >>> One more comment below, with that fixed Reviewed-by: Christian König <christian.koenig@xxxxxxx>.
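For context on the commit message above: drm_sched_init() in this kernel takes a
timeout workqueue parameter, and the patch points every scheduler in a reset
domain at that domain's single workqueue, so all scheduler timeout (TDR)
handling for the domain is serialized. A minimal sketch of the idea follows,
assuming the domain wraps an ordered workqueue as the rest of the series does;
the struct layout and helper name here are illustrative, not the exact code
from the series.

#include <linux/workqueue.h>

/* Sketch only: a reset domain is "the set of devices that must be reset
 * together", plus the one workqueue that serializes every reset/TDR work
 * item submitted for that set. */
struct amdgpu_reset_domain {
	struct workqueue_struct *wq;
};

/* One domain per device, or one shared by every device in an XGMI hive. */
static int amdgpu_reset_domain_init_sketch(struct amdgpu_reset_domain *domain)
{
	/* An ordered workqueue executes at most one work item at a time,
	 * so a scheduler timeout handler queued here waits for any reset
	 * already running on the hive - the ordering the SRIOV TDR
	 * sequence depends on. */
	domain->wq = alloc_ordered_workqueue("amdgpu-reset-dom", 0);
	if (!domain->wq)
		return -ENOMEM;

	return 0;
}

Schedulers initialized with this wq as their timeout workqueue (the
adev->reset_domain.wq argument in the hunk below) therefore cannot run their
TDR work concurrently with a hive-wide reset.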
>> >>>
>> >>>> ---
>> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
>> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
>> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>> >>>>   3 files changed, 51 insertions(+), 30 deletions(-)
>> >>>>
>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> index 9704b0e1fd82..00123b0013d3 100644
>> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device *adev)
>> >>>>      return r;
>> >>>>  }
>> >>>>
>> >>>> +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>> >>>> +{
>> >>>> +    long timeout;
>> >>>> +    int r, i;
>> >>>> +
>> >>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> >>>> +        struct amdgpu_ring *ring = adev->rings[i];
>> >>>> +
>> >>>> +        /* No need to setup the GPU scheduler for rings that don't need it */
>> >>>> +        if (!ring || ring->no_scheduler)
>> >>>> +            continue;
>> >>>> +
>> >>>> +        switch (ring->funcs->type) {
>> >>>> +        case AMDGPU_RING_TYPE_GFX:
>> >>>> +            timeout = adev->gfx_timeout;
>> >>>> +            break;
>> >>>> +        case AMDGPU_RING_TYPE_COMPUTE:
>> >>>> +            timeout = adev->compute_timeout;
>> >>>> +            break;
>> >>>> +        case AMDGPU_RING_TYPE_SDMA:
>> >>>> +            timeout = adev->sdma_timeout;
>> >>>> +            break;
>> >>>> +        default:
>> >>>> +            timeout = adev->video_timeout;
>> >>>> +            break;
>> >>>> +        }
>> >>>> +
>> >>>> +        r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>> >>>> +                   ring->num_hw_submission, amdgpu_job_hang_limit,
>> >>>> +                   timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
>> >>>> +        if (r) {
>> >>>> +            DRM_ERROR("Failed to create scheduler on ring %s.\n",
>> >>>> +                  ring->name);
>> >>>> +            return r;
>> >>>> +        }
>> >>>> +    }
>> >>>> +
>> >>>> +    return 0;
>> >>>> +}
>> >>>> +
>> >>>> +
>> >>>>  /**
>> >>>>   * amdgpu_device_ip_init - run init for hardware IPs
>> >>>>   *
>> >>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
>> >>>>          }
>> >>>>      }
>> >>>>
>> >>>> +    r = amdgpu_device_init_schedulers(adev);
>> >>>> +    if (r)
>> >>>> +        goto init_failed;
>> >>>> +
>> >>>>      /* Don't init kfd if whole hive need to be reset during init */
>> >>>>      if (!adev->gmc.xgmi.pending_reset)
>> >>>>          amdgpu_amdkfd_device_init(adev);
>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> >>>> index 45977a72b5dd..fa302540c69a 100644
>> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> >>>> @@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>> >>>>                    atomic_t *sched_score)
>> >>>>  {
>> >>>>      struct amdgpu_device *adev = ring->adev;
>> >>>> -    long timeout;
>> >>>> -    int r;
>> >>>>
>> >>>>      if (!adev)
>> >>>>          return -EINVAL;
>> >>>> @@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>> >>>>      spin_lock_init(&ring->fence_drv.lock);
>> >>>>      ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
>> >>>>                       GFP_KERNEL);
>> >>>> -    if (!ring->fence_drv.fences)
>> >>>> -        return -ENOMEM;
>> >>>> -    /* No need to setup the GPU scheduler for rings that don't need it */
>> >>>> -    if (ring->no_scheduler)
>> >>>> -        return 0;
>> >>>> +    ring->num_hw_submission = num_hw_submission;
>> >>>> +    ring->sched_score = sched_score;
>> >>> Let's move this into the caller and then use ring->num_hw_submission in the fence code as well.
>> >>>
>> >>> The maximum number of jobs on the ring is not really fence specific.
>> >>>
>> >>> Regards,
>> >>> Christian.
>> >>>
>> >>>> -    switch (ring->funcs->type) {
>> >>>> -    case AMDGPU_RING_TYPE_GFX:
>> >>>> -        timeout = adev->gfx_timeout;
>> >>>> -        break;
>> >>>> -    case AMDGPU_RING_TYPE_COMPUTE:
>> >>>> -        timeout = adev->compute_timeout;
>> >>>> -        break;
>> >>>> -    case AMDGPU_RING_TYPE_SDMA:
>> >>>> -        timeout = adev->sdma_timeout;
>> >>>> -        break;
>> >>>> -    default:
>> >>>> -        timeout = adev->video_timeout;
>> >>>> -        break;
>> >>>> -    }
>> >>>> -
>> >>>> -    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>> >>>> -               num_hw_submission, amdgpu_job_hang_limit,
>> >>>> -               timeout, NULL, sched_score, ring->name);
>> >>>> -    if (r) {
>> >>>> -        DRM_ERROR("Failed to create scheduler on ring %s.\n",
>> >>>> -              ring->name);
>> >>>> -        return r;
>> >>>> -    }
>> >>>> +    if (!ring->fence_drv.fences)
>> >>>> +        return -ENOMEM;
>> >>>>
>> >>>>      return 0;
>> >>>>  }
>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> >>>> index fae7d185ad0d..7f20ce73a243 100644
>> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> >>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>> >>>>      bool            has_compute_vm_bug;
>> >>>>      bool            no_scheduler;
>> >>>>      int             hw_prio;
>> >>>> +    unsigned        num_hw_submission;
>> >>>> +    atomic_t        *sched_score;
>> >>>>  };
>> >>>>
>> >>>>  #define amdgpu_ring_parse_cs(r, p, ib) ((r)->funcs->parse_cs((p), (ib)))
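To make Christian's suggestion above concrete: the two assignments would move
out of amdgpu_fence_driver_init_ring() into its caller (e.g.
amdgpu_ring_init()), and the fence code would then size its array from the
ring rather than from a parameter. A rough sketch of that shape; it is
illustrative, not the committed follow-up:

	/* In the caller: record the ring-wide limits first ... */
	ring->num_hw_submission = num_hw_submission;
	ring->sched_score = sched_score;

	r = amdgpu_fence_driver_init_ring(ring);
	if (r)
		return r;

	/* ... and in amdgpu_fence_driver_init_ring(), which would no longer
	 * take num_hw_submission/sched_score parameters, the fence array is
	 * sized from the ring's own field: */
	ring->fence_drv.fences = kcalloc(ring->num_hw_submission * 2,
					 sizeof(void *), GFP_KERNEL);
	if (!ring->fence_drv.fences)
		return -ENOMEM;

This makes the maximum number of in-flight jobs a property of the ring
itself, read by both the fence driver and amdgpu_device_init_schedulers(),
rather than something only the fence code knows.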