Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

Christian König <christian.koenig@xxxxxxx> · Mon, 16 Dec 2024 14:39:43 +0100

    On 16.12.24 at 14:36, Lazar, Lijo wrote:

              I had asked earlier about the utility of this one here. If
              this is just to inform userspace that the driver has done a
              reset and recovered, it would also need some additional
              context. We have a mechanism in KFD which sends the context
              in which a reset has to be done. Currently, that's
              restricted to compute applications, but if this is along
              similar lines, we would like to pass some additional info
              like job timeout, RAS error etc.

            DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver
            has done a reset and recovered, but additional data, like
            which job timed out, the RAS error and such, belongs in
            devcoredump I guess, where all the data is gathered and
            collected later.

          I think somebody else also mentioned that the source of the
          issue, e.g. the PID of the submitting process, would be helpful
          for supervising daemons which need to restart processes when
          they caused some issue.

        It was me :) We have a use case where the daemon would indeed need
        the PID, but the daemon doesn't need to know what the RAS error is
        or the name of the job that timed out. There's no immediate action
        to be taken with that information, unlike the PID, which we do
        need to know.

      Regarding devcoredump - it's not done every time. For example, RAS
      errors have a different way to identify the source of the error,
      hence we don't need a coredump in such cases.

      The intention is only to let the user know the reason for the reset
      at a high level, and probably add more things later, like the
      engines or queues that have been reset etc.

    Well, what is the use case for that? That doesn't look valuable to
    me.

    RAS errors should generally be reported to the application that
    issued the submission.

    As a system-wide event they are only useful for things like log
    files, I think.

    Regards,

    Christian.

      Thanks,
      Lijo

          We just postponed adding that till later.

          Regards,
          Christian.

              Thanks,
              Lijo

                Regards,
                Christian.