On 12/16/2024 7:09 PM, Christian König wrote:
> Am 16.12.24 um 14:36 schrieb Lazar, Lijo:
>>>>>> I had asked earlier about the utility of this one here. If this is just
>>>>>> to inform userspace that driver has done a reset and recovered, it
>>>>>> would
>>>>>> need some additional context also. We have a mechanism in KFD which
>>>>>> sends the context in which a reset has to be done. Currently, that's
>>>>>> restricted to compute applications, but if this is in a similar
>>>>>> line, we
>>>>>> would like to pass some additional info like job timeout, RAS error
>>>>>> etc.
>>>>>>
>>>>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that driver has done a
>>>>> reset and recovered, but additional data about like which job
>>>>> timeout, RAS error and such belong to devcoredump I guess, where all
>>>>> data is gathered and collected later.
>>>> I think somebody else mentioned it as well that the source of the
>>>> issue, e.g. the PID of the submitting process would be helpful as well
>>>> for supervising daemons which need to restart processes when they
>>>> caused some issue.
>>>>
>>> It was me :) we have a use case that we would need the PID for the
>>> daemon indeed, but the daemon doesn't need to know what's the RAS error
>>> or the job name that timeouted, there's no immediate action to be taken
>>> with this information, contrary to the PID that we need to know.
>>>
>> Regarding devcoredump - it's not done every time. For ex: RAS errors
>> have a different way to identify the source of error, hence we don't
>> need a coredump in such cases.
>>
>> The intention is only to let the user know the reason for reset at a
>> high level, and probably add more things later like the engines or
>> queues that have reset etc.
>
> Well what is the use case for that? That doesn't looks valuable to me.

It's mostly for in-band telemetry reporting through tools like amd-smi -
more for admin purpose rather than any debug.
Thanks,
Lijo

>
> RAS errors should generally be reported to the application who issued
> the submission.
>
> As a system wide event they are only useful in things like logfiles I think.
>
> Regards,
> Christian.
>
>> Thanks,
>> Lijo
>>
>>>> We just postponed adding that till later.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>