I had asked earlier about the utility of this one here. If this is just to inform userspace that the driver has done a reset and recovered, it would need some additional context as well. We have a mechanism in KFD which sends the context in which a reset has to be done. Currently, that's restricted to compute applications, but if this is along similar lines, we would like to pass some additional info like job timeout, RAS error etc.

DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a reset and recovered, but additional data such as which job timed out, the RAS error and so on belongs to devcoredump I guess, where all data is gathered and collected later.

I think somebody else mentioned as well that the source of the issue, e.g. the PID of the submitting process, would be helpful for supervising daemons which need to restart processes when they caused some issue.

It was me :) We do have a use case where the daemon would need the PID, but the daemon doesn't need to know what the RAS error is or the name of the job that timed out. There's no immediate action to be taken with that information, contrary to the PID, which we need to know.

Regarding devcoredump - it's not done every time. For example, RAS errors have a different way to identify the source of the error, hence we don't need a coredump in such cases. The intention is only to let the user know the reason for the reset at a high level, and probably add more things later like the engines or queues that have been reset etc.
Well, what is the use case for that? That doesn't look valuable to me.
RAS errors should generally be reported to the application that issued the submission.
As a system-wide event they are only useful in things like log files, I think.
Regards,
Christian.
Thanks,
Lijo

We just postponed adding that till later.

Regards,
Christian.
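
To make the supervising-daemon use case discussed above a bit more concrete, here is a minimal userspace sketch in Python. It assumes the event reaches userspace as a standard NUL-separated uevent payload of KEY=VALUE pairs. The key names WEDGED and PID, the value "none", the sample device path and the helper names are all assumptions for illustration only, since the thread does not settle the final format.

```python
# Hypothetical sketch: pick out the PID a supervising daemon would act on.
# The WEDGED and PID keys are assumed names, not a confirmed interface.

def parse_uevent(payload: bytes) -> dict:
    """Split a NUL-separated uevent payload into a KEY -> VALUE dict."""
    fields = {}
    for part in payload.split(b"\0"):
        text = part.decode("utf-8", errors="replace")
        if "=" in text:
            key, _, value = text.partition("=")
            fields[key] = value
    return fields


def pid_to_restart(fields: dict):
    """Return the PID to restart, or None if there is nothing to act on."""
    if fields.get("WEDGED") != "none":
        return None  # other recovery methods are out of scope for this sketch
    pid = fields.get("PID")
    return int(pid) if pid and pid.isdigit() else None


# Made-up sample payload; the leading "action@devpath" header has no "="
# and is therefore skipped by the parser.
sample = b"change@/devices/example\0ACTION=change\0WEDGED=none\0PID=1234\0"
print(pid_to_restart(parse_uevent(sample)))
```

The daemon only needs the PID here. Details such as the RAS error or the job that timed out are deliberately not parsed, matching the point that there is no immediate action to take on them.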