Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

Rob Clark <robdclark@xxxxxxxxx> · Thu, 10 Mar 2022 11:56:41 -0800

On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
<shashank.sharma@xxxxxxx> wrote:
>
>
>
> On 3/10/2022 8:35 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
> > <shashank.sharma@xxxxxxx> wrote:
> >>
> >>
> >>
> >> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> >>>
> >>>
> >>> On 3/10/2022 9:40 AM, Rob Clark wrote:
> >>>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
> >>>> <shashank.sharma@xxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
> >>>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> >>>>>> <shashank.sharma@xxxxxxx> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
> >>>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>>>>>>> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>>>>>>>> <contactshashanksharma@xxxxxxxxx> wrote:
> >>>>>>>>>>> From: Shashank Sharma <shashank.sharma@xxxxxxx>
> >>>>>>>>>>>
> >>>>>>>>>>> This patch adds a new sysfs event, which will indicate
> >>>>>>>>>>> the userland about a GPU reset, and can also provide
> >>>>>>>>>>> some information like:
> >>>>>>>>>>> - process ID of the process involved with the GPU reset
> >>>>>>>>>>> - process name of the involved process
> >>>>>>>>>>> - the GPU status info (using flags)
> >>>>>>>>>>>
> >>>>>>>>>>> This patch also introduces the first flag of the flags
> >>>>>>>>>>> bitmap, which can be appended as and when required.
> >>>>>>>>>> Why invent something new, rather than using the already existing
> >>>>>>>>>> devcoredump?
> >>>>>>>>>
> >>>>>>>>> Yeah, that's a really valid question.
> >>>>>>>>>
> >>>>>>>>>> I don't think we need (or should encourage/allow) something drm
> >>>>>>>>>> specific when there is already an existing solution used by both
> >>>>>>>>>> drm
> >>>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
> >>>>>>>>>> yet another mechanism to do the same thing.
> >>>>>>>>>
> >>>>>>>>> Question is how is userspace notified about new available core
> >>>>>>>>> dumps?
> >>>>>>>>
> >>>>>>>> I haven't looked into it too closely, as the CrOS userspace
> >>>>>>>> crash-reporter already had support for devcoredump, so it "just
> >>>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
> >>>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
> >>>>>>>
> >>>>>>> I had a quick look at the devcoredump code, and it doesn't look like
> >>>>>>> that is sending an event to the user, so we still need an event to
> >>>>>>> indicate a GPU reset.
> >>>>>>
> >>>>>> There definitely is an event to userspace, I suspect somewhere down
> >>>>>> the device_add() path?
> >>>>>>
> >>>>>
> >>>>> Let me check that out as well, hope that is not due to a driver-private
> >>>>> event for GPU reset, coz I think I have seen some of those in a few DRM
> >>>>> drivers.
> >>>>>
> >>>>
> >>>> Definitely no driver private event for drm/msm .. I haven't dug
> >>>> through it all but this is the collector for devcoredump, triggered
> >>>> somehow via udev.  Most likely from event triggered by device_add()
> >>>>
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7Cb4e920f125ae4d7de29708da02cd3112%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825377562005233%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=M4xHPErex4vn7l3lNPgniiMp%2BKb3SpOHQo2QLAndxDQ%3D&amp;reserved=0
> >>>>
> >>>
> >>> Yes, that is correct. the uevent for devcoredump is from device_add()
> >>>
> >> Yes, I could confirm in the code that device_add() sends a uevent.
> >>
> >> kobject_uevent(&dev->kobj, KOBJ_ADD);
> >>
> >> I was trying to map the ChromiumOs's udev event rules with the event
> >> being sent from device_add(), what I could see is there is only one udev
> >> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
> >>
> >> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
> >>     RUN+="/sbin/crash_reporter
> >> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
> >>
> >> Can someone confirm that this is the rule which gets triggered when a
> >> devcoredump is generated ? I could not find an ERROR=1 string in the
> >> env[] while sending this event from dev_add();
> >
> > I think it is actually this rule:
> >
> > ACTION=="add", SUBSYSTEM=="devcoredump", \
> >    RUN+="/sbin/crash_reporter
> > --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
> >
> > It is something non-drm specific because it supports devcore dumps
> > from non drm drivers.  I know at least some of the wifi and remoteproc
> > drivers use it.
> >
>
> Ah, this seems like a problem for me. I understand it will work for a
> reset/recovery app well, but if a DRM client (like a compositor), who
> wants to listen only to DRM events (like a GPU reset), wouldn't this
> create a lot of noise for it ? Like every time any subsystem produces
> this coredump, there will be a change in devcoresump subsystem, and the
> client will have to parse the core file, and then will have to decide if
> it wants to react to this, or ignore.
>
> Wouldn't a GPU reset event, specific to DRM subsystem server better in
> such case ?
>

So, I suppose there are two different use-cases.. for something like
distro which has generic crash telemetry (ie. when users opt in to
automated crash reporting), and in general for debugging gpu crashes,
you want devcoredump, preferably with plenty of information about gpu
state, etc, so you actually have a chance of debugging problems you
can't necessarily reproduce locally.  Note also that mesa CI has some
limited support for collecting devcore dumps if a CI run triggers a
GPU fault.

For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.

BR,
-R