On 14/08/24 14:42, Raag Jadav wrote: > On Wed, Aug 14, 2024 at 12:16:41PM +0530, Aravind Iddamsetty wrote: >> On 13/08/24 22:24, Rodrigo Vivi wrote: >>> On Tue, Aug 13, 2024 at 04:28:32PM +0300, Raag Jadav wrote: >>>> On Mon, Aug 12, 2024 at 03:08:14PM +0530, Aravind Iddamsetty wrote: >>>>> On 12/08/24 13:18, Raag Jadav wrote: >>>>>> From: Himal Prasad Ghimiray <himal.prasad.ghimiray@xxxxxxxxx> >>>>>> >>>>>> This was dropped in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset >>>>>> uevent for now") as part of refactoring. >>>>>> >>>>>> Now that we have better uapi semantics and naming for the uevent, >>>>>> bring it back. With this in place, userspace will be notified of >>>>>> wedged device along with its reason. >>>>>> >>>>>> $ udevadm monitor --property --kernel >>>>>> monitor will print the received events for: >>>>>> KERNEL - the kernel uevent >>>>>> >>>>>> KERNEL[871.188570] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 (pci) >>>>>> ACTION=change >>>>>> DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0 >>>>>> SUBSYSTEM=pci >>>>>> DEVICE_STATUS=NEEDS_RESET >>>>>> REASON=GT_RESET_FAILED >>>>>> TILE_ID=0 >>>>>> GT_ID=0 >>>>>> DRIVER=xe >>>>>> PCI_CLASS=30000 >>>>>> PCI_ID=8086:56B1 >>>>>> PCI_SUBSYS_ID=8086:1210 >>>>>> PCI_SLOT_NAME=0000:03:00.0 >>>>>> MODALIAS=pci:v00008086d000056B1sv00008086sd00001210bc03sc00i00 >>>>>> SEQNUM=6104 >>>>>> >>>>>> v2: Change authorship to Himal (Aravind) >>>>>> Add uevent for all device wedged cases (Aravind) >>>>>> >>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@xxxxxxxxx> >>>>>> Signed-off-by: Raag Jadav <raag.jadav@xxxxxxxxx> >>>>>> --- >>>>>> drivers/gpu/drm/xe/xe_device.c | 10 +++++++++- >>>>>> drivers/gpu/drm/xe/xe_device.h | 2 +- >>>>>> drivers/gpu/drm/xe/xe_gt.c | 23 +++++++++++++++++++---- >>>>>> drivers/gpu/drm/xe/xe_guc.c | 13 ++++++++++++- >>>>>> drivers/gpu/drm/xe/xe_guc_submit.c | 13 ++++++++++++- >>>>>> include/uapi/drm/xe_drm.h | 29 +++++++++++++++++++++++++++++ >>>>>> 6 files changed, 82 insertions(+), 8 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c >>>>>> index 1aba6f9eaa19..d975bdce4a7d 100644 >>>>>> --- a/drivers/gpu/drm/xe/xe_device.c >>>>>> +++ b/drivers/gpu/drm/xe/xe_device.c >>>>>> @@ -955,6 +955,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg) >>>>>> /** >>>>>> * xe_device_declare_wedged - Declare device wedged >>>>>> * @xe: xe device instance >>>>>> + * @event_params: parameters to be sent along with uevent >>>>>> * >>>>>> * This is a final state that can only be cleared with a mudule >>>>>> * re-probe (unbind + bind). >>>>>> @@ -965,8 +966,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg) >>>>>> * on every single execution timeout (a.k.a. GPU hang) right after devcoredump >>>>>> * snapshot capture. In this mode, GT reset won't be attempted so the state of >>>>>> * the issue is preserved for further debugging. >>>>>> + * Caller is expected to pass respective parameters to be sent along with >>>>>> + * uevent. Pass NULL in case of no params. >>>>>> */ >>>>>> -void xe_device_declare_wedged(struct xe_device *xe) >>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **event_params) >>>>>> { >>>>>> struct xe_gt *gt; >>>>>> u8 id; >>>>>> @@ -984,12 +987,17 @@ void xe_device_declare_wedged(struct xe_device *xe) >>>>>> xe_pm_runtime_get_noresume(xe); >>>>>> >>>>>> if (!atomic_xchg(&xe->wedged.flag, 1)) { >>>>>> + struct pci_dev *pdev = to_pci_dev(xe->drm.dev); >>>>>> + >>>>>> xe->needs_flr_on_fini = true; >>>>>> drm_err(&xe->drm, >>>>>> "CRITICAL: Xe has declared device %s as wedged.\n" >>>>>> "IOCTLs and executions are blocked. Only a rebind may clear the failure\n" >>>>>> "Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n", >>>>>> dev_name(xe->drm.dev)); >>>>>> + >>>>>> + /* Notify userspace about reset required */ >>>>>> + kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, event_params); >>>>>> } >>>>>> >>>>>> for_each_gt(gt, xe, id) >>>>>> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h >>>>>> index db6cc8d0d6b8..5d40fc6f0904 100644 >>>>>> --- a/drivers/gpu/drm/xe/xe_device.h >>>>>> +++ b/drivers/gpu/drm/xe/xe_device.h >>>>>> @@ -174,7 +174,7 @@ static inline bool xe_device_wedged(struct xe_device *xe) >>>>>> return atomic_read(&xe->wedged.flag); >>>>>> } >>>>>> >>>>>> -void xe_device_declare_wedged(struct xe_device *xe); >>>>>> +void xe_device_declare_wedged(struct xe_device *xe, char **reset_event); >>>>>> >>>>>> struct xe_file *xe_file_get(struct xe_file *xef); >>>>>> void xe_file_put(struct xe_file *xef); >>>>>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c >>>>>> index 58895ed22f6e..519f3c2cf9e2 100644 >>>>>> --- a/drivers/gpu/drm/xe/xe_gt.c >>>>>> +++ b/drivers/gpu/drm/xe/xe_gt.c >>>>>> @@ -741,6 +741,24 @@ static int do_gt_restart(struct xe_gt *gt) >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +static void xe_gt_reset_failed(struct xe_gt *gt, int err) >>>>>> +{ >>>>>> + char *event_params[5]; >>>>>> + >>>>>> + xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err)); >>>>>> + >>>>>> + event_params[0] = DRM_XE_RESET_REQUIRED_UEVENT; >>>>>> + event_params[1] = DRM_XE_RESET_REQUIRED_UEVENT_REASON_GT; >>>>>> + event_params[2] = kasprintf(GFP_KERNEL, "TILE_ID=%d", gt_to_tile(gt)->id); >>>>>> + event_params[3] = kasprintf(GFP_KERNEL, "GT_ID=%d", gt->info.id); >>>>> the TILE_ID and GT_ID can be passed for other events as well, with that you can >>>>> have a common function to send uevent which would take reason as an input. >>>> But is that required for all cases? There could be potential cases atleast >>>> in the future where it is not needed. >> >> At least in these cases it makes sense as they (other reasons) >> can be associated to a GT and a Tile. If in future they arises a >> reason where these details are not needed i guess we can handle that. > But then we'll have to modify it with every new addition, which doesn't > look like a win. With current implementation the callers atleast have > the autonomy to send params as-needed. My intention is to have a common code at least for these events as much as possible as all the fields are same except reason. The only place in future that might not have a tile or gt is for gobal (entire device) level event for that I thought we can have a check based on reason type. Otherwise most of the reasons can be associated to a Tile or a GT. Thanks, Aravind, > > Raag