On Wed, Nov 25, 2020 at 12:39:47PM -0500, Andrey Grodzovsky wrote: > > On 11/25/20 4:04 AM, Daniel Vetter wrote: > > On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky > > <Andrey.Grodzovsky@xxxxxxx> wrote: > > > > > > On 11/24/20 9:49 AM, Daniel Vetter wrote: > > > > On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote: > > > > > Avoids NULL ptr due to kobj->sd being unset on device removal. > > > > > > > > > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > > > > > --- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++- > > > > > 2 files changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > index caf828a..812e592 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > > > > @@ -27,6 +27,7 @@ > > > > > #include <linux/uaccess.h> > > > > > #include <linux/reboot.h> > > > > > #include <linux/syscalls.h> > > > > > +#include <drm/drm_drv.h> > > > > > > > > > > #include "amdgpu.h" > > > > > #include "amdgpu_ras.h" > > > > > @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev) > > > > > .attrs = attrs, > > > > > }; > > > > > > > > > > - sysfs_remove_group(&adev->dev->kobj, &group); > > > > > + if (!drm_dev_is_unplugged(&adev->ddev)) > > > > > + sysfs_remove_group(&adev->dev->kobj, &group); > > > > This looks wrong. sysfs, like any other interface, should be > > > > unconditionally thrown out when we do the drm_dev_unregister. Whether > > > > hotunplugged or not should matter at all. Either this isn't needed at all, > > > > or something is wrong with the ordering here. But definitely fishy. > > > > -Daniel > > > > > > So technically this is needed because kobejct's sysfs directory entry kobj->sd > > > is set to NULL > > > on device removal (from sysfs_remove_dir) but because we don't finalize the device > > > until last reference to drm file is dropped (which can happen later) we end up > > > calling sysfs_remove_file/dir after > > > this pointer is NULL. sysfs_remove_file checks for NULL and aborts while > > > sysfs_remove_dir > > > is not and that why I guard against calls to sysfs_remove_dir. > > > But indeed the whole approach in the driver is incorrect, as Greg pointed out - > > > we should use > > > default groups attributes instead of explicit calls to sysfs interface and this > > > would save those troubles. > > > But again. the issue here of scope of work, converting all of amdgpu to default > > > groups attributes is somewhat > > > lengthy process with extra testing as the entire driver is papered with sysfs > > > references and seems to me more of a standalone > > > cleanup, just like switching to devm_ and drmm_ work. To me at least it seems > > > that it makes more sense > > > to finalize and push the hot unplug patches so that this new functionality can > > > be part of the driver sooner > > > and then incrementally improve it by working on those other topics. Just as > > > devm_/drmm_ I also added sysfs cleanup > > > to my TODO list in the RFC patch. > > Hm, whether you solve this with the default group stuff to > > auto-remove, or remove explicitly at the right time doesn't matter > > much. The underlying problem you have here is that it's done way too > > late. > > As far as I understood correctly the default group attrs by reading this > article by Greg - https://www.linux.com/news/how-create-sysfs-file-correctly/ > it will be removed together with the device and not too late like now and I quote > from the last paragraph there: > > "By setting this value, you don’t have to do anything in your > probe() or release() functions at all in order for the > sysfs files to be properly created and destroyed whenever your > device is added or removed from the system. And you will, most > importantly, do it in a race-free manner, which is always a good thing." > > To me this seems like the best solution to the late remove issue. What do > you think ? > > > > sysfs removal (like all uapi interfaces) need to be removed as > > part of drm_dev_unregister. > > > Do you mean we need to trace and aggregate all sysfs files creation within > the low level drivers and then call some sysfs release function inside > drm_dev_unregister > to iterate and release them all ? That would just reinvent the proper solution Greg explained above. For now I think you just need some driver callback that you call right after drm_dev_unplug (or drm_dev_unregister) to clean up these sysfs interfaces. Afaiui the important part is to clean up your additional interfaces from the ->remove callback, since at that point the core sysfs stuff still exists. Maybe you want to do another loop over all IP blocks and a ->unregister callback, or maybe it's just 1-2 cases you call directly. > > I guess aside from the split into fini_hw > > and fini_sw, you also need an unregister_late callback (like we have > > already for drm_connector, so that e.g. backlight and similar stuff > > can be unregistered). > > > Is this the callback you suggest to call from within drm_dev_unregister and > it will be responsible to release all sysfs files created within the driver ? Nah that would be an amdgpu ip block callback (forgot what it's called, too comfy to fire up an editor right now and look it up, but you have a bunch of these loops all over). I think the core solution we want is what Greg already laid out. This idea here was just an amdgpu interim plan, if the core solution is a bit too invasive to implement right away. -Daniel > > Andrey > > > > > > Papering over the underlying bug like this doesn't really fix much, > > the lifetimes are still wrong. > > -Daniel > > > > > Andrey > > > > > > > > > > > return 0; > > > > > } > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > index 2b7c90b..54331fc 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c > > > > > @@ -24,6 +24,7 @@ > > > > > #include <linux/firmware.h> > > > > > #include <linux/slab.h> > > > > > #include <linux/module.h> > > > > > +#include <drm/drm_drv.h> > > > > > > > > > > #include "amdgpu.h" > > > > > #include "amdgpu_ucode.h" > > > > > @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev) > > > > > > > > > > void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev) > > > > > { > > > > > - sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > > > > > + if (!drm_dev_is_unplugged(&adev->ddev)) > > > > > + sysfs_remove_group(&adev->dev->kobj, &fw_attr_group); > > > > > } > > > > > > > > > > static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev, > > > > > -- > > > > > 2.7.4 > > > > > > > > > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx