On Tue, May 02, 2023 at 01:21:43PM -0400, Rodrigo Vivi wrote: > On Tue, May 02, 2023 at 03:40:50PM +0000, Matthew Brost wrote: > > On Wed, Apr 26, 2023 at 04:57:02PM -0400, Rodrigo Vivi wrote: > > > Unfortunately devcoredump infrastructure does not provide and > > > interface for us to force the device removal upon the pci_remove > > > time of our device. > > > > > > The devcoredump is linked at the device level, so when in use > > > it will prevent the module removal, but it doesn't prevent the > > > call of the pci_remove callback. This callback cannot fail > > > anyway and we end up clearing and freeing the entire pci device. > > > > > > Hence, after we removed the pci device, we shouldn't allow any > > > read or free operations to avoid segmentation fault. > > > > > > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> > > > --- > > > drivers/gpu/drm/xe/xe_devcoredump.c | 19 ++++++++++++++++--- > > > 1 file changed, 16 insertions(+), 3 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c > > > index d9531183f03a..a08929c01b75 100644 > > > --- a/drivers/gpu/drm/xe/xe_devcoredump.c > > > +++ b/drivers/gpu/drm/xe/xe_devcoredump.c > > > @@ -42,6 +42,11 @@ > > > * hang capture. > > > */ > > > > > > +static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump) > > > +{ > > > + return container_of(coredump, struct xe_device, devcoredump); > > > > Confused how still would ever return NULL, can you explain? > > Very good question! I'm honestly still confused myself. > > There's something not quite right with the device relationship that > is getting created with the failling_device and the virtual coredump device. > > Once failing_device is removed, the devcoredump should be removed as well, > or both removals blocked. However this is not what happens. > > On rmmod xe, the device removal is called and we free all xe structs. > The pci device removal is a void function. We cannot fail. The module > removal ends up blocked because the relationship, but that doesn't > saves the day since the device itself is already gone, by the pci > removal function. > > But the devcoredump device is there and active. There's no callback on > devcoredump infra that we could call to force the device removal. Then > any read function will hit a NULL xe device and BOOM! > > This is one of the things I planned to tackle on the devcoredump side > after we get this basic infra in use in our driver. This patch allows > us to be protected from this scenario while we don't fix this at the > devcoredump side. > > It's worth saying that the devcoredump virtual device is auto removed > after a time elapsed... couple minutes? (I can't remember by heart now). > After some serious thought, I think I get this. With that: Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx> But agree this goofy and try to fix this properly after this is merged. > > > > Matt > > > > > +} > > > + > > > static ssize_t xe_devcoredump_read(char *buffer, loff_t offset, > > > size_t count, void *data, size_t datalen) > > > { > > > @@ -51,6 +56,10 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset, > > > struct drm_print_iterator iter; > > > struct timespec64 ts; > > > > > > + /* Our device is gone already... */ > > > + if (!data || !coredump_to_xe(coredump)) > > > + return -ENODEV; > > > + > > > iter.data = buffer; > > > iter.offset = 0; > > > iter.start = offset; > > > @@ -80,12 +89,16 @@ static ssize_t xe_devcoredump_read(char *buffer, loff_t offset, > > > static void xe_devcoredump_free(void *data) > > > { > > > struct xe_devcoredump *coredump = data; > > > - struct xe_device *xe = container_of(coredump, struct xe_device, > > > - devcoredump); > > > + > > > + /* Our device is gone. Nothing to do... */ > > > + if (!data || !coredump_to_xe(coredump)) > > > + return; > > > + > > > mutex_lock(&coredump->lock); > > > > > > coredump->faulty_engine = NULL; > > > - drm_info(&xe->drm, "Xe device coredump has been deleted.\n"); > > > + drm_info(&coredump_to_xe(coredump)->drm, > > > + "Xe device coredump has been deleted.\n"); > > > > > > mutex_unlock(&coredump->lock); > > > } > > > -- > > > 2.39.2 > > >