On Wed, 10 Nov 2021 22:03:07 +0100,
Kai Vehmanen wrote:
>
> Fix a corner case between PCI device driver remove callback and
> runtime PM idle callback.
>
> Following sequence of events can happen:
>  - at azx_create, context is allocated with devm_kzalloc() and
>    stored as pci_set_drvdata()
>  - user-space requests to unbind audio driver
>  - dd.c:__device_release_driver() calls PCI remove
>  - pci-driver.c:pci_device_remove() calls the audio
>    driver azx_remove() callback and this is completed
>  - pci-driver.c:pm_runtime_put_sync() leads to a call
>    to rpm_idle() which again calls azx_runtime_idle()
>  - the azx context object, as returned by dev_get_drvdata(),
>    is no longer valid
>    -> access fault in azx_runtime_idle when executing
>       struct snd_card *card = dev_get_drvdata(dev);
>       chip = card->private_data;
>       if (chip->disabled || hda->init_failed)
>
> This was discovered by i915_module_load test with 5.15.0 based
> linux-next tree.
>
> Example log caught by i915_module_load test with linux-next
> https://intel-gfx-ci.01.org/tree/linux-next/
>
> <4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI
> <4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1
> [...]
> <4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel]
> [...]
> <4> [264.038355] Call Trace:
> <4> [264.038359]  <TASK>
> <4> [264.038362]  __rpm_callback+0x3d/0x110
> <4> [264.038371]  rpm_idle+0x27f/0x380
> <4> [264.038376]  __pm_runtime_idle+0x3b/0x100
> <4> [264.038382]  pci_device_remove+0x6d/0xa0
> <4> [264.038388]  device_release_driver_internal+0xef/0x1e0
> <4> [264.038395]  unbind_store+0xeb/0x120
> <4> [264.038400]  kernfs_fop_write_iter+0x11a/0x1c0
>
> Fix the issue by setting drvdata to NULL at end of azx_remove().
>
> Signed-off-by: Kai Vehmanen <kai.vehmanen@xxxxxxxxxxxxxxx>
> ---
>  sound/pci/hda/hda_intel.c | 1 +
>  1 file changed, 1 insertion(+)
>
> Some non-persistent direct links showing the bug trigger on
> different platforms with linux-next 20211109:
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@xxxxxxxxxxx
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@xxxxxxxxxxx
>
> Notably with 20211110 linux-next, the bug does not trigger:
>  - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@xxxxxxxxxxx

Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE?  That would be
the only logical explanation I can think of for now.

In any case, the code change itself looks good, so I took the fix now.


thanks,

Takashi
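
For readers following the thread, the pattern behind the fix can be
sketched in user-space C.  This is only an illustration under stated
assumptions, not the actual patch: every fake_* name below is a
hypothetical stand-in for the kernel objects, and the NULL check in the
idle callback is an assumption of this sketch rather than something
quoted in the thread.  The point it shows is that once remove clears
drvdata, a late idle callback sees NULL instead of a stale pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real ALSA or PCI types. */
struct fake_chip { int disabled; };
struct fake_card { struct fake_chip *private_data; };
struct fake_device { void *drvdata; };

static void fake_set_drvdata(struct fake_device *dev, void *data)
{
	dev->drvdata = data;
}

static void *fake_get_drvdata(const struct fake_device *dev)
{
	return dev->drvdata;
}

/* Probe: allocate the context and publish it through drvdata. */
static int fake_probe(struct fake_device *dev)
{
	struct fake_card *card;

	assert(dev != NULL);
	card = calloc(1, sizeof(*card));
	if (!card)
		return -1;
	card->private_data = calloc(1, sizeof(*card->private_data));
	if (!card->private_data) {
		free(card);
		return -1;
	}
	fake_set_drvdata(dev, card);
	return 0;
}

/* Remove: free the context, then clear drvdata as the last step. */
static void fake_remove(struct fake_device *dev)
{
	struct fake_card *card = fake_get_drvdata(dev);

	if (!card)
		return;
	free(card->private_data);
	free(card);
	fake_set_drvdata(dev, NULL);	/* the step the fix adds */
}

/*
 * Idle callback that may run after remove.  Returns 0 when there is no
 * context, -1 when the chip is disabled, 1 when the context is usable.
 * The NULL check is an assumption of this sketch.
 */
static int fake_runtime_idle(struct fake_device *dev)
{
	struct fake_card *card = fake_get_drvdata(dev);

	if (!card)
		return 0;
	if (card->private_data->disabled)
		return -1;
	return 1;
}
```

Calling fake_probe(), fake_remove() and then fake_runtime_idle() on the
same fake_device mirrors the unbind sequence from the report: without
the fake_set_drvdata(dev, NULL) line the last call would dereference
freed memory, with it the callback returns early.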