How to gracefully handle pci remove

daniel@xxxxxxxx (Daniel Vetter) · Thu, 30 Aug 2018 10:42:58 +0200



On Wed, Aug 29, 2018 at 8:28 PM, Andrey Grodzovsky
<Andrey.Grodzovsky at amd.com> wrote:
> Actually, I've just spotted this drm_dev_unplug. Does it make sense to use
> it in our pci_driver.remove hook instead of explicitly doing
> drm_dev_unregister and drm_dev_put(dev)?
>
> This way at least any following IOCTL will fail with ENODEV.

Definitely. The problem is still that the refcounting beyond the
drm_device is totally screwed up, and your kernel will Oops
eventually.
-Daniel
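
To make the exchange above concrete, here is a small userspace toy model of
the idea: a remove path marks the device as unplugged so that later ioctls
fail with ENODEV, while open file descriptors keep the object alive until
they are closed. Every name in it (toy_dev, toy_remove, toy_ioctl, ...) is
hypothetical and invented for this sketch. It is not the DRM API and not how
drm_dev_unplug is implemented; it only illustrates the intended behaviour
under those assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a device object; NOT struct drm_device. */
struct toy_dev {
    int refcount;   /* one reference for the driver, one per open file */
    bool unplugged; /* set once by the remove path, never cleared */
};

/* Counts frees so the object's lifetime can be observed from outside. */
int toy_devs_freed;

struct toy_dev *toy_dev_create(void)
{
    struct toy_dev *dev = calloc(1, sizeof(*dev));

    if (dev)
        dev->refcount = 1; /* the driver's own reference */
    return dev;
}

void toy_dev_put(struct toy_dev *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0) {
        free(dev);
        toy_devs_freed++;
    }
}

/* A client opening the device takes a reference, unless it is gone. */
int toy_open(struct toy_dev *dev)
{
    if (dev->unplugged)
        return -ENODEV;
    dev->refcount++;
    return 0;
}

/* A client closing its file descriptor drops its reference. */
void toy_release(struct toy_dev *dev)
{
    toy_dev_put(dev);
}

/* Any ioctl issued after removal fails instead of touching hardware. */
int toy_ioctl(struct toy_dev *dev)
{
    return dev->unplugged ? -ENODEV : 0;
}

/* Hypothetical remove hook: flag the device, drop the driver's reference. */
void toy_remove(struct toy_dev *dev)
{
    dev->unplugged = true;
    toy_dev_put(dev);
}
```

With one client holding the device open, toy_remove() leaves the object
allocated, toy_ioctl() starts returning -ENODEV, and the memory is freed only
when that client calls toy_release(). The model counts references on the
device object alone. It says nothing about buffers, dma_buf or dma_fence
objects that outlive the device, which is exactly the part described above as
broken.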

>
> Andrey
>
>
> On 08/29/2018 11:07 AM, Daniel Vetter wrote:
>>
>> On Wed, Aug 29, 2018 at 4:43 PM, Andrey Grodzovsky
>> <Andrey.Grodzovsky at amd.com> wrote:
>>>
>>> Just another ping...
>>>
>>> Daniel, Dave - maybe you could give some advice on that?
>>>
>>> P.S. I tried the same thing with an Intel card (i915 driver) on a 4.18.1
>>> kernel to get a reference point, but it just hung.
>>
>> drm_device hot-unplug is de facto unsolved. We've only just started to
>> fix the most obvious races around the refcounting of drm_device
>> itself, see the work from Noralf Tronnes around drm_dev_get/put.
>>
>> No one has even started to think about what it would take to correctly
>> refcount a full-blown memory manager to handle hotunplug. I'd expect
>> lots of nightmares. The real horror is that it's not just the
>> drm_device, but also lots of things we're exporting: dma_buf,
>> dma_fence, ... All of that must be handled one way or the other.
>>
>> So expect your kernel to Oops when you unplug a device.
>>
>> Wrt userspace handling this: Probably an even bigger question. No
>> idea, and will depend upon what userspace you're running.
>> -Daniel
>>
>>> Andrey
>>>
>>>
>>>
>>>
>>> On 08/27/2018 12:04 PM, Andrey Grodzovsky wrote:
>>>>
>>>> Hi everybody, I am trying to resolve various problems I observe when
>>>> logically removing an AMDGPU device from PCI - echo 1 >
>>>> /sys/class/drm/card0/device/remove
>>>>
>>>> One of the problems I encountered was hitting WARNs in
>>>> amdgpu_gem_force_release. It complains about still-open client FDs and
>>>> BO allocations, which is expected since we didn't let user space clients
>>>> know about the device removal and hence they won't release allocations
>>>> and won't close their FDs.
>>>>
>>>> Question - how do other drivers handle this use case, especially eGPUs,
>>>> since they may indeed be extracted at any moment? Is there any way to
>>>> notify Xorg and other clients about this so they have a chance to
>>>> release all their allocations and probably terminate? Maybe some kind
>>>> of uevent?
>>>>
>>>> Andrey
>>>>
>>
>>
>

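On the uevent question quoted above: the kernel sends kobject uevents to
userspace as a header of the form action@devpath followed by NUL-separated
KEY=VALUE fields. The sketch below only shows how a client could recognise a
"remove" event for the drm subsystem in such a buffer once it has received
one (for example from a NETLINK_KOBJECT_UEVENT socket or through libudev).
The helper names uevent_get and uevent_is_drm_remove are hypothetical, and
how a real client such as Xorg should react to the event is left open, as in
the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up KEY in a buffer of NUL-terminated "KEY=VALUE" fields.
 * Returns a pointer to the value inside buf, or NULL if the key is absent.
 * Fields without a terminating NUL inside the buffer are ignored.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
    size_t klen = strlen(key);
    size_t off = 0;

    while (off < len) {
        const char *field = buf + off;
        const char *end = memchr(field, '\0', len - off);
        size_t flen;

        if (!end)
            break; /* truncated last field, do not read past the buffer */
        flen = (size_t)(end - field);
        if (flen > klen && strncmp(field, key, klen) == 0 &&
            field[klen] == '=')
            return field + klen + 1;
        off += flen + 1;
    }
    return NULL;
}

/* True if the message reports removal of a device in the drm subsystem. */
bool uevent_is_drm_remove(const char *buf, size_t len)
{
    const char *action = uevent_get(buf, len, "ACTION");
    const char *subsys = uevent_get(buf, len, "SUBSYSTEM");

    return action && subsys &&
           strcmp(action, "remove") == 0 && strcmp(subsys, "drm") == 0;
}
```

A client that sees such an event could then close its file descriptors and
drop its buffers. That only helps well-behaved clients; it does not change
the kernel-side lifetime problems discussed in this thread.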

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch