How to gracefully handle pci remove

Andrey.Grodzovsky@xxxxxxx (Andrey Grodzovsky) · Wed, 29 Aug 2018 11:32:41 -0400

Thanks.

Andrey

On 08/29/2018 11:07 AM, Daniel Vetter wrote:
> On Wed, Aug 29, 2018 at 4:43 PM, Andrey Grodzovsky
> <Andrey.Grodzovsky at amd.com> wrote:
>> Just another ping...
>>
>> Daniel, Dave - maybe you could give some advice on that?
>>
>> P.S. I tried the same with an Intel card (i915 driver) on a 4.18.1 kernel
>> to get a reference point, but it just hung.
> drm_device hot-unplug is de facto unsolved. We've only just started to
> fix the most obvious races around the refcounting of drm_device
> itself, see the work from Noralf Tronnes around drm_dev_get/put.
>
> No one has even started to think about what it would take to correctly
> refcount a full-blown memory manager to handle hotunplug. I'd expect
> lots of nightmares. The real horror is that it's not just the
> drm_device, but also lots of things we're exporting: dma_buf,
> dma_fence, ... All of that must be handled one way or the other.
>
> So expect your kernel to Oops when you unplug a device.
>
> Wrt userspace handling this: Probably an even bigger question. No
> idea, and will depend upon what userspace you're running.
> -Daniel
>
>> Andrey
>>
>>
>>
>>
>> On 08/27/2018 12:04 PM, Andrey Grodzovsky wrote:
>>> Hi everybody, I am trying to resolve various problems I observe when
>>> logically removing an AMDGPU device from PCI - echo 1 >
>>> /sys/class/drm/card0/device/remove
>>>
>>> One of the problems I encountered was hitting WARNs in
>>> amdgpu_gem_force_release. It complains about still-open client FDs and BO
>>> allocations, which is expected since we didn't let user space clients know
>>> about the device removal, and hence they won't release their allocations
>>> or close their FDs.
>>>
>>> Question - how do other drivers handle this use case, especially eGPUs,
>>> since they may indeed be removed at any moment? Is there any way to notify
>>> Xorg and other clients about this so they have a chance to release all
>>> their allocations and probably terminate? Maybe some kind of uevent?
>>>
>>> Andrey
>>>
>
>
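
To make the "some kind of uevent" idea from the original question concrete,
here is a minimal sketch of a user space client watching for DRM device
removal. It is not something proposed in this thread. It assumes the kernel
emits a "remove" uevent with SUBSYSTEM=drm when the card goes away, and it
reads raw kernel uevents from a netlink socket instead of going through
libudev. The helper names (parse_uevent, is_drm_remove, listen) are
hypothetical, and the netlink protocol number is hard-coded from
<linux/netlink.h>.

```python
import socket

# Assumption: value of NETLINK_KOBJECT_UEVENT from <linux/netlink.h>.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(data: bytes) -> dict:
    """Parse a raw kernel uevent into a dict of its KEY=VALUE fields.

    The message is a NUL-separated list; the first entry
    ("action@devpath") has no '=' and is skipped.
    """
    fields = {}
    for part in data.split(b"\0"):
        key, sep, value = part.partition(b"=")
        if sep:
            fields[key.decode(errors="replace")] = value.decode(errors="replace")
    return fields


def is_drm_remove(fields: dict) -> bool:
    """True if the event reports removal of a device in the drm subsystem."""
    return fields.get("SUBSYSTEM") == "drm" and fields.get("ACTION") == "remove"


def listen() -> None:
    """Block on the kernel uevent multicast group and report DRM removals."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    # pid 0 lets the kernel pick an address; group 1 is the kernel uevent group.
    sock.bind((0, 1))
    while True:
        fields = parse_uevent(sock.recv(8192))
        if is_drm_remove(fields):
            # A real client would release its BOs and close its FDs here.
            print("DRM device removed:", fields.get("DEVPATH"))


if __name__ == "__main__":
    listen()
```

Note that this only tells a client that removal has been reported. Whether
the event arrives early enough for clients to release their allocations
before the driver tears down is exactly the open question discussed above.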