Re: [PATCH 0/6] RFC Support hot device unplug in amdgpu

Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> · Mon, 11 May 2020 12:37:40 -0400

    On 5/11/20 5:26 AM, Pekka Paalanen
      wrote:

      On Sat, 9 May 2020 14:51:44 -0400
Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> wrote:

        This RFC is a more of a proof of concept then a fully working
solution as there are a few unresolved issues we are hopping to get
advise on from people on the mailing list. Until now extracting a
card either by physical extraction (e.g. eGPU with thunderbold
connection or by emulation through syfs
-> /sys/bus/pci/devices/device_id/remove) would cause random crashes
in user apps. The random crashes in apps were mostly due to the app
having mapped a device backed BO into it's adress space was still
trying to access the BO while the backing device was gone. To answer
this first problem Christian suggested to fix the handling of mapped
memory in the clients when the device goes away by forcibly unmap all
buffers the user processes has by clearing their respective VMAs
mapping the device BOs. Then when the VMAs try to fill in the page
tables again we check in the fault handler if the device is removed
and if so, return an error. This will generate a SIGBUS to the
application which can then cleanly terminate. This indeed was done
but this in turn created a problem of kernel OOPs were the OOPSes
were due to the fact that while the app was terminating because of
the SIGBUS it would trigger use after free in the driver by calling
to accesses device structures that were already released from the pci
remove sequence. This we handled by introducing a 'flush' seqence
during device removal were we wait for drm file reference to drop to
0 meaning all user clients directly using this device terminated.
With this I was able to cleanly emulate device unplug with X and
glxgears running and later emulate device plug back and restart of X
and glxgears.

But this use case is only partial and as I see it all the use cases
are as follwing and the questions it raises.

1) Application accesses a BO by opening drm file
	1.1) BO is mapped into applications address space (BO is CPU visible) - this one we have a solution for by invaldating BO's CPU mapping casuing SIGBUS 
	     and termination and waiting for drm file refcound to drop to 0 before releasing the device
	1.2) BO is not mapped into applcation address space (BO is CPU invisible) - no solution yet because how we force the application to terminate in this case ?

2) Application accesses a BO by importing a DMA-BUF
	2.1)  BO is mapped into applications address space (BO is CPU visible) - solution is same as 1.1 but instead of waiting for drm file release we wait for the 
	      imported dma-buf's file release
	2.2)  BO is not mapped into applcation address space (BO is CPU invisible) - our solution is to invalidate GPUVM page tables and destroy backing storage for 
              all exported BOs which will in turn casue VM faults in the importing device and then when the importing driver will try to re-attach the imported BO to 
	      update mappings we return -ENODEV in the import hook which hopeffuly will cause the user app to terminate.

3) Applcation opens a drm file or imports a dma-bud and holds a reference but never access any BO or does access but never more after device was unplug - how would we 
   force this applcation to termiante before proceeding with device removal code ? Otherwise the wait in pci remove just hangs for ever.

The attached patches adress 1.1, 2.1 and 2.2, for now only 1.1 fully tested and I am still testing the others but I will be happy for any advise on all the 
described use cases and maybe some alternative and better (more generic) approach to this like maybe obtaining PIDs of relevant processes through some revere 
mapping from device file and exported dma-buf files and send them SIGKILL - would this make more sense or any other method ? 

Patches 1-3 address 1.1
Patch 4 addresses 2.1
Pathces 5-6 address 2.2

Reference: https://nam11.safelinks.protection.outlook.com/?url="">

      Hi,

how did you come up with the goal "make applications terminate"? Is
that your end goal, or is it just step 1 of many on the road of
supporting device hot-unplug?

    Just as an effort to improve the current situation where we have
      unexpected random crashes following device removal.

Why do you want to terminate also applications that don't "need" to
terminate? Why hunt them down? I'm referring to your points 1.2, 2.2
and 3.

    Because when those applications do exit and since they hold a
      reference to drm device through their open device file descriptor
      or dma-buf file descriptor we end up in use after free situation
      where during pci remove we already released everything but now
      last drm_dev_put release callback is trying to access those
      released structures. Any way, as you and Daniel pointed out
      forcing termination is a bad approach.  Seems we need to actually
      keep all the drm structures around until the very last device
      reference is dropped while in the meatime returning error code for
      any new IOCTLs and rerouting any  page fault to zero page. 

    Thanks  for the detailed response. 

    Andrey

>From an end user perspective, I believe making applications terminate
is not helpful at all. Your display server still disappears, which
means all your apps are forced to quit, and you lose your desktop. I do
understand that a graceful termination is better than a hard lockup,
but not much.

When I've talked about DRM device hot-unplug with Daniel Vetter, our
shared opinion seems to be that the unplug should not outright kill any
programs that are prepared to handle errors, that is, functions or
ioctls that return a success code can return an error, and then it is
up for the application to decide how to handle that. The end goal must
not be to terminate all applications that had something to do with the
device. At the very least the display server must survive.

The rough idea on how that should work is that DRM ioctls start
returning errors and all mmaps are replaced with something harmless
that does not cause a SIGBUS. Userspace can handle the errors if it
wants to, and display servers will react to the device removed uevent
if not earlier.

Why deliberately avoid raising SIGBUS? Because it is such a huge pain
to handle due to the archaic design of how signals are delivered. Most
of existing userspace is also not prepared to handle SIGBUS anywhere.

The problem of handling SIGBUS at all is that a process can only have a
single signal handler per signal, but the process may comprise of
multiple components that cannot cooperate on signal catching: Mesa GPU
drivers, GUI toolkits, and the application itself may all do some
things that would require handling SIGBUS if removing a DRM device
raised it. For Mesa to cooperate with SIGBUS handling with the other
components in the process, we'd need some whole new APIs, an EGL
extension and maybe Vulkan extension too. The process may also have
threads, which are really painful with signals. What if you need to
handle the SIGBUS differently in different threads?

Hence, mmaps should be replaced with something harmless, maybe
something that reads back all zeros and ignores writes. The application
will learn later that the DRM device is gone. Sending it a SIGBUS on
the spot when it accesses an mmap does not help: the memory is gone
already - if you didn't have a backup of the contents, you're not going
to make one now.

My point here is, are you designing things to specifically only
terminate processes, or will you leave room in the design to improve the
implementation towards a proper handling of DRM device hot-unplug?

Thanks,
pq

        Andrey Grodzovsky (6):
  drm/ttm: Add unampping of the entire device address space
  drm/amdgpu: Force unmap all user VMAs on device removal.
  drm/amdgpu: Wait for all user clients
  drm/amdgpu: Wait for all clients importing out dma-bufs.
  drm/ttm: Add destroy flag in TTM BO eviction interface
  drm/amdgpu: Use TTM MMs destroy interface

 drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  7 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 27 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c     | 22 ++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  9 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c     |  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  | 17 +++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h  |  1 +
 drivers/gpu/drm/nouveau/nouveau_drm.c       |  2 +-
 drivers/gpu/drm/qxl/qxl_object.c            |  4 +-
 drivers/gpu/drm/radeon/radeon_object.c      |  2 +-
 drivers/gpu/drm/ttm/ttm_bo.c                | 63 +++++++++++++++++++++--------
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c         |  6 +--
 include/drm/ttm/ttm_bo_api.h                |  2 +-
 include/drm/ttm/ttm_bo_driver.h             |  2 +
 16 files changed, 139 insertions(+), 34 deletions(-)

      _______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url="">

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel