On 2/5/21 4:35 PM, Keith Busch wrote:
On Fri, Feb 05, 2021 at 03:42:01PM -0500, Andrey Grodzovsky wrote:
On 2/5/21 2:45 PM, Bjorn Helgaas wrote:
On Fri, Feb 05, 2021 at 11:08:45AM -0500, Andrey Grodzovsky wrote:
For user mappings, including MMIO mappings, we have a reliable
approach where we invalidate device address space mappings for all
users at the first sign of device disconnect, and then on all subsequent
page faults from users accessing those ranges we insert a dummy
zero page into their respective page tables. It's actually the
kernel driver, where page faulting cannot be used the way it is for user
space, where I have trouble seeing how to stop it from accessing those
ranges, which have already been released by the PCI subsystem and hence can be
allocated to another hot-plugged device.
That doesn't sound reliable to me, but maybe I don't understand what
you mean by the "first sign of device disconnect."
See functions drm_dev_enter, drm_dev_exit and drm_dev_unplug in drm_drv.c.
At least from a PCI
perspective, the first sign of a surprise hot unplug is likely to be
an MMIO read that returns ~0.
We call drm_dev_unplug in amdgpu_pci_remove and base all later
drm_dev_enter/drm_dev_exit checks on it.
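To make the guard pattern concrete, here is a rough userspace model of it, not the kernel code. All fake_* names are hypothetical stand-ins invented for illustration. As far as I know the real drm_dev_enter/drm_dev_exit are built on SRCU, so drm_dev_unplug also waits for sections that are already in flight; this sketch only models the flag check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; regs models an MMIO BAR mapping. */
struct fake_dev {
	atomic_bool unplugged;
	volatile uint32_t *regs;
};

/* Returns true if the caller may touch the device, false once unplugged. */
static bool fake_dev_enter(struct fake_dev *dev)
{
	assert(dev != NULL);
	return !atomic_load(&dev->unplugged);
}

/* Ends the guarded section; a no-op here because nothing is tracked. */
static void fake_dev_exit(struct fake_dev *dev)
{
	(void)dev;
}

/* What a remove callback would call: fence off all later accesses. */
static void fake_dev_unplug(struct fake_dev *dev)
{
	assert(dev != NULL);
	atomic_store(&dev->unplugged, true);
}

/* Register write that is skipped once the device is marked unplugged. */
static bool fake_write_reg(struct fake_dev *dev, size_t idx, uint32_t val)
{
	if (!fake_dev_enter(dev))
		return false;
	dev->regs[idx] = val;
	fake_dev_exit(dev);
	return true;
}
```

Note that the flag only takes effect once fake_dev_unplug has run, so a write issued before that point still goes to the old range.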
It sounds like you are talking about an orderly notified unplug rather
than a surprise hot unplug. If it's a surprise, the code doesn't get to
fence off future MMIO access until well after the address range is
already unreachable.
I am referring to a surprise unplug for which we get a notification from the PCI
subsystem, which ends up calling our pci_driver.remove callback. I understand
there is a window of time in which we are not yet notified but all our MMIO
accesses already fail because the device is physically gone by that point.
Andrey