Re: Avoid MMIO write access after hot unplug [WAS - Re: Question about supporting AMD eGPU hot plug case]

Keith Busch <kbusch@xxxxxxxxxx> · Fri, 5 Feb 2021 13:35:34 -0800

On Fri, Feb 05, 2021 at 03:42:01PM -0500, Andrey Grodzovsky wrote:
> On 2/5/21 2:45 PM, Bjorn Helgaas wrote:
> > On Fri, Feb 05, 2021 at 11:08:45AM -0500, Andrey Grodzovsky wrote:
> > > 
> > > For user mappings, including MMIO mappings, we have a reliable
> > > approach where we invalidate device address space mappings for all
> > > users on the first sign of device disconnect, and then on all
> > > subsequent page faults from users accessing those ranges we insert
> > > a dummy zero page into their respective page tables. It's actually
> > > the kernel driver, where page faulting cannot be used as it is for
> > > user space, where I have issues with how to prevent it from
> > > continuing to access those ranges, which have already been released
> > > by the PCI subsystem and hence can be allocated to another
> > > hot-plugged device.
> > 
> > That doesn't sound reliable to me, but maybe I don't understand what
> > you mean by the "first sign of device disconnect."
> 
> See functions drm_dev_enter, drm_dev_exit and drm_dev_unplug in drm_drv.c
> 
> > At least from a PCI
> > perspective, the first sign of a surprise hot unplug is likely to be
> > an MMIO read that returns ~0.
> 
> We call drm_dev_unplug in amdgpu_pci_remove and base all later checks
> with drm_dev_enter/drm_dev_exit on this.

It sounds like you are talking about an orderly notified unplug rather
than a surprise hot unplug. If it's a surprise, the code doesn't get to
fence off future MMIO access until well after the address range is
already unreachable.
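
To illustrate the ordering problem, here is a minimal user-space sketch,
not kernel code. The names fake_dev, fake_read32 and fake_write32 are
hypothetical and exist only for this example. It treats an all-ones read
as a likely sign of removal and fences off later accesses. Note that the
flag is only set once a read has already hit the dead range, so any
write issued before that point still goes out. A real driver would also
need to confirm removal some other way, since all-ones can be a
legitimate register value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
	volatile uint32_t *mmio;	/* mapped register window */
	bool unplugged;			/* set once removal is noticed */
};

/*
 * Read a register. An all-ones value is treated as a likely sign of
 * surprise removal; the device is then marked unplugged.
 * Returns false if the device is, or has just been found to be, gone.
 */
static bool fake_read32(struct fake_dev *dev, unsigned int idx, uint32_t *val)
{
	uint32_t v;

	assert(dev && val);
	if (dev->unplugged)
		return false;

	v = dev->mmio[idx];
	if (v == UINT32_MAX) {
		dev->unplugged = true;
		return false;
	}
	*val = v;
	return true;
}

/*
 * Write a register unless removal has already been noticed. Writes
 * issued before detection are not prevented by this check.
 */
static bool fake_write32(struct fake_dev *dev, unsigned int idx, uint32_t val)
{
	assert(dev);
	if (dev->unplugged)
		return false;

	dev->mmio[idx] = val;
	return true;
}
```

The check in fake_write32 only helps after the flag is set, which is the
same limitation as fencing on drm_dev_unplug from the remove callback.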