On Fri, Feb 5, 2021 at 10:24 PM Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> wrote:
>
> On 2/5/21 3:49 PM, Daniel Vetter wrote:
> > On Fri, Feb 5, 2021 at 9:42 PM Andrey Grodzovsky
> > <Andrey.Grodzovsky@xxxxxxx> wrote:
> >>
> >> On 2/5/21 2:45 PM, Bjorn Helgaas wrote:
> >>> On Fri, Feb 05, 2021 at 11:08:45AM -0500, Andrey Grodzovsky wrote:
> >>>> On 2/5/21 10:38 AM, Bjorn Helgaas wrote:
> >>>>> On Thu, Feb 04, 2021 at 11:03:10AM -0500, Andrey Grodzovsky wrote:
> >>>>>> + linux-pci mailing list and a bit of extra detail below.
> >>>>>>
> >>>>>> On 2/2/21 12:51 PM, Andrey Grodzovsky wrote:
> >>>>>>> Bjorn, Sergey, I would also like to use this opportunity to ask you a
> >>>>>>> question on a related topic - hot unplug.
> >>>>>>> I've been working on this for a while (see the latest patch set here:
> >>>>>>> https://lists.freedesktop.org/archives/amd-gfx/2021-January/058595.html).
> >>>>>>> My question is specifically about this patch:
> >>>>>>> https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html
> >>>>>>> - the idea here is to prevent any accesses to MMIO address ranges
> >>>>>>> still mapped in kernel page
> >>>>>
> >>>>> I think you're talking about a PCI BAR that is mapped into a user
> >>>>> process.
> >>>>
> >>>> For user mappings, including MMIO mappings, we have a reliable
> >>>> approach where we invalidate the device address space mappings of all
> >>>> users on the first sign of device disconnect, and then on all subsequent
> >>>> page faults from users accessing those ranges we insert a dummy
> >>>> zero page into their respective page tables. It's the kernel driver,
> >>>> where no page faulting can be used the way it can for user space,
> >>>> where I have trouble protecting against continued accesses to ranges
> >>>> which have already been released by the PCI subsystem and hence can be
> >>>> allocated to another hot-plugged device.
> >>>
> >>> That doesn't sound reliable to me, but maybe I don't understand what
> >>> you mean by the "first sign of device disconnect."
> >>
> >> See functions drm_dev_enter, drm_dev_exit and drm_dev_unplug in drm_drv.c.
> >>
> >>> At least from a PCI
> >>> perspective, the first sign of a surprise hot unplug is likely to be
> >>> an MMIO read that returns ~0.
> >>
> >> We set drm_dev_unplug in amdgpu_pci_remove and base all later checks
> >> with drm_dev_enter/drm_dev_exit on this.
> >>
> >>> It's true that the hot unplug will likely cause an interrupt and we
> >>> *may* be able to unbind the driver before the driver or a user program
> >>> performs an MMIO access, but there's certainly no guarantee.
> >>> The following sequence is always possible:
> >>>
> >>> - User unplugs PCIe device
> >>> - Bridge raises hotplug interrupt
> >>> - Driver or userspace issues MMIO read
> >>> - Bridge responds with error because link to device is down
> >>> - Host bridge receives error, fabricates ~0 data to CPU
> >>> - Driver or userspace sees ~0 data from MMIO read
> >>> - PCI core fields hotplug interrupt and unbinds driver
> >>>
> >>>>> It is impossible to reliably *prevent* MMIO accesses to a BAR on a
> >>>>> PCI device that has been unplugged. There is always a window
> >>>>> where the CPU issues a load/store and the device is unplugged
> >>>>> before the load/store completes.
> >>>>>
> >>>>> If a PCIe device is unplugged, an MMIO read to that BAR will
> >>>>> complete on PCIe with an uncorrectable fatal error. When that
> >>>>> happens there is no valid data from the PCIe device, so the PCIe
> >>>>> host bridge typically fabricates ~0 data so the CPU load
> >>>>> instruction can complete.
> >>>>>
> >>>>> If you want to reliably recover from unplugs like this, I think
> >>>>> you have to check for that ~0 data at the important points, i.e.,
> >>>>> where you make decisions based on the data. Of course, ~0 may be
> >>>>> valid data in some cases. You have to use your knowledge of the
> >>>>> device programming model to determine whether ~0 is possible, and
> >>>>> if it is, check in some other way, like another MMIO read, to tell
> >>>>> whether the read succeeded and returned ~0, or failed because of a
> >>>>> PCIe error and returned ~0.
> >>>>
> >>>> Looks like there is a high performance price to pay for this
> >>>> approach if we protect at every possible junction (e.g. the register
> >>>> read/write accessors). I tested this by doing 1M reads/writes while
> >>>> using drm_dev_enter/drm_dev_exit, which is DRM's RCU-based guard
> >>>> against device unplug, and even then we hit a performance penalty of
> >>>> 40%. I assume that an actual MMIO read (e.g.
> >>>> pci_device_is_present) would cause a much larger performance
> >>>> penalty.
> >>>
> >>> I guess you have to decide whether you want a fast 90% solution or a
> >>> somewhat slower 100% reliable solution :)
> >>>
> >>> I don't think the checking should be as expensive as you're thinking.
> >>> You only have to check if:
> >>>
> >>> (1) you're doing an MMIO read (there's no response for MMIO writes,
> >>>     so you can't tell whether they succeed),
> >>> (2) the MMIO read returns ~0,
> >>> (3) ~0 might be a valid value for the register you're reading, and
> >>> (4) the read value is important to your control flow.
> >>>
> >>> For example, if you do a series of reads and act on the data after all
> >>> the reads complete, you only need to worry about whether the *last*
> >>> read failed. If that last read is to a register that is known to
> >>> contain a zero bit, no additional MMIO read is required and the
> >>> checking is basically free.
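(For illustration only, the kind of ~0 check Bjorn describes above could look
roughly like the sketch below. The names my_readl_checked and MY_ID_REG are
made up for the example; they are not amdgpu's real accessors or registers.)

#include <linux/io.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical register that can never legitimately read back as ~0,
 * e.g. an ID/revision register with known zero bits. */
#define MY_ID_REG 0x0

static int my_readl_checked(void __iomem *mmio, u32 offset, u32 *val)
{
	*val = readl(mmio + offset);

	/* Anything other than all-ones cannot have come from a dead device. */
	if (*val != 0xffffffff)
		return 0;

	/* ~0 might be valid data for this register, so confirm with a second
	 * read of a register that is known to contain zero bits. */
	if (readl(mmio + MY_ID_REG) != 0xffffffff)
		return 0;	/* device still present, the ~0 was real data */

	return -ENODEV;		/* link is down, the ~0 was fabricated */
}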
> >> I am more worried about MMIO writes, i.e. avoiding writes to a BAR
> >> of a newly 'just' plugged in device that accidentally got allocated some
> >> part of the MMIO address range that our 'ghost' device is still using.
> >
> > We have to shut all these down before the ->remove callback has
> > finished. At least that's my understanding; before that's all done the
> > pci subsystem won't reallocate anything.
> >
> > The drm_dev_enter/exit is only meant to lock out entire code paths
> > when we know the device is gone, so that we don't spend an eternity
> > timing out every access and running lots of code for no reason. The
> > other reason for drm_dev_exit is to guard the sw side from access to
> > data structures which we tear down (or have to tear down) as part of the
> > ->remove callback, like io remaps, maybe timers/workers driving hw
> > access and all these things. Only the uapi objects need to stay around
> > until userspace has closed all the fds/unmapped all the mappings, so
> > that we don't oops all over the place.
> >
> > Of course auditing a big driver like amdgpu for this is a massive pain,
> > so where/how you sprinkle drm_dev_enter/exit over it is a bit of an
> > engineering tradeoff.
> >
> > Either way (if my understanding is correct), after ->remove is
> > finished, your driver _must_ guarantee that all access to mmio ranges
> > has stopped. How you do that is kinda up to you.
> >
> > But also like Bjorn pointed out, _while_ the hotunplug is happening,
> > i.e. at any moment between when transactions start failing up to and
> > including your driver's ->remove callback having finished, you need to
> > be able to cope with the bogus all-0xff replies in your code. But that's
> > kinda an orthogonal requirement: if you've fixed the use-after-free you
> > might still crash on 0xff, and even if you've fixed all the crashes due
> > to 0xff reads you might still have lots of use-after-frees around.
> >
> > Cheers, Daniel
>
> I see, so, to summarize and confirm I got this correct:
> we drop the per-register access guards, as this is overkill and hurts performance.
> Post .remove callback I verify that we guard any IOCTL (already done), block GPU
> scheduler threads (already done), disable interrupts (already done), guard
> against any direct IB submissions to HW rings (not done yet), and cancel and
> flush any background running threads (mostly timers, probably) (not done yet)
> which might trigger HW programming.
>
> Correct?

Yeah. Well, for a clean design I think your ->remove callback should guarantee
that all background tasks your driver has (timers, workers, whatever) are
completely stopped, with drm_dev_enter/exit maybe only guarding the re-arming
of these (since if you stop the timer/kthread, maybe the data structure that
contained them is also gone entirely with kfree()). But these are
implementation details which might go one way or the other.
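In code, that pattern could look roughly like the following untested sketch
(struct my_dev and its recovery_work are made-up names for illustration, not
amdgpu's actual structures):

#include <linux/pci.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <drm/drm_device.h>
#include <drm/drm_drv.h>

/* Hypothetical driver data. */
struct my_dev {
	struct drm_device ddev;
	struct delayed_work recovery_work;
};

static void my_recovery_work(struct work_struct *work)
{
	struct my_dev *mdev = container_of(work, struct my_dev,
					   recovery_work.work);
	int idx;

	/* Bail out, and in particular do not re-arm, once drm_dev_unplug()
	 * has been called. */
	if (!drm_dev_enter(&mdev->ddev, &idx))
		return;

	/* ... touch the hardware ... */

	/* Re-arm only while the device is still present. */
	schedule_delayed_work(&mdev->recovery_work, HZ);
	drm_dev_exit(idx);
}

static void my_pci_remove(struct pci_dev *pdev)
{
	struct my_dev *mdev = pci_get_drvdata(pdev);

	drm_dev_unplug(&mdev->ddev);	/* new drm_dev_enter() calls now fail */
	cancel_delayed_work_sync(&mdev->recovery_work);	/* no more background hw access */
	/* ... disable interrupts, stop the scheduler, unmap MMIO, etc. ... */
}

The important part is the ordering in remove: mark the device unplugged first,
then synchronously cancel anything that could still be touching registers.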
-Daniel

>
> Andrey
>
>
> >> Andrey
> >>
> >>>
> >>>>>>> table by the driver AFTER the device is gone physically and
> >>>>>>> from the PCI subsystem, post pci_driver.remove callback
> >>>>>>> execution. This happens because struct device (and struct
> >>>>>>> drm_device, representing the graphics card) are still present,
> >>>>>>> because some user clients which are not aware of the hot removal
> >>>>>>> still hold the device file open and hence prevent the device
> >>>>>>> refcount from dropping to 0. The solution in this patch is brute
> >>>>>>> force: we try to find any place where we access MMIO mapped into
> >>>>>>> kernel address space and guard the write access with a
> >>>>>>> 'device-unplug' flag. This solution is obviously racy,
> >>>>>>> because a device can be unplugged right after checking the
> >>>>>>> flag. I had an idea to instead not release but keep those
> >>>>>>> ranges reserved even after pci_driver.remove, until the DRM device
> >>>>>>> is destroyed when its refcount drops to 0, and by this to
> >>>>>>> prevent newly plugged-in devices that get allocated some of the same
> >>>>>>> MMIO address range from receiving accidental writes into their
> >>>>>>> registers. But on dri-devel I was advised that this would
> >>>>>>> upset the PCI subsystem and is best avoided, but I would still
> >>>>>>> like another opinion from PCI experts on whether this
> >>>>>>> approach is indeed not possible?

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch