Re: [PATCH v3 05/12] drm/ttm: Expose ttm_tt_unpopulate for driver use

Daniel Vetter <daniel@xxxxxxxx> · Thu, 17 Dec 2020 21:42:18 +0100

On Thu, Dec 17, 2020 at 8:19 PM Andrey Grodzovsky
<Andrey.Grodzovsky@xxxxxxx> wrote:
>
>
> On 12/17/20 7:01 AM, Daniel Vetter wrote:
> > On Wed, Dec 16, 2020 at 07:20:02PM -0500, Andrey Grodzovsky wrote:
> >> On 12/16/20 6:15 PM, Daniel Vetter wrote:
> >>> On Wed, Dec 16, 2020 at 7:26 PM Andrey Grodzovsky
> >>> <Andrey.Grodzovsky@xxxxxxx> wrote:
> >>>> On 12/16/20 12:12 PM, Daniel Vetter wrote:
> >>>>> On Wed, Dec 16, 2020 at 5:18 PM Christian König
> >>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>> Am 16.12.20 um 17:13 schrieb Andrey Grodzovsky:
> >>>>>>> On 12/16/20 9:21 AM, Daniel Vetter wrote:
> >>>>>>>> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> >>>>>>>> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >>>>>>>>> Am 15.12.20 um 21:18 schrieb Andrey Grodzovsky:
> >>>>>>>>>> [SNIP]
> >>>>>>>>>>>> While we can't control user application accesses to the mapped
> >>>>>>>>>>>> buffers explicitly and hence we use page fault rerouting
> >>>>>>>>>>>> I am thinking that in this  case we may be able to sprinkle
> >>>>>>>>>>>> drm_dev_enter/exit in any such sensitive place were we might
> >>>>>>>>>>>> CPU access a DMA buffer from the kernel ?
> >>>>>>>>>>> Yes, I fear we are going to need that.
> >>>>>>>>>>>
> >>>>>>>>>>>> Things like CPU page table updates, ring buffer accesses and FW
> >>>>>>>>>>>> memcpy ? Is there other places ?
> >>>>>>>>>>> Puh, good question. I have no idea.
> >>>>>>>>>>>
> >>>>>>>>>>>> Another point is that at this point the driver shouldn't access any
> >>>>>>>>>>>> such buffers as we are at the process finishing the device.
> >>>>>>>>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>>>>>>>>>>> don't think there is anything else to do ?
> >>>>>>>>>>> Well there is a page fault handler for kernel mappings, but that one
> >>>>>>>>>>> just prints the stack trace into the system log and calls BUG(); :)
> >>>>>>>>>>>
> >>>>>>>>>>> Long story short we need to avoid any access to released pages after
> >>>>>>>>>>> unplug. No matter if it's from the kernel or userspace.
> >>>>>>>>>> I was just about to start guarding with drm_dev_enter/exit CPU
> >>>>>>>>>> accesses from kernel to GTT ot VRAM buffers but then i looked more in
> >>>>>>>>>> the code
> >>>>>>>>>> and seems like ttm_tt_unpopulate just deletes DMA mappings (for the
> >>>>>>>>>> sake of device to main memory access). Kernel page table is not
> >>>>>>>>>> touched
> >>>>>>>>>> until last bo refcount is dropped and the bo is released
> >>>>>>>>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This
> >>>>>>>>>> is both
> >>>>>>>>>> for GTT BOs maped to kernel by kmap (or vmap) and for VRAM BOs mapped
> >>>>>>>>>> by ioremap. So as i see it, nothing will bad will happen after we
> >>>>>>>>>> unpopulate a BO while we still try to use a kernel mapping for it,
> >>>>>>>>>> system memory pages backing GTT BOs are still mapped and not freed and
> >>>>>>>>>> for
> >>>>>>>>>> VRAM BOs same is for the IO physical ranges mapped into the kernel
> >>>>>>>>>> page table since iounmap wasn't called yet.
> >>>>>>>>> The problem is the system pages would be freed and if we kernel driver
> >>>>>>>>> still happily write to them we are pretty much busted because we write
> >>>>>>>>> to freed up memory.
> >>>>>>> OK, i see i missed ttm_tt_unpopulate->..->ttm_pool_free which will
> >>>>>>> release
> >>>>>>> the GTT BO pages. But then isn't there a problem in ttm_bo_release since
> >>>>>>> ttm_bo_cleanup_memtype_use which also leads to pages release comes
> >>>>>>> before bo->destroy which unmaps the pages from kernel page table ? Won't
> >>>>>>> we have end up writing to freed memory in this time interval ? Don't we
> >>>>>>> need to postpone pages freeing to after kernel page table unmapping ?
> >>>>>> BOs are only destroyed when there is a guarantee that nobody is
> >>>>>> accessing them any more.
> >>>>>>
> >>>>>> The problem here is that the pages as well as the VRAM can be
> >>>>>> immediately reused after the hotplug event.
> >>>>>>
> >>>>>>>> Similar for vram, if this is actual hotunplug and then replug, there's
> >>>>>>>> going to be a different device behind the same mmio bar range most
> >>>>>>>> likely (the higher bridges all this have the same windows assigned),
> >>>>>>> No idea how this actually works but if we haven't called iounmap yet
> >>>>>>> doesn't it mean that those physical ranges that are still mapped into
> >>>>>>> page
> >>>>>>> table should be reserved and cannot be reused for another
> >>>>>>> device ? As a guess, maybe another subrange from the higher bridge's
> >>>>>>> total
> >>>>>>> range will be allocated.
> >>>>>> Nope, the PCIe subsystem doesn't care about any ioremap still active for
> >>>>>> a range when it is hotplugged.
> >>>>>>
> >>>>>>>> and that's bad news if we keep using it for current drivers. So we
> >>>>>>>> really have to point all these cpu ptes to some other place.
> >>>>>>> We can't just unmap it without syncing against any in kernel accesses
> >>>>>>> to those buffers
> >>>>>>> and since page faulting technique we use for user mapped buffers seems
> >>>>>>> to not be possible
> >>>>>>> for kernel mapped buffers I am not sure how to do it gracefully...
> >>>>>> We could try to replace the kmap with a dummy page under the hood, but
> >>>>>> that is extremely tricky.
> >>>>>>
> >>>>>> Especially since BOs which are just 1 page in size could point to the
> >>>>>> linear mapping directly.
> >>>>> I think it's just more work. Essentially
> >>>>> - convert as much as possible of the kernel mappings to vmap_local,
> >>>>> which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
> >>>>> serve as a barrier, and ofc any new vmap needs to fail or hand out a
> >>>>> dummy mapping.
> >>>> Read those patches. I am not sure how this helps with protecting
> >>>> against accesses to released backing pages or IO physical ranges of BO
> >>>> which is already mapped during the unplug event ?
> >>> By eliminating such users, and replacing them with local maps which
> >>> are strictly bound in how long they can exist (and hence we can
> >>> serialize against them finishing in our hotunplug code).
> >> Not sure I see how serializing against BO map/unmap helps -  our problem as
> >> you described is that once
> >> device is extracted and then something else quickly takes it's place in the
> >> PCI topology
> >> and gets assigned same physical IO ranges, then our driver will start accessing this
> >> new device because our 'zombie' BOs are still pointing to those ranges.
> > Until your driver's remove callback is finished the ranges stay reserved.
>
>
> The ranges stay reserved until unmapped which happens in bo->destroy
> which for most internally allocated  buffers is during sw_fini when last drm_put
> is called.
>
>
> > If that's not the case, then hotunplug would be fundamentally impossible
> > ot handle correctly.
> >
> > Of course all the mmio actions will time out, so it might take some time
> > to get through it all.
>
>
> I found that PCI code provides pci_device_is_present function
> we can use to avoid timeouts - it reads device vendor and checks if all 1s is
> returned
> or not. We can call it from within register accessors before trying read/write

drm_dev_enter/exit is a _lot_ less overhead, plus makes a lot stronger
guarantees for hotunplug ordering. Please use that one instead of
hand-rolling something which only mostly works for closing hotunplug
races. pciconfig access is really slow.

> >> Another point regarding serializing - problem  is that some of those BOs are
> >> very long lived, take for example the HW command
> >> ring buffer Christian mentioned before -
> >> (amdgpu_ring_init->amdgpu_bo_create_kernel), it's life span
> >> is basically for the entire time the device exists, it's destroyed only in
> >> the SW fini stage (when last drm_dev
> >> reference is dropped) and so should I grab it's dma_resv_lock from
> >> amdgpu_pci_remove code and wait
> >> for it to be unmapped before proceeding with the PCI remove code ? This can
> >> take unbound time and that why I don't understand
> >> how serializing will help.
> > Uh you need to untangle that. After hw cleanup is done no one is allowed
> > to touch that ringbuffer bo anymore from the kernel.
>
>
> I would assume we are not allowed to touch it once we identified the device is
> gone in order to minimize the chance of accidental writes to some other device
> which might now
> occupy those IO ranges ?
>
>
> >   That's what
> > drm_dev_enter/exit guards are for. Like you say we cant wait for all sw
> > references to disappear.
>
>
> Yes, didn't make sense to me why would we use vmap_local for internally
> allocated buffers. I think we should also guard registers read/writes for the
> same reason as above.
>
>
> >
> > The vmap_local is for mappings done by other drivers, through the dma-buf
> > interface (where "other drivers" can include fbdev/fbcon, if you use the
> > generic helpers).
> > -Daniel
>
>
> Ok, so I assumed that with vmap_local you were trying to solve the problem of
> quick reinsertion
> of another device into same MMIO range that my driver still points too but
> actually are you trying to solve
> the issue of exported dma buffers outliving the device ? For this we have
> drm_device refcount in the GEM layer
> i think.

That's completely different lifetime problems. Don't mix them up :-)
One problem is the hardware disappearing, and for that we _have_ to
guarantee timeliness, or otherwise the pci subsystem gets pissed
(since like you say, a new device might show up and need it's mmio
bars assigned to io ranges). The other is lifetim of the software
objects we use as interfaces, both from userspace and from other
kernel drivers. There we fundamentally can't enforce timely cleanup,
and have to resort to refcounting.

We need both.
-Daniel

> Andrey
>
>
> >
> >> Andrey
> >>
> >>
> >>> It doesn't
> >>> solve all your problems, but it's a tool to get there.
> >>> -Daniel
> >>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> - handle fbcon somehow. I think shutting it all down should work out.
> >>>>> - worst case keep the system backing storage around for shared dma-buf
> >>>>> until the other non-dynamic driver releases it. for vram we require
> >>>>> dynamic importers (and maybe it wasn't such a bright idea to allow
> >>>>> pinning of importer buffers, might need to revisit that).
> >>>>>
> >>>>> Cheers, Daniel
> >>>>>
> >>>>>> Christian.
> >>>>>>
> >>>>>>> Andrey
> >>>>>>>
> >>>>>>>
> >>>>>>>> -Daniel
> >>>>>>>>
> >>>>>>>>> Christian.
> >>>>>>>>>
> >>>>>>>>>> I loaded the driver with vm_update_mode=3
> >>>>>>>>>> meaning all VM updates done using CPU and hasn't seen any OOPs after
> >>>>>>>>>> removing the device. I guess i can test it more by allocating GTT and
> >>>>>>>>>> VRAM BOs
> >>>>>>>>>> and trying to read/write to them after device is removed.
> >>>>>>>>>>
> >>>>>>>>>> Andrey
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Christian.
> >>>>>>>>>>>
> >>>>>>>>>>>> Andrey
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> amd-gfx mailing list
> >>>>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C92654f053679415de74808d8a2838b3e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637438033181843512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2BeS7v5CrHRfblj2FFCd4nrDLxUxzam6EyHM6poPkGc4%3D&amp;reserved=0
> >>>>>>>>>>
> >>>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx