On Wed, Dec 16, 2020 at 5:18 PM Christian König <christian.koenig@xxxxxxx> wrote:
>
> On 16.12.20 at 17:13, Andrey Grodzovsky wrote:
> >
> > On 12/16/20 9:21 AM, Daniel Vetter wrote:
> >> On Wed, Dec 16, 2020 at 9:04 AM Christian König
> >> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> >>> On 15.12.20 at 21:18, Andrey Grodzovsky wrote:
> >>>> [SNIP]
> >>>>>> While we can't control user application accesses to the mapped
> >>>>>> buffers explicitly and hence we use page fault rerouting,
> >>>>>> I am thinking that in this case we may be able to sprinkle
> >>>>>> drm_dev_enter/exit in any such sensitive place where we might
> >>>>>> CPU access a DMA buffer from the kernel?
> >>>>> Yes, I fear we are going to need that.
> >>>>>
> >>>>>> Things like CPU page table updates, ring buffer accesses and FW
> >>>>>> memcpy? Are there other places?
> >>>>> Puh, good question. I have no idea.
> >>>>>
> >>>>>> Another point is that at this point the driver shouldn't access any
> >>>>>> such buffers as we are in the process of finishing the device.
> >>>>>> AFAIK there is no page fault mechanism for kernel mappings so I
> >>>>>> don't think there is anything else to do?
> >>>>> Well there is a page fault handler for kernel mappings, but that one
> >>>>> just prints the stack trace into the system log and calls BUG(); :)
> >>>>>
> >>>>> Long story short we need to avoid any access to released pages after
> >>>>> unplug. No matter if it's from the kernel or userspace.
> >>>>
> >>>> I was just about to start guarding with drm_dev_enter/exit CPU
> >>>> accesses from kernel to GTT or VRAM buffers, but then I looked more
> >>>> in the code and it seems like ttm_tt_unpopulate just deletes DMA
> >>>> mappings (for the sake of device to main memory access). Kernel page
> >>>> table is not touched until last bo refcount is dropped and the bo is
> >>>> released
> >>>> (ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap).
> >>>> This is both for GTT BOs mapped to kernel by kmap (or vmap) and for
> >>>> VRAM BOs mapped by ioremap. So as I see it, nothing bad will happen
> >>>> after we unpopulate a BO while we still try to use a kernel mapping
> >>>> for it: system memory pages backing GTT BOs are still mapped and not
> >>>> freed, and for VRAM BOs the same holds for the IO physical ranges
> >>>> mapped into the kernel page table, since iounmap wasn't called yet.
> >>> The problem is the system pages would be freed and if the kernel driver
> >>> still happily writes to them we are pretty much busted because we write
> >>> to freed up memory.
> >
> >
> > OK, I see I missed ttm_tt_unpopulate->..->ttm_pool_free which will
> > release the GTT BO pages. But then isn't there a problem in
> > ttm_bo_release, since ttm_bo_cleanup_memtype_use, which also leads to
> > pages release, comes before bo->destroy, which unmaps the pages from
> > the kernel page table? Won't we end up writing to freed memory in
> > this time interval? Don't we need to postpone pages freeing to after
> > kernel page table unmapping?
>
> BOs are only destroyed when there is a guarantee that nobody is
> accessing them any more.
>
> The problem here is that the pages as well as the VRAM can be
> immediately reused after the hotplug event.
>
> >
> >
> >> Similar for vram, if this is actual hotunplug and then replug, there's
> >> going to be a different device behind the same mmio bar range most
> >> likely (the higher bridges all this have the same windows assigned),
> >
> >
> > No idea how this actually works but if we haven't called iounmap yet
> > doesn't it mean that those physical ranges that are still mapped into
> > the page table should be reserved and cannot be reused for another
> > device? As a guess, maybe another subrange from the higher bridge's
> > total range will be allocated.
>
> Nope, the PCIe subsystem doesn't care about any ioremap still active for
> a range when it is hotplugged.
>
> >
> >> and that's bad news if we keep using it for current drivers. So we
> >> really have to point all these cpu ptes to some other place.
> >
> >
> > We can't just unmap it without syncing against any in-kernel accesses
> > to those buffers, and since the page faulting technique we use for
> > user mapped buffers seems to not be possible for kernel mapped
> > buffers I am not sure how to do it gracefully...
>
> We could try to replace the kmap with a dummy page under the hood, but
> that is extremely tricky.
>
> Especially since BOs which are just 1 page in size could point to the
> linear mapping directly.

I think it's just more work. Essentially
- convert as much as possible of the kernel mappings to vmap_local,
  which Thomas Zimmermann is rolling out. That way a dma_resv_lock will
  serve as a barrier, and ofc any new vmap needs to fail or hand out a
  dummy mapping.
- handle fbcon somehow. I think shutting it all down should work out.
- worst case keep the system backing storage around for shared dma-buf
  until the other non-dynamic driver releases it. For vram we require
  dynamic importers (and maybe it wasn't such a bright idea to allow
  pinning of importer buffers, might need to revisit that).

Cheers, Daniel

>
> Christian.
>
> >
> > Andrey
> >
> >
> >> -Daniel
> >>
> >>> Christian.
> >>>
> >>>> I loaded the driver with vm_update_mode=3, meaning all VM updates
> >>>> are done using the CPU, and haven't seen any OOPs after removing
> >>>> the device. I guess I can test it more by allocating GTT and VRAM
> >>>> BOs and trying to read/write to them after the device is removed.
> >>>>
> >>>> Andrey
> >>>>
> >>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Andrey
> >>>>>
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C6ee2a428d88a4742f45a08d8a1cde9c7%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637437253067654506%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=WRL2smY7iemgZdlH3taUZCoa8h%2BuaKD1Hv0tbHUclAQ%3D&reserved=0
> >>>>
> >>
>

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
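
Note on the drm_dev_enter/exit guarding discussed in the thread: the idea of
wrapping every kernel-side CPU access to a buffer so that it is skipped once
the device is gone can be illustrated with a small user-space model. This is
only a sketch under stated assumptions, not driver code. All toy_* names are
hypothetical; the real kernel calls being imitated are drm_dev_enter(),
drm_dev_exit() and drm_dev_unplug(), and the kernel synchronizes with SRCU
rather than the reader counter used here. The vram array merely stands in for
a CPU mapping of a buffer object.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for a DRM device; not the kernel struct. */
struct toy_dev {
    atomic_bool unplugged;   /* set once by toy_dev_unplug() */
    atomic_int readers;      /* critical sections currently in flight */
    unsigned char vram[64];  /* pretend CPU mapping of a buffer object */
};

static void toy_dev_init(struct toy_dev *dev)
{
    atomic_init(&dev->unplugged, false);
    atomic_init(&dev->readers, 0);
    memset(dev->vram, 0, sizeof(dev->vram));
}

/* Modeled on drm_dev_enter(): returns false once the device is unplugged. */
static bool toy_dev_enter(struct toy_dev *dev)
{
    atomic_fetch_add(&dev->readers, 1);
    if (atomic_load(&dev->unplugged)) {
        atomic_fetch_sub(&dev->readers, 1);
        return false;
    }
    return true;
}

/* Modeled on drm_dev_exit(): leaves the guarded section. */
static void toy_dev_exit(struct toy_dev *dev)
{
    atomic_fetch_sub(&dev->readers, 1);
}

/* Modeled on drm_dev_unplug(): flag the device, then wait for readers. */
static void toy_dev_unplug(struct toy_dev *dev)
{
    atomic_store(&dev->unplugged, true);
    while (atomic_load(&dev->readers) > 0)
        ; /* the kernel waits via SRCU here instead of spinning */
}

/* A guarded CPU write, standing in for e.g. a FW memcpy or PTE update. */
static int toy_guarded_write(struct toy_dev *dev, size_t off,
                             const void *src, size_t len)
{
    if (off > sizeof(dev->vram) || len > sizeof(dev->vram) - off)
        return -EINVAL;
    if (!toy_dev_enter(dev))
        return -ENODEV;      /* device gone: do not touch the mapping */
    memcpy(dev->vram + off, src, len);
    toy_dev_exit(dev);
    return 0;
}
```

A caller would run toy_dev_init() once, use toy_guarded_write() for every CPU
access, and call toy_dev_unplug() from the removal path; after that point each
guarded write returns -ENODEV without touching the buffer. The model does not
address the harder problem raised in the thread, namely mappings that are
accessed outside such guarded sections.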