[SNIP]
By eliminating such users, and replacing them with local maps which
are strictly bounded in how long they can exist (and hence we can
serialize against their completion in our hotunplug code).
Not sure I see how serializing against BO map/unmap helps - the problem,
as you described it, is that once the device is extracted and something
else quickly takes its place in the PCI topology and gets assigned the
same physical IO ranges, our driver will start accessing this new device
because our 'zombie' BOs are still pointing to those ranges.
Until your driver's remove callback has finished, the ranges stay
reserved.
The ranges stay reserved until unmapped, which happens in bo->destroy,
I'm not sure of that. Why do you think that?
which for most internally allocated buffers is during sw_fini, when the
last drm_put is called.
If that's not the case, then hotunplug would be fundamentally impossible
to handle correctly.
Of course all the mmio actions will time out, so it might take some time
to get through it all.
I found that the PCI code provides a pci_device_is_present function we
can use to avoid timeouts - it reads the device vendor ID and checks
whether all 1s are returned. We can call it from within the register
accessors before attempting a read/write.
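For illustration, a minimal sketch of that idea (the accessor below is hypothetical; only pci_device_is_present() and readl() are the real kernel helpers):

#include <linux/pci.h>
#include <linux/io.h>

/* Hypothetical accessor sketching the idea above: probe the device with a
 * config-space vendor read before every MMIO access and bail out early
 * instead of letting the access time out. */
static u32 checked_rreg32(struct pci_dev *pdev, void __iomem *mmio, u32 offset)
{
	if (!pci_device_is_present(pdev))
		return ~0;	/* what dead hardware would return anyway */

	return readl(mmio + offset);
}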
That's way too much overhead! We need to keep it much lower or it will
result in quite a performance drop.
I suggest rather thinking about adding drm_dev_enter/exit guards.
Christian.
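For illustration, a minimal sketch of such a guard around an MMIO read (the accessor itself is made up; drm_dev_enter()/drm_dev_exit() and readl() are the real APIs):

#include <drm/drm_drv.h>
#include <linux/io.h>

/* Hypothetical guarded accessor: once drm_dev_unplug() has been called from
 * the remove callback, drm_dev_enter() fails and the MMIO access is skipped,
 * so we never touch IO ranges that may have been reassigned to a new device. */
static u32 guarded_rreg32(struct drm_device *drm, void __iomem *mmio, u32 offset)
{
	u32 val = ~0;	/* mimic what removed hardware would return */
	int idx;

	if (drm_dev_enter(drm, &idx)) {
		val = readl(mmio + offset);
		drm_dev_exit(idx);
	}

	return val;
}

Unlike a config-space probe, drm_dev_enter() is essentially just an SRCU read-side lock, so the per-access overhead is negligible.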
Another point regarding serializing - the problem is that some of those
BOs are very long lived; take for example the HW command ring buffer
Christian mentioned before (amdgpu_ring_init->amdgpu_bo_create_kernel).
Its lifespan is basically the entire time the device exists, and it's
destroyed only in the SW fini stage (when the last drm_dev reference is
dropped). So should I grab its dma_resv_lock from the amdgpu_pci_remove
code and wait for it to be unmapped before proceeding with the PCI remove
code? That can take unbounded time, which is why I don't understand how
serializing will help.
Uh, you need to untangle that. After hw cleanup is done no one is allowed
to touch that ringbuffer BO from the kernel anymore.
I would assume we are not allowed to touch it once we've identified that
the device is gone, in order to minimize the chance of accidental writes
to some other device which might now occupy those IO ranges?
That's what drm_dev_enter/exit guards are for. Like you say, we can't
wait for all sw references to disappear.
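A rough sketch of the other half of that pattern (the remove callback is hypothetical; drm_dev_unplug(), drm_dev_put() and pci_get_drvdata() are the real APIs):

#include <drm/drm_drv.h>
#include <linux/pci.h>

/* Hypothetical remove callback: drm_dev_unplug() marks the device as gone and
 * waits for anyone currently inside a drm_dev_enter/exit section, after which
 * all guarded accesses become no-ops. SW state is not waited on here; it stays
 * around until the last drm_dev_put(). */
static void my_pci_remove(struct pci_dev *pdev)
{
	struct drm_device *drm = pci_get_drvdata(pdev);

	drm_dev_unplug(drm);
	/* ... hw teardown, without blocking on long-lived sw references ... */
	drm_dev_put(drm);
}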
Yes, it didn't make sense to me why we would use vmap_local for
internally allocated buffers. I think we should also guard register
reads/writes for the same reason as above.
vmap_local is for mappings done by other drivers, through the dma-buf
interface (where "other drivers" can include fbdev/fbcon, if you use the
generic helpers).
-Daniel
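To make the bounded-lifetime property concrete, here is a rough importer-side sketch, assuming the interface from the vmap_local proposal (dma_buf_vmap_local()/dma_buf_vunmap_local() called under the reservation lock; the exact names and signatures may differ in the final series):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Sketch under the assumptions above: the mapping only exists while the
 * reservation lock is held, so its lifetime is strictly bounded and the
 * exporter can serialize against it (e.g. on hotunplug). */
static void touch_imported_buffer(struct dma_buf *dmabuf)
{
	struct dma_buf_map map;

	dma_resv_lock(dmabuf->resv, NULL);
	if (!dma_buf_vmap_local(dmabuf, &map)) {	/* assumed signature */
		/* ... short, bounded access through map.vaddr ... */
		dma_buf_vunmap_local(dmabuf, &map);
	}
	dma_resv_unlock(dmabuf->resv);
}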
Ok, so I assumed that with vmap_local you were trying to solve the
problem of quick reinsertion of another device into the same MMIO range
that my driver still points to, but are you actually trying to solve the
issue of exported dma-bufs outliving the device? For that we have the
drm_device refcount in the GEM layer, I think.
Andrey
It doesn't
solve all your problems, but it's a tool to get there.
-Daniel
Andrey
- handle fbcon somehow. I think shutting it all down should work out.
- worst case, keep the system backing storage around for shared dma-bufs
until the other non-dynamic driver releases it. For vram we require
dynamic importers (and maybe it wasn't such a bright idea to allow
pinning of importer buffers, might need to revisit that).
Cheers, Daniel
Christian.
Andrey
-Daniel
Christian.
I loaded the driver with vm_update_mode=3, meaning all VM updates are
done using the CPU, and haven't seen any oopses after removing the
device. I guess I can test it more by allocating GTT and VRAM BOs and
trying to read/write to them after the device is removed.
Andrey
Regards,
Christian.
Andrey
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx