On 11/24/20 11:44 AM, Christian König wrote:
On 11/24/20 5:22 PM, Andrey Grodzovsky wrote:
On 11/24/20 2:41 AM, Christian König wrote:
On 11/23/20 10:08 PM, Andrey Grodzovsky wrote:
On 11/23/20 3:41 PM, Christian König wrote:
On 11/23/20 9:38 PM, Andrey Grodzovsky wrote:
On 11/23/20 3:20 PM, Christian König wrote:
On 11/23/20 9:05 PM, Andrey Grodzovsky wrote:
On 11/25/20 5:42 AM, Christian König wrote:
On 11/21/20 6:21 AM, Andrey Grodzovsky wrote:
It's needed to drop IOMMU-backed pages on device unplug
before the device's IOMMU group is released.
It would be cleaner if we could do the whole handling in TTM. I also
need to double check what you are doing with this function.
Christian.
Check the patch "drm/amdgpu: Register IOMMU topology notifier per device."
to see how I use it. I don't see why this should go into the TTM mid-layer -
the stuff I do inside is vendor specific, and I also don't think TTM is
explicitly aware of IOMMU ?
Do you mean you prefer the IOMMU notifier to be registered from within TTM
and then use a hook to call into vendor specific handler ?
No, that is really vendor specific.
What I meant is to have a function like ttm_resource_manager_evict_all()
which you only need to call and all tt objects are unpopulated.
So instead of the BO list I create and later iterate over in amdgpu in the
IOMMU patch, you just want to do it within TTM with a single function ?
Makes much more sense.
Yes, exactly.
The list_empty() checks we have in TTM for the LRU are actually not the
best idea, we should now check the pin_count instead. This way we could
also have a list of the pinned BOs in TTM.
So from my IOMMU topology handler I will iterate the TTM LRU for the
unpinned BOs and use this new function for the pinned ones ?
It's probably a good idea to combine both iterations into this new function
to cover all the BOs allocated on the device.
Yes, that's what I had in my mind as well.
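To make the shape of the proposal concrete, here is a toy model in plain C of a single TTM-level entry point that unpopulates every BO on a device by walking both the LRU (unpinned BOs) and a list of pinned BOs, keyed off a pin count as suggested above. All struct and function names here are invented stand-ins for illustration, not the real TTM API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these stand in for TTM's real ttm_buffer_object /
 * ttm_tt structures; the names are invented for this sketch. */
struct toy_bo {
	int pin_count;		/* > 0 means pinned, per the idea above */
	int populated;		/* 1 while DMA backing pages are attached */
	struct toy_bo *next;
};

struct toy_device {
	struct toy_bo *lru;	/* unpinned BOs */
	struct toy_bo *pinned;	/* pinned BOs */
};

static void toy_tt_unpopulate(struct toy_bo *bo)
{
	bo->populated = 0;	/* drop the DMA mappings / backing pages */
}

/* Hypothetical single entry point, in the spirit of
 * ttm_resource_manager_evict_all(): walk both lists so every BO
 * allocated on the device ends up unpopulated, pinned or not. */
static void toy_device_unpopulate_all(struct toy_device *dev)
{
	struct toy_bo *bo;

	for (bo = dev->lru; bo; bo = bo->next)
		toy_tt_unpopulate(bo);
	for (bo = dev->pinned; bo; bo = bo->next)
		toy_tt_unpopulate(bo);
}
```

The point of the sketch is just that the caller (the IOMMU topology handler) would need one call instead of maintaining its own BO list in amdgpu.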
BTW: Have you thought about what happens when we unpopulate a BO while we
still try to use a kernel mapping for it? That could have unforeseen
consequences.
Are you asking what happens to kmap- or vmap-style mapped CPU accesses once
we drop all the DMA backing pages for a particular BO ? Because for user
mappings (mmap) we took care of this with the dummy page reroute, but indeed
nothing was done for in-kernel CPU mappings.
Yes exactly that.
In other words what happens if we free the ring buffer while the kernel
still writes to it?
Christian.
While we can't control user application accesses to the mapped buffers
explicitly (hence the page fault rerouting), I am thinking that in this case
we may be able to sprinkle drm_dev_enter/exit in any such sensitive place
where we might CPU-access a DMA buffer from the kernel ?
Yes, I fear we are going to need that.
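For reference, a minimal standalone sketch of the guarding pattern being discussed. drm_dev_enter()/drm_dev_exit() are the real DRM APIs (drm_dev_enter() returns false once drm_dev_unplug() has run); the `toy_` stubs below merely model that behavior so the pattern compiles outside the kernel, and the ring-write helper is an invented example of a "sensitive place":

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for drm_dev_enter()/drm_dev_exit(): in the kernel,
 * drm_dev_enter() returns false after drm_dev_unplug() and otherwise
 * enters an SRCU read section that drm_dev_exit() leaves. */
struct toy_drm_device {
	bool unplugged;
};

static bool toy_drm_dev_enter(struct toy_drm_device *dev, int *idx)
{
	*idx = 0;		/* would be the SRCU index in the kernel */
	return !dev->unplugged;
}

static void toy_drm_dev_exit(int idx)
{
	(void)idx;
}

/* One sensitive spot: a kernel CPU write into a ring buffer. With the
 * guard, the access becomes a no-op once the device is gone, instead
 * of touching pages that may have been released. */
static bool ring_write(struct toy_drm_device *dev, unsigned int *ring,
		       unsigned int val)
{
	int idx;

	if (!toy_drm_dev_enter(dev, &idx))
		return false;	/* device unplugged: skip the access */

	*ring = val;
	toy_drm_dev_exit(idx);
	return true;
}
```

The same wrap-and-bail shape would apply to the other candidates mentioned below (page table updates, firmware memcpy, and so on).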
Things like CPU page table updates, ring buffer accesses and FW memcpy ?
Are there other places ?
Puh, good question. I have no idea.
Another point is that at this stage the driver shouldn't access any such
buffers anyway, as we are in the process of tearing down the device.
AFAIK there is no page fault mechanism for kernel mappings so I don't think
there is anything else to do ?
Well there is a page fault handler for kernel mappings, but that one just
prints the stack trace into the system log and calls BUG(); :)
Long story short we need to avoid any access to released pages after unplug.
No matter if it's from the kernel or userspace.
I was just about to start guarding kernel CPU accesses to GTT or VRAM buffers
with drm_dev_enter/exit, but then I looked more into the code, and it seems
ttm_tt_unpopulate just deletes the DMA mappings (which exist for the sake of
device access to main memory). The kernel page table is not touched until the
last BO refcount is dropped and the BO is released
(ttm_bo_release->destroy->amdgpu_bo_destroy->amdgpu_bo_kunmap). This is true
both for GTT BOs mapped into the kernel by kmap (or vmap) and for VRAM BOs
mapped by ioremap.

So as I see it, nothing bad will happen if we unpopulate a BO while we still
use a kernel mapping for it: the system memory pages backing GTT BOs are still
mapped and not freed, and the same holds for the IO physical ranges of VRAM
BOs mapped into the kernel page table, since iounmap wasn't called yet. I
loaded the driver with vm_update_mode=3, meaning all VM updates are done using
the CPU, and haven't seen any oopses after removing the device. I guess I can
test it more by allocating GTT and VRAM BOs and trying to read/write to them
after the device is removed.
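The lifetime argument above can be captured in a small toy model: unpopulate only drops the DMA mappings, while the kernel CPU mapping stays valid until the final release path runs. All names below are invented stand-ins for the TTM/amdgpu objects and calls mentioned above:

```c
#include <assert.h>
#include <stddef.h>

/* Toy lifetime model of the behavior described above. "dma_mapped"
 * stands for the DMA mappings that ttm_tt_unpopulate tears down;
 * "kptr" stands for the kernel mapping (kmap/vmap/ioremap) that only
 * goes away on final release (ttm_bo_release -> destroy ->
 * amdgpu_bo_kunmap in the real code). */
struct toy_bo {
	int dma_mapped;
	char backing[64];	/* stands in for the still-resident pages */
	char *kptr;		/* kernel CPU mapping */
};

static void toy_bo_kmap(struct toy_bo *bo)
{
	bo->kptr = bo->backing;
}

static void toy_tt_unpopulate(struct toy_bo *bo)
{
	bo->dma_mapped = 0;	/* DMA mappings gone, pages still there */
}

/* Final release: only here does the kernel mapping disappear. */
static void toy_bo_release(struct toy_bo *bo)
{
	bo->kptr = NULL;
}
```

In this model a CPU store through `kptr` between unpopulate and release is still backed by live memory, which is exactly the claim in the paragraph above.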
Andrey
Regards,
Christian.
Andrey
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx