On 7/22/22 08:34, Jason Gunthorpe wrote:
> On Thu, Jul 21, 2022 at 07:00:23PM -0400, Felix Kuehling wrote:
>> Hi all,
>>
>> We're noticing some unexpected behaviour when the amdgpu and Mellanox
>> drivers are interacting on shared memory with hmm_range_fault. If the
>> amdgpu driver migrated pages to DEVICE_PRIVATE memory, we would expect
>> hmm_range_fault called by the Mellanox driver to fault them back to
>> system memory. But that's not happening. Instead hmm_range_fault fails.
>>
>> For an experiment, Philip hacked hmm_vma_handle_pte to treat
>> DEVICE_PRIVATE pages like device_exclusive pages, which gave us the
>> expected behaviour. It would result in a dev_pagemap_ops.migrate_to_ram
>> callback in our driver, and hmm_range_fault would return system memory
>> pages to the Mellanox driver.
>>
>> So something is clearly wrong. It could be:
>>
>>   * our expectations are wrong,
>>   * the implementation of hmm_range_fault is wrong, or
>>   * our driver is missing something when migrating to DEVICE_PRIVATE
>>     memory.
>>
>> Do you have any insights?
>
> I think it is a bug
>
> Jason
Yes, looks like a bug to me too. hmm_vma_handle_pte() calls hmm_is_device_private_entry(), which correctly handles the case where the device private entry is owned by the driver calling hmm_range_fault(), but then does nothing to fault the page back in if the device private entry is owned by some other driver. I'll work with Alistair and one of us will post a fix. Thanks for finding this!
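
For reference, the !pte_present() path in hmm_vma_handle_pte() looks roughly like this (sketched from memory, so the details may not match your tree exactly). A DEVICE_PRIVATE entry owned by another driver currently falls through to the -EFAULT at the bottom; one possible shape of a fix, marked below, is to send such entries down the fault path so hmm_vma_fault() ends up in the owning driver's migrate_to_ram() callback. Treat the marked hunk as a sketch of the idea, not the actual patch:

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/*
		 * Device private entries owned by the caller are just
		 * reported as-is; the caller can access its own memory.
		 */
		if (hmm_is_device_private_entry(range, entry)) {
			cpu_flags = HMM_PFN_VALID;
			if (is_writable_device_private_entry(entry))
				cpu_flags |= HMM_PFN_WRITE;
			*hmm_pfn = swp_offset(entry) | cpu_flags;
			return 0;
		}

		required_fault =
			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
		if (!required_fault) {
			*hmm_pfn = 0;
			return 0;
		}

		if (!non_swap_entry(entry))
			goto fault;

		/*
		 * Sketch of a possible fix: device private entries owned
		 * by another driver should take the fault path too, so
		 * hmm_vma_fault() triggers the owner's migrate_to_ram()
		 * and the page comes back to system memory.
		 */
		if (is_device_private_entry(entry))
			goto fault;

		if (is_device_exclusive_entry(entry))
			goto fault;

		if (is_migration_entry(entry)) {
			pte_unmap(ptep);
			hmm_vma_walk->last = addr;
			migration_entry_wait(walk->mm, pmdp, addr);
			return -EBUSY;
		}

		/*
		 * Without the hunk above, a non-owned DEVICE_PRIVATE entry
		 * ends up here and hmm_range_fault() fails, which is the
		 * behaviour you are seeing.
		 */
		pte_unmap(ptep);
		return -EFAULT;
	}

Faulting rather than reporting the PFN seems like the right behaviour for the non-owned case, since the caller has no way to access another device's private memory anyway.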