On 2021-06-20 10:14 a.m., Theodore Ts'o wrote:
> On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
>> v1:
>> AMD is building a system architecture for the Frontier supercomputer with a
>> coherent interconnect between CPUs and GPUs. This hardware architecture allows
>> the CPUs to coherently access GPU device memory. We have hardware in our labs
>> and we are working with our partner HPE on the BIOS, firmware and software
>> for delivery to the DOE.
>> The system BIOS advertises the GPU device memory (aka VRAM) as SPM
>> (special purpose memory) in the UEFI system address map. The amdgpu driver looks
>> it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC
>> using devm_memremap_pages.
>> Now we're trying to migrate data to and from that memory using the migrate_vma_*
>> helpers so we can support page-based migration in our unified memory allocations,
>> while also supporting CPU access to those pages.
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave
>> correctly in the migrate_vma_* helpers. We are looking for feedback about this
>> approach. If we're close, what's needed to make our patches acceptable upstream?
>> If we're not close, any suggestions how else to achieve what we are trying to do
>> (i.e. page migration and coherent CPU access to VRAM)?
> Is there a way we can test the codepaths touched by this patchset? It
> doesn't have to be via a complete qemu simulation of the GPU device
> memory, but some way of creating MEMORY_DEVICE_GENERIC subject to
> migrate_vma_* helpers so we can test for regressions moving forward.
Hi Theodore,
I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in
this patch series in a way that is reproducible without special hardware
and firmware:
1. For the reference counting changes we could use the dax driver with hmem
   and use efi_fake_mem on the kernel command line to create some
   DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests
   to exercise dax functionality on this type of memory.
2. For the migration helper changes we could modify or parametrize
   lib/hmm_test.c to create DEVICE_GENERIC pages instead of DEVICE_PRIVATE.
   Then run tools/testing/selftests/vm/hmm-tests.c.
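As one possible starting point for a user mode test of the dax approach, here
is a minimal sketch that maps a device node shared and read-write, writes a
pattern through the mapping and checks that it reads back. It is not part of
the patch series. The helper name dax_roundtrip and the node name /dev/dax0.0
are assumptions for illustration, as is the 2 MiB length. The fake range would
presumably come from something like efi_fake_mem=4G@9G:0x40000 on the kernel
command line (0x40000 being the EFI_MEMORY_SP attribute); the address and size
there are placeholders that must match real RAM on the test machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch series): map `len` bytes of
 * `path` with MAP_SHARED, fill the mapping with a position-dependent
 * byte pattern, then verify the pattern reads back.
 *
 * Returns 0 on success, -1 on open/mmap failure or on a data mismatch.
 *
 * Assumption: for a device-dax node, `len` has to be a multiple of the
 * device alignment (often 2 MiB), otherwise mmap() is expected to fail.
 */
int dax_roundtrip(const char *path, size_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    uint8_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    /* Write phase: every byte depends on its offset. */
    for (size_t i = 0; i < len; i++)
        p[i] = (uint8_t)(i * 31u + 7u);

    /* Verify phase: recompute the pattern and compare. */
    int rc = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != (uint8_t)(i * 31u + 7u)) {
            rc = -1;
            break;
        }
    }

    munmap(p, len);
    return rc;
}
```

On a machine booted with a fake special-purpose range, a caller would invoke
something like dax_roundtrip("/dev/dax0.0", 2UL << 20) and treat a non-zero
return as a failure. This only exercises CPU access through the mapping; it
does not by itself stress the page reference counting paths.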
Regards,
Felix
> Thanks,
> - Ted