Re: [PATCH v2] resource: limit request_free_mem_region based on arch_get_mappable_range

D Scott Phillips <scott@xxxxxxxxxxxxxxxxxxxxxx> · Tue, 27 Aug 2024 19:57:01 -0700

Dan Williams <dan.j.williams@xxxxxxxxx> writes:

> D Scott Phillips wrote:
>> On arm64 prior to commit 32697ff38287 ("arm64: vmemmap: Avoid base2 order
>> of struct page size to dimension region"), the amdgpu driver could trip
>> over the warning of:
>> 
>> `WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));`
>> 
>> in vmemmap_populate()[1]. After that commit, it becomes a translation fault
>> and panic[2].
>> 
>> The cause is that the amdgpu driver allocates some unused space from
>> iomem_resource and claims it as MEMORY_DEVICE_PRIVATE and
>> devm_memremap_pages() it. An address above those backed by the arm64
>> vmemmap is picked.
>> 
>> Limit request_free_mem_region() so that only addresses within the
>> arch_get_mappable_range() can be chosen as device private addresses.
>
> It seems odd that devm_request_free_mem_region() needs to be careful
> about this restriction. The caller passes in the resource tree that is
> the bounds of valid address ranges. This change assumes that the caller
> wants to be restricted to vmemmap capable address ranges beyond the
> restrictions it already requested in the passed in @base argument. That
> restriction may be true with respect to request_mem_region(), but not
> necessarily other users of get_free_mem_region() like
> alloc_free_mem_region().
>
> So, 2 questions / change request options:
>
> 1/ Preferred: Is there a possibility for the AMD driver to trim the
> resource it is passing to be bound by arch_get_mappable_range()? For CXL
> this is achieved by inserting CXL aperture windows into the resource
> tree.
>
> In the future what happens in the MEMORY_DEVICE_PUBLIC case when the
> memory address is picked by a hardware aperture on the device? It occurs
> to me if that aperture is communicated to the device via some platform
> mechanism (to honor arch_get_mappable_range() restrictions), then maybe
> the same should be done here.
>
> I have always cringed at the request_free_mem_region() implementation
> playing fast and loose with the platform memory map. Maybe this episode
> is a sign that these constraints need more formal handling in the
> resource tree.
>
> I.e. IORES_DESC_DEVICE_PRIVATE_MEMORY becomes a platform communicated
> aperture rather than hoping that unused portions of iomem_resource can
> be repurposed like this.

Hi Dan, sorry for my incredibly delayed response, I lost your message to
a filter on my end :(

I'm happy to work toward your preferred approach here, though I'm not
sure I know how to achieve it. I think I understand how cxl is keeping
device_private_memory out, but I don't think I understand the resource
system well enough to see how amdgpu can make a properly trimmed
resource for request_free_mem_region. My novice attempt would be
something like:

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 8ee3d07ffbdfa..d84de6d66ac45 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -1038,7 +1039,14 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
                pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1;
                pgmap->type = MEMORY_DEVICE_COHERENT;
        } else {
-               res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+               struct range mappable;
+               struct resource root;
+
+               mappable = arch_get_mappable_range();
+               root.start = mappable.start;
+               root.end = mappable.end;
+               root.child = iomem_resource.child;
+               res = devm_request_free_mem_region(adev->dev, &root, size);
                if (IS_ERR(res))
                        return PTR_ERR(res);
                pgmap->range.start = res->start;

Apart from this being wrong with respect to resource_lock, is that sort
of the idea? or am I missing the sensible way to hoist the vmemmap range
into iomem_resource? or maybe I'm just totally off in the weeds.