[PATCH 01/11] drm/amdgpu: try allocating VRAM as power of two

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2018-09-11 10:48 AM, Christian König wrote:
> Am 11.09.2018 um 16:27 schrieb Kuehling, Felix:
>>> Otherwise we don't get pages larger than 2MB for the L1 on Vega10.
>> Yeah, I realized that after reviewing the rest of the patch series.
>>
>>> But another question: Why do you want to clear VRAM on allocation? We
>>> perfectly support allocating VRAM without clearing it.
>> As Michel pointed out, that's a security problem because it can leak
>> data between processes. Therefore all VRAM allocated through KFD APIs
>> is currently cleared on allocation. However, we have no requirement
>> from our user mode APIs to provide initialized memory. So if we could
>> skip the initialization it would be a big performance improvement in
>> applications that do lots of memory management.
>>
>> I was thinking, it should probably be easy to modify my drm_mm_mode
>> caching idea to deal with two node sizes: 2MB and 1GB. But having to
>> deal with arbitrary power-of-two sizes in between would make things
>> more challenging and less effective.
>
> It is rather unlikely that somebody wants to allocate 1GB, but a 32MB
> texture is perfectly possible. Would be a pity if we loose the extra
> L1 optimization.

True.

>
> But I don't really see a problem with that, cause you could still
> split up the larger allocations on freeing them (keep in mind that we
> always allocate enough nodes for 2MB pages).
>
> Additional to that I would actually not track drm_mm_nodes at all, but
> instead use an array where I note for each 2MB page who the last owner
> was.

That way you'd know whether a node was previously owned by the same
process. But you're leaving it to chance whether you actually get a
"pre-owned" node. I was also trying to deliberately allocate nodes that
were previously owned by the same process to maximize the effectiveness.

>
> If I'm not completely mistaken would only be around 64KB for 16GB of
> VRAM and much simpler than scanning recently freed up drm_mm_nodes.

The nice thing with my plan was that I wouldn't need to scan anything if
there were no placement restrictions. Just pick the first node from a
per-process list. The nodes on the list would have been all the same size.

Anyway, I don't want to hold up your change. I'll have to think of a
different way of doing this. And if this can't be done efficiently in
the kernel, the user mode team was even considering using something like
jemalloc. Feel free to submit your change and add my Reviewed-by.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>> ________________________________________
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Tuesday, September 11, 2018 2:49:44 AM
>> To: Kuehling, Felix; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 01/11] drm/amdgpu: try allocating VRAM as power
>> of two
>>
>> Yeah well the whole patch set depends on that :)
>>
>> Otherwise we don't get pages larger than 2MB for the L1 on Vega10.
>>
>> But another question: Why do you want to clear VRAM on allocation? We
>> perfectly support allocating VRAM without clearing it.
>>
>> Regards,
>> Christian.
>>
>> Am 11.09.2018 um 02:08 schrieb Felix Kuehling:
>>> This looks good. But it complicates something I've been looking at:
>>> Remembering which process drm_mm_nodes last belonged to, so that they
>>> don't need to be cleared next time they are allocated by the same
>>> process. Having most nodes the same size (vram_page_split pages) would
>>> make this very easy and efficient for the most common cases (large
>>> allocations without any exotic address limitations or alignment
>>> requirements).
>>>
>>> Does anything else in this patch series depend on this optimization?
>>>
>>> Regards,
>>>     Felix
>>>
>>>
>>> On 2018-09-09 02:03 PM, Christian König wrote:
>>>> Try to allocate VRAM in power of two sizes and only fallback to vram
>>>> split sizes if that fails.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 52
>>>> +++++++++++++++++++++-------
>>>>    1 file changed, 40 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>> index 9cfa8a9ada92..3f9d5d00c9b3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
>>>> @@ -124,6 +124,28 @@ u64 amdgpu_vram_mgr_bo_visible_size(struct
>>>> amdgpu_bo *bo)
>>>>       return usage;
>>>>    }
>>>>
>>>> +/**
>>>> + * amdgpu_vram_mgr_virt_start - update virtual start address
>>>> + *
>>>> + * @mem: ttm_mem_reg to update
>>>> + * @node: just allocated node
>>>> + *
>>>> + * Calculate a virtual BO start address to easily check if
>>>> everything is CPU
>>>> + * accessible.
>>>> + */
>>>> +static void amdgpu_vram_mgr_virt_start(struct ttm_mem_reg *mem,
>>>> +                                   struct drm_mm_node *node)
>>>> +{
>>>> +    unsigned long start;
>>>> +
>>>> +    start = node->start + node->size;
>>>> +    if (start > mem->num_pages)
>>>> +            start -= mem->num_pages;
>>>> +    else
>>>> +            start = 0;
>>>> +    mem->start = max(mem->start, start);
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_vram_mgr_new - allocate new ranges
>>>>     *
>>>> @@ -176,10 +198,25 @@ static int amdgpu_vram_mgr_new(struct
>>>> ttm_mem_type_manager *man,
>>>>       pages_left = mem->num_pages;
>>>>
>>>>       spin_lock(&mgr->lock);
>>>> -    for (i = 0; i < num_nodes; ++i) {
>>>> +    for (i = 0; pages_left >= pages_per_node; ++i) {
>>>> +            unsigned long pages = rounddown_pow_of_two(pages_left);
>>>> +
>>>> +            r = drm_mm_insert_node_in_range(mm, &nodes[i], pages,
>>>> +                                            pages_per_node, 0,
>>>> +                                            place->fpfn, lpfn,
>>>> +                                            mode);
>>>> +            if (unlikely(r))
>>>> +                    break;
>>>> +
>>>> +            usage += nodes[i].size << PAGE_SHIFT;
>>>> +            vis_usage += amdgpu_vram_mgr_vis_size(adev, &nodes[i]);
>>>> +            amdgpu_vram_mgr_virt_start(mem, &nodes[i]);
>>>> +            pages_left -= pages;
>>>> +    }
>>>> +
>>>> +    for (; pages_left; ++i) {
>>>>               unsigned long pages = min(pages_left, pages_per_node);
>>>>               uint32_t alignment = mem->page_alignment;
>>>> -            unsigned long start;
>>>>
>>>>               if (pages == pages_per_node)
>>>>                       alignment = pages_per_node;
>>>> @@ -193,16 +230,7 @@ static int amdgpu_vram_mgr_new(struct
>>>> ttm_mem_type_manager *man,
>>>>
>>>>               usage += nodes[i].size << PAGE_SHIFT;
>>>>               vis_usage += amdgpu_vram_mgr_vis_size(adev, &nodes[i]);
>>>> -
>>>> -            /* Calculate a virtual BO start address to easily
>>>> check if
>>>> -             * everything is CPU accessible.
>>>> -             */
>>>> -            start = nodes[i].start + nodes[i].size;
>>>> -            if (start > mem->num_pages)
>>>> -                    start -= mem->num_pages;
>>>> -            else
>>>> -                    start = 0;
>>>> -            mem->start = max(mem->start, start);
>>>> +            amdgpu_vram_mgr_virt_start(mem, &nodes[i]);
>>>>               pages_left -= pages;
>>>>       }
>>>>       spin_unlock(&mgr->lock);
>



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux