Re: [PATCH 2/2] Add drm buddy manager support to amdgpu driver

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 22 Sep 2021 08:48:26 +0200

Am 21.09.21 um 17:51 schrieb Paneer Selvam, Arunpravin:
[AMD Public Use]

Hi Christian,
Please find my comments.

A better mail client might be helpful for mailing list communication. I 
use Thunderbird, but Outlook with the appropriate settings should do as well.

Thanks,
Arun
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Tuesday, September 21, 2021 2:34 PM
To: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; matthew.auld@xxxxxxxxx; daniel@xxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Subject: Re: [PATCH 2/2] Add drm buddy manager support to amdgpu driver

Am 20.09.21 um 21:21 schrieb Arunpravin:
[SNIP]
+	struct list_head blocks;
+};
+
+static inline struct amdgpu_vram_mgr_node *
+to_amdgpu_vram_mgr_node(struct ttm_resource *res)
+{
+	return container_of(container_of(res, struct ttm_range_mgr_node, base),
+			struct amdgpu_vram_mgr_node, tnode);
+}
+
Maybe stuff that in a separate amdgpu_vram_mgr.h file together with all the other defines for the vram manager.

Arun - I thought about it; I will create a new header file for the VRAM manager.

Maybe make that a separate patch before this one here.

+	if (mode == DRM_BUDDY_ALLOC_RANGE) {
+		r = drm_buddy_alloc_range(mm, &vnode->blocks,
+				(uint64_t)place->fpfn << PAGE_SHIFT, pages << PAGE_SHIFT);
That handling won't work. It's possible that you need contiguous memory in a specific range.

Arun - the existing default backend range handler allocates contiguous nodes in powers of 2 by finding the MSBs of
any given size. We get linked nodes (how many depends on the requested size) in a continuous range of addresses.
For example, for a request of 768 pages, we get a continuous address range of 512 + 256 pages in 2 nodes.
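
To illustrate the decomposition described above, here is a minimal user-space sketch. It is not code from the patch: fls_u64() and split_into_orders() are hypothetical helpers, with fls_u64() standing in for the kernel's fls().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the kernel's fls(): 1-based index of the
 * most significant set bit, 0 when x == 0. (Hypothetical helper.) */
static unsigned int fls_u64(uint64_t x)
{
	unsigned int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Split n_pages into descending power-of-two chunks by repeatedly
 * taking the most significant set bit. Stores each chunk's order in
 * orders[] and returns the number of chunks written. */
static size_t split_into_orders(uint64_t n_pages, unsigned int *orders,
				size_t max_chunks)
{
	size_t count = 0;

	assert(orders != NULL);

	while (n_pages && count < max_chunks) {
		unsigned int order = fls_u64(n_pages) - 1;

		orders[count++] = order;
		n_pages -= (uint64_t)1 << order;
	}
	return count;
}
```

For a 768-page request this yields two chunks of order 9 and order 8, i.e. 512 + 256 pages.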

It works by passing the fpfn and the requested size; the backend handler calculates the lpfn as fpfn + size.
The drawbacks are that we do not handle a specific lpfn value (as of now it is calculated from fpfn + requested size)
and that we do not follow the pages_per_node rule.

Please let me know if this won't work for all specific fpfn / lpfn cases

From your description, that sounds like it won't work in any case.

See, fpfn/lpfn specify the range of the allocation. In the most common 
case that's either 0..visible_vram or 0..start_of_some_hw_limitation.

When you always try to allocate the range from 0 you will quickly find 
that you clash with existing allocations.

What you need to do in general is to have a drm_buddy_alloc() which is 
able to limit the returned pages to the desired range fpfn..lpfn.
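
As a rough sketch of what such a range limit could look like, the following user-space example only hands out a free block when it lies entirely inside the requested range. It is not the drm_buddy API: struct toy_block, block_in_range() and alloc_in_range() are hypothetical names, and lpfn is assumed here to be an exclusive upper bound in pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free block: start offset in pages plus buddy order. */
struct toy_block {
	uint64_t start;
	unsigned int order;
};

/* True when the block lies entirely inside [fpfn, lpfn).
 * Assumption: lpfn is an exclusive bound, in pages. */
static int block_in_range(const struct toy_block *b,
			  uint64_t fpfn, uint64_t lpfn)
{
	uint64_t end = b->start + ((uint64_t)1 << b->order);

	return b->start >= fpfn && end <= lpfn;
}

/* Return the first free block of the requested order that fits the
 * range, or NULL when no such block exists. */
static const struct toy_block *
alloc_in_range(const struct toy_block *free_blocks, size_t n,
	       unsigned int order, uint64_t fpfn, uint64_t lpfn)
{
	size_t i;

	assert(free_blocks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (free_blocks[i].order == order &&
		    block_in_range(&free_blocks[i], fpfn, lpfn))
			return &free_blocks[i];
	}
	return NULL;
}
```

The point of the sketch is that the caller expresses its constraint purely as fpfn..lpfn; the allocator itself needs no knowledge of why the range was chosen.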

+
+			do {
+				unsigned int order;
+
+				order = fls(n_pages) - 1;
+				BUG_ON(order > mm->max_order);
+
+				spin_lock(&mgr->lock);
+				block = drm_buddy_alloc(mm, order, bar_limit_enabled,
+									visible_pfn, mode);
That doesn't seem to make much sense either. The backend allocator should not care about the BAR size nor the visible_pfn.

Arun - we are sending the BAR limit enable information (in the case of an APU or large BAR, we take a different approach) and the visible_pfn
information.

When bar_limit_enabled is true, I thought visible_pfn was required so that the backend allocator could compare it with the block start address
and find the desired blocks for the TOP-DOWN and BOTTOM-UP approaches (TOP-DOWN - return blocks higher than the visible_pfn limit,
BOTTOM-UP - return blocks lower than the visible_pfn limit).

When bar_limit_enabled is false, we just return the topmost blocks and the bottommost blocks for TOP-DOWN and BOTTOM-UP
respectively (suitable for the APU and large BAR cases).

Please let me know if there is another way to fix this problem.

That is the completely wrong approach. The backend must not care about 
the BAR configuration and visibility of the VRAM.

What it should do instead is to take the fpfn..lpfn range into account 
and make sure that all allocated pages are in the desired range.

BOTTOM-UP vs. TOP-DOWN then just optimizes the algorithm because we 
insert all freed up TOP-DOWN pages at the end and all BOTTOM-UP pages at 
the front, and on new allocations walk the list of free pages from the 
front or back depending on the flag.
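
A minimal user-space sketch of that free-list ordering, under stated assumptions: the toy_* names are hypothetical, free ranges are kept in a small fixed array rather than a kernel list, and the caller passes the mode both when freeing and when allocating. It is not the drm_buddy implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_FREE 16

enum toy_mode { TOY_BOTTOM_UP, TOY_TOP_DOWN };

/* Hypothetical free range in pages; end is exclusive. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

struct toy_free_list {
	struct toy_range slots[TOY_MAX_FREE];
	size_t count;
};

/* Freed BOTTOM-UP ranges go to the front, TOP-DOWN ranges to the end. */
static int toy_free(struct toy_free_list *fl, struct toy_range r,
		    enum toy_mode mode)
{
	assert(fl != NULL);

	if (fl->count == TOY_MAX_FREE)
		return -1;

	if (mode == TOY_BOTTOM_UP) {
		memmove(&fl->slots[1], &fl->slots[0],
			fl->count * sizeof(fl->slots[0]));
		fl->slots[0] = r;
	} else {
		fl->slots[fl->count] = r;
	}
	fl->count++;
	return 0;
}

/* Walk from the front (BOTTOM-UP) or the back (TOP-DOWN) and take the
 * first range that lies entirely inside [fpfn, lpfn). */
static int toy_alloc(struct toy_free_list *fl, uint64_t fpfn, uint64_t lpfn,
		     enum toy_mode mode, struct toy_range *out)
{
	size_t i;

	assert(fl != NULL && out != NULL);

	for (i = 0; i < fl->count; i++) {
		size_t idx = (mode == TOY_BOTTOM_UP) ? i : fl->count - 1 - i;
		struct toy_range r = fl->slots[idx];

		if (r.start < fpfn || r.end > lpfn)
			continue;

		/* Remove the chosen slot by closing the gap. */
		memmove(&fl->slots[idx], &fl->slots[idx + 1],
			(fl->count - idx - 1) * sizeof(fl->slots[0]));
		fl->count--;
		*out = r;
		return 0;
	}
	return -1;
}
```

In this sketch the flag only changes where the search starts; whether a range is acceptable is still decided by fpfn..lpfn alone.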

+				spin_unlock(&mgr->lock);

-		vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);
-		amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
-		pages_left -= pages;
-		++i;
+				if (IS_ERR(block)) {
+					r = -ENOSPC;
+					goto error_free_blocks;
+				}

-		if (pages > pages_left)
-			pages = pages_left;
+				n_pages -= BIT(order);
+
+				list_add_tail(&block->link, &vnode->blocks);
+
+				if (!n_pages)
+					break;
+			} while (1);
+
+			pages_left -= pages;
+			++i;
+
+			if (pages > pages_left)
+				pages = pages_left;
+		}
   	}
+
+	spin_lock(&mgr->lock);
+	list_sort(NULL, &vnode->blocks, sort_blocks);
Why do you sort the list here?

Regards,
Christian.

Arun - It gave a better GLmark2 score when we sorted the blocks in ascending order.
It's not required; I will remove it.

Interesting.  Maybe add a TODO comment so that somebody could 
investigate why that happens.

Regards,
Christian.