Hi Peng,
On 6/25/23 05:28, Peng Zhang wrote:
在 2023/6/23 00:41, Danilo Krummrich 写道:
On 6/12/23 22:39, Liam R. Howlett wrote:
Calculate the number of nodes based on the pending write action instead
of assuming the worst case.
Liam already gave me a heads-up on this patch, which I already replied
to [1].
However, I think it might make sense to also reply to this patch
directly.
For a mas_preallocate() calculating the actual required nodes to be
allocated instead of assuming the worst to work, it is required to
ensure that the tree does not change between calling mas_preallocate()
and mas_store_prealloc() if my understanding is correct.
In DRM however, more specifically the DRM GPUVA Manager [2], we do
have the case that we are not able to ensure this:
Jobs to create GPU mappings can be submitted by userspace, are queued
up by the kernel and are processed asynchronously in dma-fence
signalling critical paths, e.g. by using the drm_gpu_scheduler. Hence,
we must be able to allocate the worst case amount of node, since at
the time a job is submitted we can't predict the state the maple tree
keeping track of mappings has once a mapping is inserted in the
(asynchronous) dma-fence signalling critical path.
A more detailed explanation can be found in [1].
Could we keep a separate function for allocating the worst case amount
of nodes additionally to this optimization? E.g. something like
mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I
guess the new one requires the maple tree to be kept locked in order
not to change)?
Hi Danilo,
Your understanding seems incorrect. Even with previously unoptimized
mas_preallocate(), the maple tree cannot be modified between calls to
mas_preallocate() and mas_store_prealloc(). The calculation of the
number of pre-allocated nodes depends on the structure of the maple
tree. In the unoptimized mas_preallocate(), it depends on the height of
the tree. If the maple tree is modified before mas_store_prealloc() and
the height of the tree changes, the number of pre-allocated nodes is
inaccurate.
Thanks for pointing this out!
First of all, it's probably fair to say "naive me", it totally makes
sense the tree height is needed - it's a b-tree.
On the other hand, unless I miss something (and if so, please let me
know), something is bogus with the API then.
While the documentation of the Advanced API of the maple tree explicitly
claims that the user of the API is responsible for locking, this should
be limited to the bounds set by the maple tree implementation. Which
means, the user must decide for either the internal (spin-) lock or an
external lock (which possibly goes away in the future) and acquire and
release it according to the rules maple tree enforces through lockdep
checks.
Let's say one picks the internal lock. How is one supposed to ensure the
tree isn't modified using the internal lock with mas_preallocate()?
Besides that, I think the documentation should definitely mention this
limitation and give some guidance for the locking.
Currently, from an API perspective, I can't see how anyone not familiar
with the implementation details would be able to recognize this limitation.
In terms of the GPUVA manager, unfortunately, it seems like I need to
drop the maple tree and go back to using a rb-tree, since it seems there
is no sane way doing a worst-case pre-allocation that does not suffer
from this limitation.
- Danilo
Regards,
Peng
[1]
https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b-703217230bb0@xxxxxxxxxx/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5
[2]
https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@xxxxxxxxxx/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653
- Danilo
This addresses a performance regression introduced in platforms that
have longer allocation timing.
Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
---
lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 048d6413a114..7ac5b5457603 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
*/
int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
{
+ MA_WR_STATE(wr_mas, mas, entry);
+ unsigned char node_size;
+ int request = 1;
int ret;
- mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
+
+ if (unlikely(!mas->index && mas->last == ULONG_MAX))
+ goto ask_now;
+
+ mas_wr_store_setup(&wr_mas);
+ wr_mas.content = mas_start(mas);
+ /* Root expand */
+ if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
+ goto ask_now;
+
+ if (unlikely(!mas_wr_walk(&wr_mas))) {
+ /* Spanning store, use worst case for now */
+ request = 1 + mas_mt_height(mas) * 3;
+ goto ask_now;
+ }
+
+ /* At this point, we are at the leaf node that needs to be
altered. */
+ /* Exact fit, no nodes needed. */
+ if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
+ return 0;
+
+ mas_wr_end_piv(&wr_mas);
+ node_size = mas_wr_new_end(&wr_mas);
+ /* Slot store can avoid using any nodes */
+ if (node_size == wr_mas.node_end && wr_mas.offset_end -
mas->offset == 1)
+ return 0;
+
+ if (node_size >= mt_slots[wr_mas.type]) {
+ /* Split, worst case for now. */
+ request = 1 + mas_mt_height(mas) * 2;
+ goto ask_now;
+ }
+
+ /* Appending does not need any nodes */
+ if (node_size == wr_mas.node_end + 1 && mas->offset ==
wr_mas.node_end)
+ return 0;
+
+ /* Potential spanning rebalance collapsing a node, use
worst-case */
+ if (node_size - 1 <= mt_min_slots[wr_mas.type])
+ request = mas_mt_height(mas) * 2 - 1;
+
+ /* node store needs one node */
+ask_now:
+ mas_node_count_gfp(mas, request, gfp);
mas->mas_flags |= MA_STATE_PREALLOC;
if (likely(!mas_is_err(mas)))
return 0;