* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231005 11:55]:
> 
> 
> On 2023/10/4 22:25, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231004 05:09]:
> > > 
> > > 
> > > On 2023/10/4 02:45, Liam R. Howlett wrote:
> > > > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230924 23:58]:
> > > > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > > > duplicate a maple tree. They duplicate a maple tree in Depth-First
> > > > > Search (DFS) pre-order traversal. It uses memcpy() to copy nodes in
> > > > > the source tree and allocates new child nodes in non-leaf nodes. The
> > > > > new node is exactly the same as the source node except for all the
> > > > > addresses stored in it. It will be faster than traversing all
> > > > > elements in the source tree and inserting them one by one into the
> > > > > new tree. The time complexity of these two functions is O(n).
> > > > >
> > > > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > > > handles locks internally.
> > > > >
> > > > > Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > > > ---
> > > > >  include/linux/maple_tree.h |   3 +
> > > > >  lib/maple_tree.c           | 286 +++++++++++++++++++++++++++++++++++++
> > > > >  2 files changed, 289 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > > > index 666a3764ed89..de5a4056503a 100644
> > > > > --- a/include/linux/maple_tree.h
> > > > > +++ b/include/linux/maple_tree.h
> > > > > @@ -329,6 +329,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > > > >  		void *entry, gfp_t gfp);
> > > > >  void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > > >
> > > > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > > > +
> > > > >  void mtree_destroy(struct maple_tree *mt);
> > > > >  void __mt_destroy(struct maple_tree *mt);
> > > > >
> > > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > > > index 3fe5652a8c6c..ed8847b4f1ff 100644
> > > > > --- a/lib/maple_tree.c
> > > > > +++ b/lib/maple_tree.c
> > > > > @@ -6370,6 +6370,292 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > > > >  }
> > > > >  EXPORT_SYMBOL(mtree_erase);
> > > > >
> > > > > +/*
> > > > > + * mas_dup_free() - Free an incomplete duplication of a tree.
> > > > > + * @mas: The maple state of an incomplete tree.
> > > > > + *
> > > > > + * The parameter @mas->node passed in indicates that the allocation failed on
> > > > > + * this node. This function frees all nodes starting from @mas->node in the
> > > > > + * reverse order of mas_dup_build(). There is no need to hold the source tree
> > > > > + * lock at this time.
> > > > > + */
> > > > > +static void mas_dup_free(struct ma_state *mas)
> > > > > +{
> > > > > +	struct maple_node *node;
> > > > > +	enum maple_type type;
> > > > > +	void __rcu **slots;
> > > > > +	unsigned char count, i;
> > > > > +
> > > > > +	/* Maybe the first node allocation failed. */
> > > > > +	if (mas_is_none(mas))
> > > > > +		return;
> > > > > +
> > > > > +	while (!mte_is_root(mas->node)) {
> > > > > +		mas_ascend(mas);
> > > > > +
> > > > > +		if (mas->offset) {
> > > > > +			mas->offset--;
> > > > > +			do {
> > > > > +				mas_descend(mas);
> > > > > +				mas->offset = mas_data_end(mas);
> > > > > +			} while (!mte_is_leaf(mas->node));
> > > > > +
> > > > > +			mas_ascend(mas);
> > > > > +		}
> > > > > +
> > > > > +		node = mte_to_node(mas->node);
> > > > > +		type = mte_node_type(mas->node);
> > > > > +		slots = ma_slots(node, type);
> > > > > +		count = mas_data_end(mas) + 1;
> > > > > +		for (i = 0; i < count; i++)
> > > > > +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > > > +
> > > > > +		mt_free_bulk(count, slots);
> > > > > +	}
> > > > > +
> > > > > +	node = mte_to_node(mas->node);
> > > > > +	mt_free_one(node);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * mas_copy_node() - Copy a maple node and replace the parent.
> > > > > + * @mas: The maple state of the source tree.
> > > > > + * @new_mas: The maple state of the new tree.
> > > > > + * @parent: The parent of the new node.
> > > > > + *
> > > > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > > > + * @new_mas->node. If memory allocation fails, @mas is set to -ENOMEM.
> > > > > + */
> > > > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > > > +		struct maple_pnode *parent)
> > > > > +{
> > > > > +	struct maple_node *node = mte_to_node(mas->node);
> > > > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > > > +	unsigned long val;
> > > > > +
> > > > > +	/* Copy the node completely. */
> > > > > +	memcpy(new_node, node, sizeof(struct maple_node));
> > > > > +
> > > > > +	/* Update the parent node pointer. */
> > > > > +	val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> > > > > +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * mas_dup_alloc() - Allocate child nodes for a maple node.
> > > > > + * @mas: The maple state of the source tree.
> > > > > + * @new_mas: The maple state of the new tree.
> > > > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > > > + *
> > > > > + * This function allocates child nodes for @new_mas->node during the
> > > > > + * duplication process. If memory allocation fails, @mas is set to -ENOMEM.
> > > > > + */
> > > > > +static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
> > > > > +		gfp_t gfp)
> > > > > +{
> > > > > +	struct maple_node *node = mte_to_node(mas->node);
> > > > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > > > +	enum maple_type type;
> > > > > +	unsigned char request, count, i;
> > > > > +	void __rcu **slots;
> > > > > +	void __rcu **new_slots;
> > > > > +	unsigned long val;
> > > > > +
> > > > > +	/* Allocate memory for child nodes. */
> > > > > +	type = mte_node_type(mas->node);
> > > > > +	new_slots = ma_slots(new_node, type);
> > > > > +	request = mas_data_end(mas) + 1;
> > > > > +	count = mt_alloc_bulk(gfp, request, (void **)new_slots);
> > > > > +	if (unlikely(count < request)) {
> > > > > +		if (count) {
> > > > > +			mt_free_bulk(count, new_slots);
> > > >
> > > > If you look at mm/slab.c:kmem_cache_alloc_bulk(), you will see that
> > > > the error path already bulk frees for you - but does not zero the
> > > > array. This bulk free will lead to a double free, but you do need the
> > > > below memset(). Also, it will return 0 or request.
> > > > So, I think this code is never executed as it is written.
> > >
> > > If kmem_cache_alloc_bulk() is called to allocate memory in
> > > mt_alloc_bulk(), then this code will not be executed, because it only
> > > returns 0 or request. However, I am concerned that changes to
> > > mt_alloc_bulk() like [1] may be merged, which could potentially lead
> > > to memory leaks. To improve robustness, I wrote it this way.
> > >
> > > How do you think it should be handled? Is it okay to handle it like
> > > the code below?
> > >
> > > if (unlikely(count < request)) {
> > > 	memset(new_slots, 0, request * sizeof(unsigned long));
> > > 	mas_set_err(mas, -ENOMEM);
> > > 	return;
> > > }
> > >
> > > [1] https://lore.kernel.org/lkml/20230810163627.6206-13-vbabka@xxxxxxx/
> >
> > Ah, I see.
> >
> > We should keep the same functionality as before. The code you are
> > referencing is an RFC and won't be merged as-is. We should be sure to
> > keep an eye on this happening.
> >
> > I think the code you have there is correct.
> >
> > > > I don't think this will show up in your test cases because the test
> > > > code doesn't leave dangling pointers and simply returns 0 if there
> > > > aren't enough nodes.
> > >
> > > Yes, no testing here.
> >
> > Yeah :/ I think we should update the test code at some point to behave
> > the same as the real code. Don't worry about it here though.
>
> If we want to test this here, we need to modify kmem_cache_alloc_bulk()
> in user space so that it allocates only a portion of the requested
> memory. This would make it behave differently from the corresponding
> function in kernel space. I'm not sure if this modification is
> acceptable.

Well, no. We can change the test code to do the same as the kernel - if
there aren't enough nodes, then put pointers to the ones we have in the
array but don't actually allocate them (or free them). This way we will
catch double frees or memory leaks. Essentially, leave the partial data
in the array.

If you want to test it in-kernel, then you could alter the kernel
function to check the task name with some global counter to cause it to
fail out after some count. That would just be a one-off test, though.

> Also, I might need to move the memset() outside of the if statement
> (if (unlikely(count < request)) {}) to use it for cleaning up residual
> pointers.

Outside the "if (count)" statement? Yes. I think you have this above in
the "code below" section, right?

Thanks,
Liam
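
For anyone wanting to see it spelled out, below is a minimal sketch of
the test-shim behaviour described above. It is illustrative only: the
test_alloc_bulk() and fail_at names are hypothetical, not the actual
tools/testing/radix-tree code. The point is the error path: on a
simulated partial failure it frees the nodes it did get, but leaves the
stale pointer values in the caller's array and returns 0, mirroring the
kernel's kmem_cache_alloc_bulk() error path.

#include <stdlib.h>

/* Hypothetical fault-injection knob: fail once this many nodes are handed out. */
static int fail_at = -1;

int test_alloc_bulk(size_t size, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (fail_at >= 0 && i >= (size_t)fail_at)
			goto error;
		p[i] = malloc(size);
		if (!p[i])
			goto error;
	}
	return (int)nr;

error:
	/*
	 * Mirror the kernel error path: free what was allocated, but do
	 * NOT zero the array. p[0..i-1] keep their stale values, so a
	 * caller that bulk-frees them again double-frees, and a caller
	 * that forgets them is caught by leak checking.
	 */
	while (i--)
		free(p[i]);
	return 0;
}

With something like this in the userspace shim, a forced failure in
mas_dup_alloc() would exercise mas_dup_free() against the same array
contents the kernel would leave behind, which is exactly what catches
the double free discussed above.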