[PATCH] maple_tree: correct tree corruption on spanning store


 



Writing a data range into a maple tree may involve overwriting a number of
existing entries that span across more than one node. Doing so invokes a
'spanning' store.

Performing a spanning store across two leaf nodes in a maple tree in which
entries are overwritten is achieved by first initialising a 'big' node,
which will store the coalesced entries from the two nodes: the entries
prior to the newly stored entry, the newly stored entry itself, and the
subsequent entries.

This 'big node' is then merged back into the tree and the tree is
rebalanced, replacing the entries across the spanned nodes with those
contained in the big node.

The operation is performed in mas_wr_spanning_store() which starts by
establishing two maple tree state objects ('mas' objects) on the left of
the range and on the right (l_mas and r_mas respectively).

l_mas traverses to the beginning of the range to be stored in order to copy
the data BEFORE the requested store into the big node.

We then insert our new entry immediately afterwards (both the left copy and
the storing of the new entry are combined and performed by
mas_store_b_node()).

r_mas traverses to the populated slot immediately after, in order to copy
the data AFTER the requested store into the big node.

This copy of the right-hand node is performed by mas_mab_cp() as long as
r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end.

We traverse r_mas to this position in mas_wr_node_walk() using a simple
loop:

	while (offset < count && mas->index > wr_mas->pivots[offset])
		offset++;

Note here that count is determined to be the (inclusive) index of the last
slot containing data in the node as determined by ma_data_end().

This means that even when searching for mas->index, which
mas_wr_spanning_store() sets to one plus the end of the target range in
order to traverse to the next slot, the offset < count condition terminates
the iteration at the last populated slot of the node, whether or not a
pivot greater than or equal to mas->index was found.
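
As an illustration only, the clamping behaviour of this loop can be
modelled outside the kernel. The sketch below is a toy: toy_node_walk() and
its parameters are hypothetical names, the node is reduced to a plain array
of pivots, and none of the real ma_wr_state bookkeeping is reproduced.

```c
#include <assert.h>	/* only needed by callers that assert on the result */
#include <stddef.h>

/*
 * Toy model of the loop quoted from mas_wr_node_walk(); NOT kernel code.
 * 'pivots' holds the inclusive upper bound of each slot's range and
 * 'count' is the inclusive offset of the last populated slot, as
 * ma_data_end() would report it.
 */
static unsigned char toy_node_walk(const unsigned long *pivots,
				   unsigned char count, unsigned long index)
{
	unsigned char offset = 0;

	/* Stop at the first pivot >= index, or at the last populated slot. */
	while (offset < count && index > pivots[offset])
		offset++;

	return offset;
}
```

For pivots {10, 20, 30} with count 2, searching for 15 yields offset 1,
while searching for 31 still yields offset 2, even though no pivot in the
node is greater than or equal to 31.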

The fact this right hand node contains the end of the range being stored is
why we are traversing it, and this loop is why we appear to discover a
viable range within the right node to copy to the big one.

However, if the node that r_mas traverses contains a pivot EQUAL to the end
of the range being stored, and this is the LAST pivot contained within the
node, something unexpected happens:

1. The l_mas traversal copy and insertion of the new entry in the big node
   is performed via mas_store_b_node() correctly.

2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is
   set to the offset of the entry equal to the end of the range we store.

3. We therefore copy this DUPLICATE of the final pivot into the big node,
   and insert this DUPLICATE entry, alongside its invalid slot entry,
   immediately after the newly inserted entry.

4. The big node containing this duplicate is inserted into the tree, which
   is rebalanced, and therefore the maple tree becomes corrupted.

Note that if the right hand node had one or more entries with pivots of
greater value than the end of the stored range, this would not happen. If
it contained entries with pivots of lesser value it would not be the right
node in this spanning store.
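
The failing and the benign cases described above can be contrasted with a
small toy model. This is a sketch under simplifying assumptions, not the
kernel's code: toy_walk() and toy_copies_stale_slot() are hypothetical
helpers, the right-hand node is just an array of pivots, and the store is
assumed to end below ULONG_MAX so that last + 1 does not wrap.

```c
#include <assert.h>	/* only needed by callers that assert on the result */
#include <stdbool.h>

/* Hypothetical helper mirroring the loop quoted from mas_wr_node_walk(). */
static unsigned char toy_walk(const unsigned long *pivots, unsigned char end,
			      unsigned long index)
{
	unsigned char offset = 0;

	while (offset < end && index > pivots[offset])
		offset++;

	return offset;
}

/*
 * Returns true if the pre-patch check (offset <= end) would copy a slot
 * whose pivot does not lie beyond 'last', the end of the stored range.
 */
static bool toy_copies_stale_slot(const unsigned long *pivots,
				  unsigned char end, unsigned long last)
{
	/* The right-hand walk searches for the slot just past the store. */
	unsigned char offset = toy_walk(pivots, end, last + 1);

	/* Pre-patch: the copy happens whenever offset <= end. */
	return offset <= end && pivots[offset] <= last;
}
```

With right-node pivots {40, 50} (end offset 1) and a store ending at 50,
the helper reports true: the walk stops on the slot whose pivot is 50 and
the pre-patch check would copy it. With pivots {40, 50, 60} (end offset 2)
it reports false, since the walk lands on the slot whose pivot is 60.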

This appears to have been at risk of happening throughout the maple tree's
history; however, it seemed significantly less likely to occur until
recently.

The balancing of the tree seems to have made it unlikely that you would
happen to perform a store that both spans two nodes AND would overwrite
precisely the entry with the largest pivot in the right-hand node which
contains no further larger pivots.

The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
in mmap_region()") seems to have made this event much more likely.

Prior to this change, MAP_FIXED mappings which were overwritten would first
be cleared before any subsequent store or - importantly - merge of
surrounding entries would be performed.

After this change, this is no longer the case, and this means that, in the
worst case, a number of entries might be overwritten in combination with a
merge (and subsequent overwriting expansion) between both the prior entry
AND a subsequent entry.

The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.

After Bert very generously provided his time and ability to reproduce this
event consistently, I was able to finally identify that the issue discussed
in this commit message was occurring for him.

The solution implemented in this patch is:

1. Adjust mas_wr_walk_index() to return a boolean value indicating whether
   the containing node is actually populated with entries possessing pivots
   equal to or greater than mas->index.

2. When traversing the right node in mas_wr_spanning_store(), use this
   value to determine whether to try to copy from the right node - if it is
   not populated, then do not do so.
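
The two steps above can be sketched in isolation as follows. This is a toy
rendering of the logic, not the patched kernel functions:
toy_r_populated() and toy_should_copy_right() are hypothetical names, and
the node is again reduced to an array of pivots. The patch counts a pivot
of 0 as populated; the sketch simply mirrors that condition and makes no
claim about why it holds.

```c
#include <assert.h>	/* only needed by callers that assert on the result */
#include <stdbool.h>

/*
 * Toy version of the check added to mas_wr_walk_index(): having walked to
 * 'offset' in a leaf, report whether that slot's pivot can cover 'index'.
 * A pivot of 0 is accepted, mirroring the condition in the patch.
 */
static bool toy_r_populated(const unsigned long *pivots, unsigned char offset,
			    unsigned long index)
{
	unsigned long pivot = pivots[offset];

	return pivot == 0 || index <= pivot;
}

/* Toy version of the patched guard placed in front of mas_mab_cp(). */
static bool toy_should_copy_right(bool r_populated, unsigned char offset,
				  unsigned char end)
{
	return r_populated && offset <= end;
}
```

For right-node pivots {40, 50} walked to offset 1 with index 51, the first
helper returns false, so the guard skips the copy even though offset <= end
holds. For pivots {40, 50, 60} walked to offset 2 it returns true and the
copy proceeds as before.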

This passes all maple tree unit tests and resolves the reported bug.
---
 lib/maple_tree.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 37abf0fe380b..e6f0da908ba7 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas,

 /*
  * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
+ *                      If @mas->index cannot be found within the containing
+ *                      node, we traverse to the last entry in the node.
  * @wr_mas: The maple write state
  *
  * Uses mas_slot_locked() and does not need to worry about dead nodes.
@@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
 	return true;
 }

+/*
+ * Traverse the maple tree until the offset of mas->index is reached.
+ *
+ * Return: Is this node actually populated with entries possessing pivots equal
+ *         to or greater than mas->index?
+ */
 static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 {
 	struct ma_state *mas = wr_mas->mas;
@@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 		mas_wr_walk_descend(wr_mas);
 		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
 						  mas->offset);
-		if (ma_is_leaf(wr_mas->type))
-			return true;
+		if (ma_is_leaf(wr_mas->type)) {
+			unsigned long pivot = wr_mas->pivots[mas->offset];
+
+			return pivot == 0 || mas->index <= pivot;
+		}
 		mas_wr_walk_traverse(wr_mas);

 	}
@@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	struct maple_big_node b_node;
 	struct ma_state *mas;
 	unsigned char height;
+	bool r_populated;

 	/* Left and Right side of spanning store */
 	MA_STATE(l_mas, NULL, 0, 0);
@@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 		r_mas.last++;

 	r_mas.index = r_mas.last;
-	mas_wr_walk_index(&r_wr_mas);
+	r_populated = mas_wr_walk_index(&r_wr_mas);
 	r_mas.last = r_mas.index = mas->last;

 	/* Set up left side. */
@@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	/* Copy l_mas and store the value in b_node. */
 	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
 	/* Copy r_mas into b_node. */
-	if (r_mas.offset <= r_mas.end)
+	if (r_populated && r_mas.offset <= r_mas.end)
 		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
 			   &b_node, b_node.b_end + 1);
 	else
--
2.46.2



