Writing a data range into a maple tree may involve overwriting a number of existing entries that span across more than one node. Doing so invokes a 'spanning' store. Performing a spanning store across two leaf nodes in a maple tree in which entries are overwritten is achieved by first initialising a 'big' node, which will store the coalesced entries between the two nodes comprising entries prior to the newly stored entry, the newly stored entry, and subsequent entries. This 'big node' is then merged back into the tree and the tree is rebalanced, replacing the entries across the spanned nodes with those contained in the big node. The operation is performed in mas_wr_spanning_store() which starts by establishing two maple tree state objects ('mas' objects) on the left of the range and on the right (l_mas and r_mas respectively). l_mas traverses to the beginning of the range to be stored in order to copy the data BEFORE the requested store into the big node. We then insert our new entry immediately afterwards (both the left copy and the storing of the new entry are combined and performed by mas_store_b_node()). r_mas traverses to the populated slot immediately after, in order to copy the data AFTER the requested store into the big node. This copy of the right-hand node is performed by mas_mab_cp() as long as r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end. We traverse r_mas to this position in mas_wr_node_walk() using a simple loop: while (offset < count && mas->index > wr_mas->pivots[offset]) offset++; Note here that count is determined to be the (inclusive) index of the last node containing data in the node as determined by ma_data_end(). This means that even in searching for mas->index, which will have been set to one plus the end of the target range in order to traverse to the next slot in mas_wr_spanning_store(), we will terminate the iteration at the end of the node range even if this condition is not met due to the offset < count condition. The fact this right hand node contains the end of the range being stored is why we are traversing it, and this loop is why we appear to discover a viable range within the right node to copy to the big one. However, if the node that r_mas traverses contains a pivot EQUAL to the end of the range being stored, and this is the LAST pivot contained within the node, something unexpected happens: 1. The l_mas traversal copy and insertion of the new entry in the big node is performed via mas_store_b_node() correctly. 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is set to the offset of the entry equal to the end of the range we store. 3. We therefore copy this DUPLICATE of the final pivot into the big node, and insert this DUPLICATE entry, alongside its invalid slot entry immediately after the newly inserted entry. 4. The big node containing this duplicated is inserted into the tree which is rebalanced, and therefore the maple tree becomes corrupted. Note that if the right hand node had one or more entries with pivots of greater value than the end of the stored range, this would not happen. If it contained entries with pivots of lesser value it would not be the right node in this spanning store. This appears to have been at risk of happening throughout the maple tree's history, however it seemed significantly less likely to occur until recently. The balancing of the tree seems to have made it unlikely that you would happen to perform a store that both spans two nodes AND would overwrite precisely the entry with the largest pivot in the right-hand node which contains no further larger pivots. The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") seems to have made the probability of this event much more likely. Previous to this change, MAP_FIXED mappings which were overwritten would first be cleared before any subsequent store or importantly - merge of surrounding entries - would be performed. After this change, this is no longer the case, and this means that, in the worst case, a number of entries might be overwritten in combination with a merge (and subsequent overwriting expansion) between both the prior entry AND a subsequent entry. The motivation for this change arose from Bert Karwatzki's report of encountering mm instability after the release of kernel v6.12-rc1 which, after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration options, was identified as maple tree corruption. After Bert very generously provided his time and ability to reproduce this event consistently, I was able to finally identify that the issue discussed in this commit message was occurring for him. The solution implemented in this patch is: 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether the containing node is actually populated with entries possessing pivots equal to or greater than mas->index. 2. When traversing the right node in mas_wr_spanning_store(), use this value to determine whether to try to copy from the right node - if it is not populated, then do not do so. This passes all maple tree unit tests and resolves the reported bug. --- lib/maple_tree.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 37abf0fe380b..e6f0da908ba7 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas, /* * mas_wr_node_walk() - Find the correct offset for the index in the @mas. + * If @mas->index cannot be found within the containing + * node, we traverse to the last entry in the node. * @wr_mas: The maple write state * * Uses mas_slot_locked() and does not need to worry about dead nodes. @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) return true; } +/* + * Traverse the maple tree until the offset of mas->index is reached. + * + * Return: Is this node actually populated with entries possessing pivots equal + * to or greater than mas->index? + */ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) { struct ma_state *mas = wr_mas->mas; @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) mas_wr_walk_descend(wr_mas); wr_mas->content = mas_slot_locked(mas, wr_mas->slots, mas->offset); - if (ma_is_leaf(wr_mas->type)) - return true; + if (ma_is_leaf(wr_mas->type)) { + unsigned long pivot = wr_mas->pivots[mas->offset]; + + return pivot == 0 || mas->index <= pivot; + } mas_wr_walk_traverse(wr_mas); } @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) struct maple_big_node b_node; struct ma_state *mas; unsigned char height; + bool r_populated; /* Left and Right side of spanning store */ MA_STATE(l_mas, NULL, 0, 0); @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) r_mas.last++; r_mas.index = r_mas.last; - mas_wr_walk_index(&r_wr_mas); + r_populated = mas_wr_walk_index(&r_wr_mas); r_mas.last = r_mas.index = mas->last; /* Set up left side. */ @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) /* Copy l_mas and store the value in b_node. */ mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); /* Copy r_mas into b_node. */ - if (r_mas.offset <= r_mas.end) + if (r_populated && r_mas.offset <= r_mas.end) mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, &b_node, b_node.b_end + 1); else -- 2.46.2