Am Freitag, dem 04.10.2024 um 23:41 +0100 schrieb Lorenzo Stoakes: > On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote: > > Here's the log procduced by this kernel: > > > > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5 > > 7e3bb072761a mm: correct error handling in mmap_region() > > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) Add linux-next specific files for 20241001 > > > > Again it took two attempts to trigger the bug. > > > > Bert Karwatzki > > > > Sending an updated, cleaned up version of the patch with a lot of > explanation. This is functionally identical to the v3 fix I already sent so > you can try that or this to confirm it resolves your issue. > > If you are able to do so, I can submit this to upstream for a hotfix. If > not, well then back to the drawing board and I'd be very shocked :) > > I have been able to reproduce the issue locally in our userland testing > suite entirely consistently, and this patch resolves the issue and also > continues to pass all maple tree unit tests. > > Again thank you so much for all your help - I hope you are able to find a > spare moment to quickly give this one a try and confirm whether it does > indeed address the problem you've reported. > > Thanks, Lorenzo > > ----8<---- > From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001 > From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > Date: Fri, 4 Oct 2024 15:18:58 +0100 > Subject: [PATCH] maple_tree: correct tree corruption on spanning store > > Writing a data range into a maple tree may involve overwriting a number of > existing entries that span across more than one node. Doing so invokes a > 'spanning' store. > > Performing a spanning store across two leaf nodes in a maple tree in which > entries are overwritten is achieved by first initialising a 'big' node, > which will store the coalesced entries between the two nodes comprising > entries prior to the newly stored entry, the newly stored entry, and > subsequent entries. > > This 'big node' is then merged back into the tree and the tree is > rebalanced, replacing the entries across the spanned nodes with those > contained in the big node. > > The operation is performed in mas_wr_spanning_store() which starts by > establishing two maple tree state objects ('mas' objects) on the left of > the range and on the right (l_mas and r_mas respectively). > > l_mas traverses to the beginning of the range to be stored in order to copy > the data BEFORE the requested store into the big node. > > We then insert our new entry immediately afterwards (both the left copy and > the storing of the new entry are combined and performed by > mas_store_b_node()). > > r_mas traverses to the populated slot immediately after, in order to copy > the data AFTER the requested store into the big node. > > This copy of the right-hand node is performed by mas_mab_cp() as long as > r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end. > > We traverse r_mas to this position in mas_wr_node_walk() using a simple > loop: > > while (offset < count && mas->index > wr_mas->pivots[offset]) > offset++; > > Note here that count is determined to be the (inclusive) index of the last > node containing data in the node as determined by ma_data_end(). > > This means that even in searching for mas->index, which will have been set > to one plus the end of the target range in order to traverse to the next > slot in mas_wr_spanning_store(), we will terminate the iteration at the end > of the node range even if this condition is not met due to the offset < > count condition. > > The fact this right hand node contains the end of the range being stored is > why we are traversing it, and this loop is why we appear to discover a > viable range within the right node to copy to the big one. > > However, if the node that r_mas traverses contains a pivot EQUAL to the end > of the range being stored, and this is the LAST pivot contained within the > node, something unexpected happens: > > 1. The l_mas traversal copy and insertion of the new entry in the big node > is performed via mas_store_b_node() correctly. > > 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is > set to the offset of the entry equal to the end of the range we store. > > 3. We therefore copy this DUPLICATE of the final pivot into the big node, > and insert this DUPLICATE entry, alongside its invalid slot entry > immediately after the newly inserted entry. > > 4. The big node containing this duplicated is inserted into the tree which > is rebalanced, and therefore the maple tree becomes corrupted. > > Note that if the right hand node had one or more entries with pivots of > greater value than the end of the stored range, this would not happen. If > it contained entries with pivots of lesser value it would not be the right > node in this spanning store. > > This appears to have been at risk of happening throughout the maple tree's > history, however it seemed significantly less likely to occur until > recently. > > The balancing of the tree seems to have made it unlikely that you would > happen to perform a store that both spans two nodes AND would overwrite > precisely the entry with the largest pivot in the right-hand node which > contains no further larger pivots. > > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree > in mmap_region()") seems to have made the probability of this event much > more likely. > > Previous to this change, MAP_FIXED mappings which were overwritten would > first be cleared before any subsequent store or importantly - merge of > surrounding entries - would be performed. > > After this change, this is no longer the case, and this means that, in the > worst case, a number of entries might be overwritten in combination with a > merge (and subsequent overwriting expansion) between both the prior entry > AND a subsequent entry. > > The motivation for this change arose from Bert Karwatzki's report of > encountering mm instability after the release of kernel v6.12-rc1 which, > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration > options, was identified as maple tree corruption. > > After Bert very generously provided his time and ability to reproduce this > event consistently, I was able to finally identify that the issue discussed > in this commit message was occurring for him. > > The solution implemented in this patch is: > > 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether > the containing node is actually populated with entries possessing pivots > equal to or greater than mas->index. > > 2. When traversing the right node in mas_wr_spanning_store(), use this > value to determine whether to try to copy from the right node - if it is > not populated, then do not do so. > > This passes all maple tree unit tests and resolves the reported bug. > --- > lib/maple_tree.c | 20 ++++++++++++++++---- > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 37abf0fe380b..e6f0da908ba7 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas, > > /* > * mas_wr_node_walk() - Find the correct offset for the index in the @mas. > + * If @mas->index cannot be found within the containing > + * node, we traverse to the last entry in the node. > * @wr_mas: The maple write state > * > * Uses mas_slot_locked() and does not need to worry about dead nodes. > @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) > return true; > } > > +/* > + * Traverse the maple tree until the offset of mas->index is reached. > + * > + * Return: Is this node actually populated with entries possessing pivots equal > + * to or greater than mas->index? > + */ > static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > { > struct ma_state *mas = wr_mas->mas; > @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > mas_wr_walk_descend(wr_mas); > wr_mas->content = mas_slot_locked(mas, wr_mas->slots, > mas->offset); > - if (ma_is_leaf(wr_mas->type)) > - return true; > + if (ma_is_leaf(wr_mas->type)) { > + unsigned long pivot = wr_mas->pivots[mas->offset]; > + > + return pivot == 0 || mas->index <= pivot; > + } > mas_wr_walk_traverse(wr_mas); > > } > @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > struct maple_big_node b_node; > struct ma_state *mas; > unsigned char height; > + bool r_populated; > > /* Left and Right side of spanning store */ > MA_STATE(l_mas, NULL, 0, 0); > @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > r_mas.last++; > > r_mas.index = r_mas.last; > - mas_wr_walk_index(&r_wr_mas); > + r_populated = mas_wr_walk_index(&r_wr_mas); > r_mas.last = r_mas.index = mas->last; > > /* Set up left side. */ > @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > /* Copy l_mas and store the value in b_node. */ > mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); > /* Copy r_mas into b_node. */ > - if (r_mas.offset <= r_mas.end) > + if (r_populated && r_mas.offset <= r_mas.end) > mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, > &b_node, b_node.b_end + 1); > else > -- > 2.46.2 I just tested this and it passed ten tests (i.e. upgrading the proton version i steam) in a row. Bert Karwatzki