On Sat, Oct 05, 2024 at 02:56:01AM +0200, Bert Karwatzki wrote: > Am Freitag, dem 04.10.2024 um 23:41 +0100 schrieb Lorenzo Stoakes: > > On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote: > > > Here's the log procduced by this kernel: > > > > > > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5 > > > 7e3bb072761a mm: correct error handling in mmap_region() > > > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) Add linux-next specific files for 20241001 > > > > > > Again it took two attempts to trigger the bug. > > > > > > Bert Karwatzki > > > > > > > Sending an updated, cleaned up version of the patch with a lot of > > explanation. This is functionally identical to the v3 fix I already sent so > > you can try that or this to confirm it resolves your issue. > > > > If you are able to do so, I can submit this to upstream for a hotfix. If > > not, well then back to the drawing board and I'd be very shocked :) > > > > I have been able to reproduce the issue locally in our userland testing > > suite entirely consistently, and this patch resolves the issue and also > > continues to pass all maple tree unit tests. > > > > Again thank you so much for all your help - I hope you are able to find a > > spare moment to quickly give this one a try and confirm whether it does > > indeed address the problem you've reported. > > > > Thanks, Lorenzo > > > > ----8<---- > > From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001 > > From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > Date: Fri, 4 Oct 2024 15:18:58 +0100 > > Subject: [PATCH] maple_tree: correct tree corruption on spanning store > > > > Writing a data range into a maple tree may involve overwriting a number of > > existing entries that span across more than one node. Doing so invokes a > > 'spanning' store. > > > > Performing a spanning store across two leaf nodes in a maple tree in which > > entries are overwritten is achieved by first initialising a 'big' node, > > which will store the coalesced entries between the two nodes comprising > > entries prior to the newly stored entry, the newly stored entry, and > > subsequent entries. > > > > This 'big node' is then merged back into the tree and the tree is > > rebalanced, replacing the entries across the spanned nodes with those > > contained in the big node. > > > > The operation is performed in mas_wr_spanning_store() which starts by > > establishing two maple tree state objects ('mas' objects) on the left of > > the range and on the right (l_mas and r_mas respectively). > > > > l_mas traverses to the beginning of the range to be stored in order to copy > > the data BEFORE the requested store into the big node. > > > > We then insert our new entry immediately afterwards (both the left copy and > > the storing of the new entry are combined and performed by > > mas_store_b_node()). > > > > r_mas traverses to the populated slot immediately after, in order to copy > > the data AFTER the requested store into the big node. > > > > This copy of the right-hand node is performed by mas_mab_cp() as long as > > r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end. > > > > We traverse r_mas to this position in mas_wr_node_walk() using a simple > > loop: > > > > while (offset < count && mas->index > wr_mas->pivots[offset]) > > offset++; > > > > Note here that count is determined to be the (inclusive) index of the last > > node containing data in the node as determined by ma_data_end(). > > > > This means that even in searching for mas->index, which will have been set > > to one plus the end of the target range in order to traverse to the next > > slot in mas_wr_spanning_store(), we will terminate the iteration at the end > > of the node range even if this condition is not met due to the offset < > > count condition. > > > > The fact this right hand node contains the end of the range being stored is > > why we are traversing it, and this loop is why we appear to discover a > > viable range within the right node to copy to the big one. > > > > However, if the node that r_mas traverses contains a pivot EQUAL to the end > > of the range being stored, and this is the LAST pivot contained within the > > node, something unexpected happens: > > > > 1. The l_mas traversal copy and insertion of the new entry in the big node > > is performed via mas_store_b_node() correctly. > > > > 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is > > set to the offset of the entry equal to the end of the range we store. > > > > 3. We therefore copy this DUPLICATE of the final pivot into the big node, > > and insert this DUPLICATE entry, alongside its invalid slot entry > > immediately after the newly inserted entry. > > > > 4. The big node containing this duplicated is inserted into the tree which > > is rebalanced, and therefore the maple tree becomes corrupted. > > > > Note that if the right hand node had one or more entries with pivots of > > greater value than the end of the stored range, this would not happen. If > > it contained entries with pivots of lesser value it would not be the right > > node in this spanning store. > > > > This appears to have been at risk of happening throughout the maple tree's > > history, however it seemed significantly less likely to occur until > > recently. > > > > The balancing of the tree seems to have made it unlikely that you would > > happen to perform a store that both spans two nodes AND would overwrite > > precisely the entry with the largest pivot in the right-hand node which > > contains no further larger pivots. > > > > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree > > in mmap_region()") seems to have made the probability of this event much > > more likely. > > > > Previous to this change, MAP_FIXED mappings which were overwritten would > > first be cleared before any subsequent store or importantly - merge of > > surrounding entries - would be performed. > > > > After this change, this is no longer the case, and this means that, in the > > worst case, a number of entries might be overwritten in combination with a > > merge (and subsequent overwriting expansion) between both the prior entry > > AND a subsequent entry. > > > > The motivation for this change arose from Bert Karwatzki's report of > > encountering mm instability after the release of kernel v6.12-rc1 which, > > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration > > options, was identified as maple tree corruption. > > > > After Bert very generously provided his time and ability to reproduce this > > event consistently, I was able to finally identify that the issue discussed > > in this commit message was occurring for him. > > > > The solution implemented in this patch is: > > > > 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether > > the containing node is actually populated with entries possessing pivots > > equal to or greater than mas->index. > > > > 2. When traversing the right node in mas_wr_spanning_store(), use this > > value to determine whether to try to copy from the right node - if it is > > not populated, then do not do so. > > > > This passes all maple tree unit tests and resolves the reported bug. > > --- > > lib/maple_tree.c | 20 ++++++++++++++++---- > > 1 file changed, 16 insertions(+), 4 deletions(-) > > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > > index 37abf0fe380b..e6f0da908ba7 100644 > > --- a/lib/maple_tree.c > > +++ b/lib/maple_tree.c > > @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas, > > > > /* > > * mas_wr_node_walk() - Find the correct offset for the index in the @mas. > > + * If @mas->index cannot be found within the containing > > + * node, we traverse to the last entry in the node. > > * @wr_mas: The maple write state > > * > > * Uses mas_slot_locked() and does not need to worry about dead nodes. > > @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas) > > return true; > > } > > > > +/* > > + * Traverse the maple tree until the offset of mas->index is reached. > > + * > > + * Return: Is this node actually populated with entries possessing pivots equal > > + * to or greater than mas->index? > > + */ > > static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > > { > > struct ma_state *mas = wr_mas->mas; > > @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas) > > mas_wr_walk_descend(wr_mas); > > wr_mas->content = mas_slot_locked(mas, wr_mas->slots, > > mas->offset); > > - if (ma_is_leaf(wr_mas->type)) > > - return true; > > + if (ma_is_leaf(wr_mas->type)) { > > + unsigned long pivot = wr_mas->pivots[mas->offset]; > > + > > + return pivot == 0 || mas->index <= pivot; > > + } > > mas_wr_walk_traverse(wr_mas); > > > > } > > @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > > struct maple_big_node b_node; > > struct ma_state *mas; > > unsigned char height; > > + bool r_populated; > > > > /* Left and Right side of spanning store */ > > MA_STATE(l_mas, NULL, 0, 0); > > @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > > r_mas.last++; > > > > r_mas.index = r_mas.last; > > - mas_wr_walk_index(&r_wr_mas); > > + r_populated = mas_wr_walk_index(&r_wr_mas); > > r_mas.last = r_mas.index = mas->last; > > > > /* Set up left side. */ > > @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas) > > /* Copy l_mas and store the value in b_node. */ > > mas_store_b_node(&l_wr_mas, &b_node, l_mas.end); > > /* Copy r_mas into b_node. */ > > - if (r_mas.offset <= r_mas.end) > > + if (r_populated && r_mas.offset <= r_mas.end) > > mas_mab_cp(&r_mas, r_mas.offset, r_mas.end, > > &b_node, b_node.b_end + 1); > > else > > -- > > 2.46.2 > > I just tested this and it passed ten tests (i.e. upgrading the proton version i > steam) in a row. > > Bert Karwatzki Perfect :) will send the fix upstream then as a hotfix for 6.12! Thanks very much for helping out with this, your help has been absolutely invaluable and HUGELY appreciated. Cheers, Lorenzo