Re: [PATCH] maple_tree: Fix sparse reported issues

Liam Howlett <liam.howlett@xxxxxxxxxx> · Mon, 18 Jul 2022 13:45:47 +0000

* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [220718 08:56]:
> * Hugh Dickins <hughd@xxxxxxxxxx> [220718 00:28]:
> > On Mon, 18 Jul 2022, Liam Howlett wrote:
> > > * Hugh Dickins <hughd@xxxxxxxxxx> [220717 16:58]:
> > > > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > > > > 
> > > > > Please find attached the last outstanding fix for this series.  It
> > > > > covers a rare failure scenario that one of the build bots reported.
> > > > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > > > but leaves the locking in the existing order.
> > > > 
> > > > Thanks, I folded that in to my testing on next-20220715, along with
> > > > other you posted on Friday (mas_dead_leaves() walk fix);
> > > 
> > >   Please drop that patch, it needs more testing.
> > 
> > Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
> > which you attached in that mail to me? please let me know a.s.a.p,
> > since I cannot proceed without knowing which.
> 
> Drop the mas_dead_leaves() walk fix please.  I responded to the patch
> that it's not tested enough.  I'll respond to the rest of this email
> soon.

So the mas_dead_leaves() patch will most likely produce memory leaks on
big-endian systems.

> 
> > 
> > > 
> > > > but have not
> > > > even glanced at the v11 you posted Saturday, though I see from responses
> > > > that v11 has some other changes, including __vma_adjust() again, but I
> > > > just have not looked.  (I've had my own experiments in __vma_adjust()).
> > > > 
> > > > You'll be wanting my report, I'll give it here rather than in the v11
> > > > thread.  In short, some progress, but still frustratingly none on the
> > > > main crashes.
> > > > 
> > > > 1. swapops.h BUG on !PageLocked migration entry.  This is *not* the
> > > > most urgent to fix, I'm just listing it first to get it out of the way
> > > > here.  This is the BUG that terminates every tmpfs swapping load test
> > > > on the laptop: only progress was that, just one time, the workstation
> > > > hit it before hitting its usual problems: nice to see it there too.
> > > > I'll worry about this bug when the rest is running stably.  I've only
> > > > ever noticed it when it's brk being unmapped, I expect that's a clue.
> > > 
> > > Thanks for pointing me towards a usable reproducer.  I should be able to
> > > narrow it down from there, especially with the brk hint.
> > 
> > I'm afraid I cannot point you to a good reproducer; but anyway, the BUG
> > necessarily appears some time after whatever code is at fault has been
> > exercised, so it needs thought rather than any reproducer.  It was not
> > obvious to me, but I'll think it through again, once the other issues
> > are resolved.
> > 
> > > 
> > > > 
> > > > 2. Silly in do_mas_mumap():
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> > > >  	arch_unmap(mm, start, end);
> > > >  
> > > >  	/* Find the first overlapping VMA */
> > > > -	vma = mas_find(mas, end - 1);
> > > > +	vma = mas_find(mas, start);
> > > >  	if (!vma)
> > > >  		return 0;
> > > >  
> > > > Fixing that does fix some bad behaviors seen - I'd been having crashes in
> > > > unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> > > > in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> > > > is fixed thus.  I had high hopes that it would fix all the rest, but no.
> > > 
> > > This is actually correct.  mas_find() is not the same as vma_find().
> > > mas_find() uses the maple state index and searches until a limit (end
> > > -1 in this case).  I haven't seen these crashes, but I think you are
> > > hitting the same issue you are discussing in #6 below.  I also hadn't
> > > realised the potential confusion of those APIs.
> > 
> > You're right, I'm wrong, sorry about that.  But then I'm left with the
> > conundrum of how a class of crashes went away when I changed that.  Did
> > I break it all so badly that it couldn't reach the anon_vma bugs I was
> > hitting before?  Or did making that change coincide with my moving from 
> > DEBUG_MAPLE off to on, so different crashes came sooner?  Or did I fold
> > in another fix from you around that time?  I don't know (and I don't
> > expect you to answer this!), but I've got some backtracking to do.

I expect it is because your search skipped the badness inserted by the
bug from #6.  I would think the workload that this was crashing on would
split the nodes in a certain way that bad VMAs ended up ahead of the
correct data?

> > 
> > > 
> > > > 
> > > > 3. vma_adjust_trans_huge().  Skip this paragraph, I think there
> > > > is actually no problem here, but for safety I have rearranged the
> > > > vma_adjust_trans_huge()s inside the rmap locks, because when things
> > > > aren't working right, best to rule out some possibilities.  Why am
> > > > I even mentioning it here?  In case I send any code snippets and
> > > > you're puzzled by vma_adjust_trans_huge() having moved.
> > > 
> > > Thanks, I will keep this in mind.
> > > 
> > > > 
> > > > 4. unlink_anon_vmas() in __vma_adjust().  Not urgent to fix (can only
> > > > be an issue when GFP_KERNEL preallocation fails, which I think means
> > > > when current is being OOM-killed), but whereas vma_expand() has careful
> > > > anon_cloned flag, __vma_adjust() does not, and I think could be
> > > > unlinking a pre-existing anon_vma.  Aside from that, I am nervous about
> > > > using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> > > > an unnecessary "optimization" for a very unlikely case, that's in danger
> > > > of doing more harm than good; and should be removed from them both (then
> > > > they can both simply return -ENOMEM when mas_preallocate() fails).
> > > 
> > > I will add a flag to __vma_adjust, but I don't see how it could happen.
> > > I guess if importer has an anon_vma already?  I added these as an unwind
> > > since I don't want to leak - even on the rare preallocation failure.  If
> > > you don't want to unroll, what would you suggest in these cases?  Would
> > > a flag be acceptable?
> > 
> > I cannot see what purpose adding a flag to __vma_adjust() would serve.
> > If importer had anon_vma already, yes, without checking the code again,
> > that was I think what I had in mind.  But (correct me if you've observed
> > that I'm wrong) there's no question of a leak there: a vma which wasn't
> > given an anon_vma before gets linked in to one, and it will all get torn
> > down correctly when the process exits (indeed, when OOM kill completes).
> > 
> > It's "nice" to delay setting anon_vma until it's needed, but not worth
> > any effort to rewind it (especially on such an unlikely path): and
> > normally, once anon_vma has been set, it stays set - I'm not sure of
> > the consequence of unsetting it again (racing with rmap lookups: may
> > be okay because of how anon_vma is not trusted when page is not mapped;
> > but it's easier just to omit the rewind than think all that through).

So the only time I've even seen __vma_adjust() fail is with a fault
injector failing mas_preallocate() allocations.  If it's safe to not
unwind, I'm happy to drop both unwinds but I was concerned in the path
of a vma_merge() calling __vma_adjust() and failing out on allocations
then OOM recovering, leaving a VMA with a 1/2 merged vma with anon
incorrectly set.. which is an even more unlikely scenario.

> > 
> > > 
> > > > 
> > > > 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> > > > does a mas_destroy() at the end.  So only the first vma_mas_store() in
> > > > __vma_adjust() gets to use the carefully preallocated nodes.  I thought
> > > > that might be responsible for lots of nastiness, but again sadly no.
> > > > Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> > > > (I didn't look) and that often works okay.  Whether the carefully
> > > > chosen prealloc count allows for __vma_adjust() use would be another
> > > > question.  (Earlier on I did try doubling its calculation, in case it
> > > > was preallocating too few, but that made no difference either.)
> > > 
> > > mas_store_prealloc will allocate for the worst case scenario.  Since I
> > > cannot guarantee the second store isn't the worst case, and could fail,
> > > I cannot allow for more than one allocation per preallocate.  I thought
> > > I was fine in __vma_adjust since I preallocate after the goto label for
> > > a second removal but it turns out I wasn't since the second preallocate
> > > could fail, so I've removed the requirement for a second store for 'case
> > > 6' by updating the tree once and removing both VMAs in a single pass.
> > > There could, potentially be an issue if the caller to __vma_merge()
> > > wanted to reduce one limit of the VMA (I guess just the start..) and
> > > also remove one or more VMAs, and also be in a situation where
> > > allocations could cause issues (fs_reclaim).. and since __vma_adjust()
> > > is so complicated, I looked at the callers.  Most use vma_merge(), and
> > > those seem fine.  fs/exec only adjusts one at a time.  when splitting,
> > > only a single insert happens.  Did I miss some situation(s)?
> > 
> > I don't think the fs/exec stack moving will be any worry for this.
> > Did you miss any case?  Yes, the "insert" cases from __split_vma().
> > I have no appreciation of when maple tree actually needs to make an
> > allocation, so I don't know whether the adjust_next case ever needs
> > to make an allocation, but I'd have thought the insert case might
> > need to sometimes.

Right, the __split_vma() never adjusts anything but one side of the
'vma' VMA by inserting the 'insert' VMA.  This will result in two writes
to the tree - but one will exactly fit in an existing range which will
be placed without an allocation via the mas_wr_slot_store() function in
the maple tree.  Exact fits are nice - they are fast.

> > 
> > But I'll admit to skimming your paragraph there, I'm more concerned
> > to go on to the following issue than fully grasp your argument above.
> > 
> > > 
> > > > 
> > > > 6. The main problem, crashes on the workstation (never seen on the
> > > > laptop).  I keep hacking around with the debug info (and, please,
> > > > %px not %p, %lx not %lu in the debug info: I've changed them all,
> > > 
> > > Okay, so I tried to remove all %px in the debug code so I'll revert
> > > those.  I use %lu for historic reasons from mt_dump(), I can change
> > > those too,  The tree uses ranges to store pointers so I print the
> > > pointers in %lx and the ranges in %lu.
> > > 
> > > 
> > > > and a couple of %lds, in my copy of lib/maple_tree.c).  I forget
> > > > how the typical crashes appeared with the MAPLE_DEBUGs turned off
> > > > (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> > > > turned them on, but usually comment out the mt_validate() and
> > > > mt_dump(), which generate far too much noise for non-specialists,
> > > > and delay the onset of crash tenfold (but re-enabled to give you
> > > > the attached messages.xz).
> > > > 
> > > > Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> > > > mostly what's running in this load test, so that's not surprising;
> > > > but surprising that even when I switched the laptop to using same
> > > > gcc-9, couldn't reproduce the problem there). Typically, that
> > > > munmapping has involved splitting a small, say three page, vma
> > > > into two pages below and one page above (the "insert", and I've
> > > > hacked the debug to dump that too, see "mmap: insert" - ah,
> > > > looking at the messages now, the insert is bigger this time).
> > > > 
> > > > And what has happened each time I've studied it (I don't know
> > > > if this is evident from the mt dumps in the messages, I hope so)
> > > > is that the vma and the insert have been correctly placed in the
> > > > tree, except that the vma has *also* been placed several pages
> > > > earlier, and a linear search of the tree finds that misplaced
> > > > instance first, vm_start not matching mt index.
> > > > 
> > > > The map_count in these cases has always been around 59, 60, 61:
> > > > maybe just typical for cc1, or maybe significant for maple tree?
> > > > 
> > > > I won't give up quite yet, but I'm hoping you'll have an idea for
> > > > the misplaced tree entry.  Something going wrong in rebalancing?
> > > > 
> > > > For a long time I assumed the problem would be at the mm/mmap.c end,
> > > > and I've repeatedly tried different things with the vma_mas stuff
> > > > in __vma_adjust() (for example, using vma_mas_remove() to remove
> > > > vmas before re-adding them, and/or doing mas_reset() in more places);
> > > > but none of those attempts actually fixed the issue.  So now I'm
> > > > thinking the problem is really at the lib/maple_tree.c end.
> > > > 
> > > 
> > > Do you have the patch
> > > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > > like your issue fits this fix exactly. I was seeing the same issue with
> > > gcc 9.3.1 20200408 and this bug doesn't happen for me now.  The logs
> > > you sent also fit the situation. I went through the same exercise
> > > (exorcism?) of debugging the various additions and removals of the VMA
> > > only to find the issue in the tree itself.  The fix also modified the
> > > test code to detect the issue - which was actually hit but not detected
> > > in the existing test cases from a live capture of VMA activities.  It is
> > > difficult to spot in the tree dump as well.  I am sure I sent this to
> > > Andrew as it is included in v11 and did not show up in his diff, but I
> > > cannot find it on lore, perhaps I forgot to CC you?  I've attached it
> > > here for you in case you missed it.
> > 
> > Thanks!  No, I never received that patch, nor can I see it on lore
> > or marc.info; but I (still) haven't looked at v11, and don't know
> > about Andrew's diff.  Anyway, sounds exciting, I'm eager to stop
> > writing this mail and get to testing with that in - but please
> > let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> > mods you attached previously, which you want me to leave out.

Sorry, I wandered off after the previous email.  It was the
mas_dead_leaves() patch I was referencing to drop.  I sent it out before
ensuring it was safe on all architectures and one - arm64 or s390
probably, wasn't happy with that change.

> > 
> > > 
> > > You are actually hitting the issue several iterations beyond when it
> > > first occurred.  It was introduced earlier in the tree and exposed with
> > > your other operations by means of a node split or merge.
> > > 
> > > > 7. If you get to do cleanups later, please shrink those double blank
> > > > lines to single blank lines.  And find a better name for the strange
> > > > vma_mas_szero() - vma_mas_erase(), or is erase something different?
> > > > I'm not even sure that it's needed: that's a special case for exec's
> > > > shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> > > > in a non-standard way.  And trace_vma_mas_szero() in vma_mas_remove()
> > > > looks wrong.
> > > 
> > > Okay, I'll be sure to only have one blank line.  Where do you see this?
> > > I would have thought that checkpatch would complain?  I did try to,
> > 
> > No, I'm not going to search for those double blank lines now:
> > please do your own diff and look through for them.  And I don't know
> > whether checkpatch objects to them or not, but they're bad for patch
> > application, since they increase the likelihood that a patch applies
> > in the wrong place.
> > 
> > As to whether this is or is not a good time for such cleanups,
> > I just don't know: I see Andrew on the one hand intending to drop
> > maple tree for the moment, but Linus on the other hand promising
> > an extra week for 5.19.  I'll just let others decide what they want.
> > 
> > Hugh
> > 
> > > regretfully, address more checkpatch issues on v11.  It added more noise
> > > to the differences of v10 + patches to v11 than anything else.
> > > 
> > > 
> > > Thanks again,
> > > Liam