On Fri, Jan 24, 2025 at 01:19:45PM -0500, Liam R. Howlett wrote:
> Adding Oleg, who I forgot on the start of this patch..
>
> * Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [250124 06:15]:
> > On Thu, Jan 23, 2025 at 03:58:49PM -0500, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
> > >
> > > In the event that there is a failure during dup_mmap(), the maple
> > > tree can be left in an unsafe state for other iterators besides
> > > the exit path.
> > >
> > > The unsafe state is created after the tree is cloned, but before
> > > the vmas are replaced; if a vma allocation fails (for instance),
> > > then the tree will have a marker (XA_ZERO_ENTRY) to denote where
> > > to stop destroying vmas on the exit path. This marker replaces a
> > > vma in the tree and may be treated as a pointer to a vma in
> > > iterators besides the special-case exit_mmap() iterator.
> >
> > Thanks for addressing this.
>
> Thanks for looking at the patch.
>
> > In the original issue, the problem arose in vm_area_dup(), which I
> > tracked down to vma_iter_load() -> mas_walk() (see [0]).
> >
> > I'm not sure if your change would actually have addressed this? Would
> > the MMF_UNSTABLE flag prevent this code path from being taken (if so,
> > where?), and if not, don't we need to add a check for this somewhere?
> >
> > [0]: https://lore.kernel.org/all/aa2c1930-becc-4bc5-adfb-96e88290acc7@lucifer.local/
>
> Remote vma accesses use check_stable_address_space() to avoid races
> with the oom killer, which checks MMF_UNSTABLE.
>
> Yeah, uprobe needs to do this and doesn't today.
>
> Oleg suggested this be done in build_map_info(), but the MMF_UNSTABLE
> bit is set under the mmap read lock, and that function uses the read
> lock on the rmap, takes an mm reference, and records the address - then
> revisits the mm later to get the vma in register_for_each_vma().
>
> I think we need to check in register_for_each_vma(), prior to looking
> up the vma.
> This way we hold the write lock on the mm and will ensure the vma is
> stable from both the fork setup and potential OOM events. The concern
> is that we have an invalid vma, or that the prev/next is invalid; that
> will only be a concern when operating on the vma itself, so this would
> be the safest place imo.
>
> The problem that has been created is that the next vma may be invalid.
> This arises when the rmap is used to find a vma and then change that
> vma. In this scenario, the mmap lock of the vma needs to be held and
> that mm needs to be checked as stable/not oom'ed. So we can limit the
> checking to the scenario where another mm's vma is found and modified -
> which may cause a merge and look at the next vma.
>
> I haven't found anywhere else that looks up another mm_struct's vma and
> makes a change to the vma. Considering what that would mean, it makes
> sense that this is rare.
>
> So, I'll just do this:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1260,6 +1260,9 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>  	 * returns NULL in find_active_uprobe_rcu().
>  	 */
>  	mmap_write_lock(mm);
> +	if (check_stable_address_space(mm))
> +		goto unlock;
> +

Yeah, I think on reflection this is sensible, at least as a proximate
solution. It certainly doesn't harm to apply the MMF_UNSTABLE flag on
top of this, and we can always expand where we check/what we do with
this.

> > > All the locks are dropped before the exit_mmap() call, but the
> > > incomplete mm_struct can be reached through (at least) the rmap
> > > finding the vmas, which have a pointer back to the mm_struct.
> > >
> > > Up to this point, there have been no issues with being able to find
> > > an mm_struct that was only partially initialised.
> > > Syzbot was able to make the incomplete mm_struct fail with recent
> > > forking changes, so it has been proven unsafe to use an mm_struct
> > > that hasn't been initialised, as referenced in the link below.
> > >
> > > Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
> > > invalid mm") fixed the uprobe access, it does not completely remove
> > > the race.
> > >
> > > This patch sets MMF_OOM_SKIP to avoid the iteration of the vmas on
> > > the oom side (even though this mm is extremely unlikely to be
> > > selected as an oom victim in the race window), and sets
> > > MMF_UNSTABLE to avoid other potential users from using a partially
> > > initialised mm_struct.
> >
> > Good to have it set also, I think.
> >
> > We were concerned that _somehow_ MMF_UNSTABLE might cause some issue
> > in OOM, which was why, when I first thought of this, we decided maybe
> > not to, just to be safe. But I am confident this will not be an
> > issue: indicating that something is about to be torn down and is
> > unstable when it is in fact unstable and about to be torn down is
> > correct, plus I checked all the paths this impacted and found no
> > issues.
> >
> > And in any case, setting this flag avoids the problem...!
> >
> > > Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@xxxxxxxxxx/
> > > Fixes: d240629148377 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
> > > Cc: Jann Horn <jannh@xxxxxxxxxx>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > > Cc: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > > ---
> > >  kernel/fork.c | 17 ++++++++++++++---
> > >  1 file changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index ded49f18cd95c..20b2120f019ca 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -760,7 +760,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  		mt_set_in_rcu(vmi.mas.tree);
> > >  		ksm_fork(mm, oldmm);
> > >  		khugepaged_fork(mm, oldmm);
> > > -	} else if (mpnt) {
> > > +	} else {
> > > +
> > >  		/*
> > >  		 * The entire maple tree has already been duplicated. If the
> > >  		 * mmap duplication fails, mark the failure point with
> > > @@ -768,8 +769,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  		 * stop releasing VMAs that have not been duplicated after this
> > >  		 * point.
> > >  		 */
> > > -		mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > > -		mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > +		if (mpnt) {
> > > +			mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > > +			mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > +			/* Avoid OOM iterating a broken tree */
> > > +			set_bit(MMF_OOM_SKIP, &mm->flags);
> > > +		}
> > > +		/*
> > > +		 * The mm_struct is going to exit, but the locks will be
> > > +		 * dropped first. Marking the mm_struct as unstable is
> > > +		 * advisable as it is not fully initialised.
> > > +		 */
> > > +		set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > Do we still need to do this if !mpnt?
>
> No, we probably don't need this now. But it still makes sense to do
> this considering what we are saying. There is a failed mm, let's mark
> it as unstable because it's going away in a moment - it's not stable,
> it's not even a complete mm.
>
> > Also 'mpnt' is in the running for the worst variable name in mm...
>
> mpnt is the oldMm vma Pointer, Now isn'T it? I think that's self
> documenting. :)))
>
> Thanks,
> Liam