Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code

Brendan Jackman <jackmanb@xxxxxxxxxx> · Tue, 11 Feb 2025 11:02:27 +0100

On Tue, 11 Feb 2025 at 04:50, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 2025-02-10 at 16:27 +0100, Brendan Jackman wrote:
> > On Thu, 6 Feb 2025 at 05:46, Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > >  /* Wait for INVLPGB originated by this CPU to complete. */
> > > -static inline void tlbsync(void)
> > > +static inline void __tlbsync(void)
> > >  {
> > > -       cant_migrate();
> >
> > Why does this have to go away?
>
> I'm not sure the current task in sched_init() has
> all the correct bits set to prevent the warning
> from firing, but on the flip side it won't have
> called INVLPGB yet at that point, so the call to
> enter_lazy_tlb() won't actually end up here.
>
> I'll put it back.

Sounds good.

FWIW I think if we do run into early-boot code hitting false
DEBUG_ATOMIC_SLEEP warnings, the best response might be to update the
DEBUG_ATOMIC_SLEEP code. Like maybe there's a more targeted solution
but something roughly equivalent to checking if (system_state ==
SYSTEM_STATE_SCHEDULING) before the warning.

> > > @@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > >         if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> > >                 WARN_ON_ONCE(!irqs_disabled());
> > >
> > > +       tlbsync();
> > > +
> > >         /*
> > >          * Verify that CR3 is what we think it is.  This will catch
> > >          * hypothetical buggy code that directly switches to
> > > swapper_pg_dir
> > > @@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > >   */
> > >  void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> > >  {
> > > +       tlbsync();
> > > +
> >
> > I have a feeling I'll look stupid for asking this, but why do we need
> > this and the one in switch_mm_irqs_off()?
>
> This is an architectural thing: TLBSYNC waits for
> the INVLPGB flushes to finish that were issued
> from the same CPU.
>
> That means if we have pending flushes (from the
> pageout code), we need to wait for them at context
> switch time, before the task could potentially be
> migrated to another CPU.

Oh right thanks, that makes sense.

So I think here we're encoding the assumption that context_switch()
always calls either enter_lazy_tlb() or switch_mm_irqs_off(), which is
a little awkward, plus the job of these functions is already kinda
hazy and this makes it even hazier. What about doing it in
arch_start_context_switch() instead?

That would mean a bit of plumbing since we'd still wanna have the
tlbsync() in tlb.c, but that seems worth it to me. Plus, having it in
one place would give us a spot to add a comment. Now that you point it
out it does indeed seem obvious but it didn't seem so yesterday.

Now I think about it... if we always tlbsync() before a context
switch, is the cant_migrate() above actually required? I think with
that, even if we migrated in the middle of e.g.
broadcast_kernel_range_flush(), we'd be fine? (At least, from the
specific perspective of the invplgb code, presumably having preemption
on there would break things horribly in other ways).