Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Ankur Arora <ankur.a.arora@xxxxxxxxxx> · Fri, 08 Sep 2023 23:39:47 -0700

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >
>> > which actually makes me worry about the nested irq case, because this
>> > would *not* be ok:
>> >
>> >         allow_resched();
>> >            -> irq happens
>> >                 -> *nested* irq happens
>> >                 <- nested irq return (and preemption)
>> >
>> > ie the allow_resched() needs to still honor the irq count, and a
>> > nested irq return obviously must not cause any preemption.
>>
>> I think we killed nested interrupts a fair number of years ago, but I'll
>> recheck -- but not today, sleep is imminent.
>
> I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED
> thing needs to look out for any nested exception (ie only ever trigger
> if it's returning to the kernel "task" stack).
>
> Because I could easily see us wanting to do "I'm going a big user
> copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on",
> and then instead of that first "irq happens", you have "page fault
> happens" instead.
>
> And inside that page fault handling you may well have critical
> sections (like a spinlock) that is fine - but the fact that the
> "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean
> that the page fault handler can reschedule.
>
> Maybe it already does. As mentioned, I lost sight of the patch series,
> even though I saw it originally (and liked it - only realizing on your
> complaint that it migth be more dangerous than I thought).
>
> Basically, the "allow resched" should be a marker for a single context
> level only. Kind of like a register state bit that gets saved on the
> exception stack. Not a "anything happening within this process is now
> preemptible".

Yeah, exactly. Though, not even a single context level, but a flag
attached to a single context at the process level only. Using
preempt_count() == 0 as the preemption boundary.

However, this has a problem with the PREEMPT_COUNT=n case because that
doesn't have a preemption boundary.

In the example that Peter gave:

        allow_resched();
        spin_lock();
            -> irq happens
            <- irq returns

            ---> preemption happens
        spin_unlock();
        disallow_resched();

So, here the !preempt_count() clause in raw_irqentry_exit_cond_resched()
won't protect us.

My thinking was to restrict allow_resched() to be used only around
primitive operations. But, I couldn't think of any way to enforce that.

I think the warning in preempt_count_add() as Peter suggested
upthread is a good idea. But, that's only for CONFIG_DEBUG_PREEMPT.

--
ankur