Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes: > On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: >> > >> > which actually makes me worry about the nested irq case, because this >> > would *not* be ok: >> > >> > allow_resched(); >> > -> irq happens >> > -> *nested* irq happens >> > <- nested irq return (and preemption) >> > >> > ie the allow_resched() needs to still honor the irq count, and a >> > nested irq return obviously must not cause any preemption. >> >> I think we killed nested interrupts a fair number of years ago, but I'll >> recheck -- but not today, sleep is imminent. > > I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED > thing needs to look out for any nested exception (ie only ever trigger > if it's returning to the kernel "task" stack). > > Because I could easily see us wanting to do "I'm going a big user > copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on", > and then instead of that first "irq happens", you have "page fault > happens" instead. > > And inside that page fault handling you may well have critical > sections (like a spinlock) that is fine - but the fact that the > "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean > that the page fault handler can reschedule. > > Maybe it already does. As mentioned, I lost sight of the patch series, > even though I saw it originally (and liked it - only realizing on your > complaint that it migth be more dangerous than I thought). > > Basically, the "allow resched" should be a marker for a single context > level only. Kind of like a register state bit that gets saved on the > exception stack. Not a "anything happening within this process is now > preemptible". Yeah, exactly. Though, not even a single context level, but a flag attached to a single context at the process level only. Using preempt_count() == 0 as the preemption boundary. However, this has a problem with the PREEMPT_COUNT=n case because that doesn't have a preemption boundary. In the example that Peter gave: allow_resched(); spin_lock(); -> irq happens <- irq returns ---> preemption happens spin_unlock(); disallow_resched(); So, here the !preempt_count() clause in raw_irqentry_exit_cond_resched() won't protect us. My thinking was to restrict allow_resched() to be used only around primitive operations. But, I couldn't think of any way to enforce that. I think the warning in preempt_count_add() as Peter suggested upthread is a good idea. But, that's only for CONFIG_DEBUG_PREEMPT. -- ankur