On 2023/06/22 8:24, Tetsuo Handa wrote:
> By the way, given
>
>   write_seqlock_irqsave(&zonelist_update_seq, flags);
>   <<IRQ>>
>     some_timer_function() {
>       kmalloc(GFP_ATOMIC);
>     }
>   <</IRQ>>
>   printk_deferred_enter();
>
> scenario in CONFIG_PREEMPT_RT=y case is handled by executing some_timer_function()
> on a dedicated kernel thread for IRQs, what guarantees that the kernel thread for
> IRQs gives up CPU and the user thread which called write_seqlock() gains CPU until
> write_sequnlock() is called? How can the kernel figure out that executing the user
> thread needs higher priority than the kernel thread?

I haven't received a response to this question.

Several years ago, I demonstrated that a SCHED_IDLE priority userspace thread
holding oom_lock causes other concurrently allocating !SCHED_IDLE priority
threads to wrongly assume that a mutex_trylock(&oom_lock) failure implies we
are making forward progress (even though the SCHED_IDLE priority userspace
thread was unable to run for minutes).

If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is
preempted by some other !SCHED_IDLE priority thread (especially a realtime
priority thread), and that thread calls kmalloc(GFP_ATOMIC) or printk(), a
similar thing can happen: the preempting thread wrongly assumes that spinning
on read_seqbegin() from zonelist_iter_begin() can make forward progress, even
though the thread which called write_seqlock_irqsave() cannot make progress
due to preemption.

Question to Sebastian:
To make sure that such a thing cannot happen, we should make sure that a
thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount) from
write_seqlock_irqsave(&zonelist_update_seq, flags) can keep using the CPU until
write_seqcount_end(&zonelist_update_seq.seqcount) from
write_sequnlock_irqrestore(&zonelist_update_seq, flags).
Does adding preempt_disable() before
write_seqlock_irqsave(&zonelist_update_seq, flags) help?
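
To make the concern concrete, here is a minimal userspace model of the
reader/writer interaction. It is only a sketch, not kernel code: the model_*
names and the bounded spin count are made up for illustration, and the model
assumes nothing beyond "the sequence is odd while a write section is open and
readers retry while it is odd".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a seqcount; not the kernel's seqcount_t. */
struct model_seqcount {
	unsigned int sequence;	/* odd while a write section is open */
};

/* Models entering the write section: the counter becomes odd. */
static inline void model_write_begin(struct model_seqcount *s)
{
	s->sequence++;
}

/* Models leaving the write section: the counter becomes even again. */
static inline void model_write_end(struct model_seqcount *s)
{
	assert(s->sequence & 1);	/* must pair with model_write_begin() */
	s->sequence++;
}

/* Models the reader's retry condition: spin for as long as the counter is odd. */
static inline bool model_reader_must_spin(const struct model_seqcount *s)
{
	return s->sequence & 1;
}

/*
 * Spin at most max_spins times; returns true if the reader got through.
 * A real reader has no such bound; the bound only keeps the model finite.
 */
static inline bool model_reader_try(const struct model_seqcount *s,
				    unsigned int max_spins)
{
	while (model_reader_must_spin(s)) {
		if (max_spins-- == 0)
			return false;
	}
	return true;
}
```

In this model, if the writer stops running between model_write_begin() and
model_write_end(), model_reader_try() fails no matter how large max_spins is.
That is the lack of forward progress I am worried about when the writer is
preempted by the very thread that spins as a reader.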

Question to Peter:
Even if local_irq_save(flags) disables IRQs, NMI context can still enqueue a
message via printk(). When does a message enqueued from NMI context get
printed? Is there a possibility that such a message gets printed between
"write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter()"
or between
"printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)"?
If yes, we can't increment zonelist_update_seq.seqcount before
printk_deferred_enter()...