Re: [PATCH] mm/page_alloc: Use write_seqlock_irqsave() instead of write_seqlock() + local_irq_save().

On Thu 2023-06-22 16:11:41, Petr Mladek wrote:
> On Thu 2023-06-22 22:36:27, Tetsuo Handa wrote:
> > On 2023/06/22 8:24, Tetsuo Handa wrote:
> > > By the way, given
> > > 
> > >   write_seqlock_irqsave(&zonelist_update_seq, flags);
> > >   <<IRQ>>
> > >     some_timer_function() {
> > >       kmalloc(GFP_ATOMIC);
> > >     }
> > >   <</IRQ>>
> > >   printk_deferred_enter();
> > > 
> > > scenario, which in the CONFIG_PREEMPT_RT=y case is handled by executing
> > > some_timer_function() on a dedicated kernel thread for IRQs, what
> > > guarantees that the kernel thread for IRQs gives up the CPU and the user
> > > thread which called write_seqlock() gets to run until write_sequnlock()
> > > is called? How can the kernel figure out that the user thread needs a
> > > higher priority than the kernel thread?

My understanding is that this is achieved by spin_lock_irqsave(&sl->lock, flags).
When RT is enabled, rt_spin_lock(lock) is used.

AFAIK, rt_spin_lock(lock) fulfills exactly the above requirements.
The owner may schedule. The waiter may schedule as well, so that
they can run on the same CPU. Also, the current owner gets a higher
priority when there is a waiter with a higher priority, which avoids
the priority inversion.
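
For reference, the writer path maps to the underlying lock roughly like
this (paraphrased from include/linux/seqlock.h; details simplified):

static inline void write_seqlock(seqlock_t *sl)
{
	/* On PREEMPT_RT this spin_lock() is an rt_spin_lock(): a
	 * sleeping lock with priority inheritance, so a higher-priority
	 * waiter boosts the current owner instead of busy-waiting
	 * behind it. */
	spin_lock(&sl->lock);
	write_seqcount_begin(&sl->seqcount);
}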

> > I haven't got a response to this question.
> > 
> > Several years ago, I demonstrated that a SCHED_IDLE priority userspace thread holding
> > oom_lock causes other concurrently allocating !SCHED_IDLE priority threads to
> > wrongly assume that mutex_trylock(&oom_lock) failure implies that forward
> > progress is being made (even though the SCHED_IDLE priority userspace thread was
> > unable to wake up for minutes).
> > 
> > If a SCHED_IDLE priority thread which called write_seqlock_irqsave() is preempted by
> > other !SCHED_IDLE priority threads (especially realtime priority threads), and
> > such a !SCHED_IDLE priority thread calls kmalloc(GFP_ATOMIC) or printk(), a similar
> > problem can happen: the preempting threads wrongly assume that spinning on
> > read_seqbegin() from zonelist_iter_begin() makes forward progress, even though
> > the thread which called write_seqlock_irqsave() cannot make progress due to preemption.
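
For context, the reader side in mm/page_alloc.c is essentially an
open-coded retry loop; paraphrased (the helper names below are the ones
used in the allocation slowpath, but the structure is simplified):

	unsigned int seq;

retry:
	/* zonelist_iter_begin() boils down to
	 * read_seqbegin(&zonelist_update_seq) */
	seq = zonelist_iter_begin();

	/* ... try the free lists, reclaim, compaction ... */

	/* check_retry_zonelist() boils down to
	 * read_seqretry(&zonelist_update_seq, seq) */
	if (check_retry_zonelist(seq))
		goto retry;

If the writer is preempted while the sequence count is odd, every
reader keeps looping here, which is exactly the lack of forward
progress described above.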
> > 
> > Question to Sebastian:
> > To make sure that such a thing cannot happen, we should make sure that
> > a thread which entered write_seqcount_begin(&zonelist_update_seq.seqcount) via
> > write_seqlock_irqsave(&zonelist_update_seq, flags) can keep running on a CPU until
> > write_seqcount_end(&zonelist_update_seq.seqcount) via
> > write_sequnlock_irqrestore(&zonelist_update_seq, flags).
> > Does adding preempt_disable() before write_seqlock_irqsave(&zonelist_update_seq, flags) help?
> > 
> > Question to Peter:
> > Even if local_irq_save(flags) disables IRQs, NMI context can still enqueue messages via printk().
> > When does a message enqueued from NMI context get printed?
> 
> They are flushed to the console either by irq_work or by another
> printk(). The irq_work cannot proceed while IRQs are disabled,
> but another non-deferred printk() would try to flush them immediately.
> 
> > Is there a possibility
> > that a message enqueued from NMI context gets printed between
> > write_seqlock_irqsave(&zonelist_update_seq, flags) and printk_deferred_enter(), or between
> > printk_deferred_exit() and write_sequnlock_irqrestore(&zonelist_update_seq, flags)?
> > If yes, we can't increment zonelist_update_seq.seqcount before printk_deferred_enter()...
> 
> It might happen when a printk() is called in these holes.

I believe that this hole is the only remaining problem. IMHO, the only
solution is to disable migration/preemption in printk_deferred_enter().
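
To make the two windows explicit:

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	/* hole #1: the seqcount is already odd, but printk() is not
	 * deferred yet; a non-deferred printk() here would try to
	 * flush the ring buffer immediately, including messages
	 * enqueued from NMI context */
	printk_deferred_enter();

	/* ... update the zonelists ... */

	printk_deferred_exit();
	/* hole #2: the same window exists on the way out */
	write_sequnlock_irqrestore(&zonelist_update_seq, flags);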

It is not a big deal, really. __rt_spin_lock() would call
migrate_disable() anyway. And a nested migrate_disable() is fast:
it just increments a counter when the counter is already non-zero.

I would suggest making this change in printk_deferred_enter() first.
It will allow calling it before write_seqlock_irqsave(), and we
will not need to change the ordering back and forth.

The result would look like:

in include/linux/printk.h:

static inline void printk_deferred_enter(void)
{
	/* On PREEMPT_RT, code running under this protection may take
	 * sleeping rt_spin_lock()s, so only pin the task to the CPU
	 * instead of disabling preemption. */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_disable();
	else
		migrate_disable();

	__printk_safe_enter();
}
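
The matching printk_deferred_exit() would then presumably need to undo
the same state; a sketch (untested):

static inline void printk_deferred_exit(void)
{
	__printk_safe_exit();

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_enable();
	else
		migrate_enable();
}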

in mm/page_alloc.c

	printk_deferred_enter();
	write_seqlock_irqsave(&zonelist_update_seq, flags);
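
and the unlock side would be mirrored accordingly:

	write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	printk_deferred_exit();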


Best Regards,
Petr



