On 2023/06/21 23:50, Tetsuo Handa wrote:
> On 2023/06/21 23:34, Sebastian Andrzej Siewior wrote:
>>> Also, if local_irq_save() is hidden due to RT, what guarantees that
>>>
>>> 	write_seqlock_irqsave(&zonelist_update_seq, flags);
>>> 	<<IRQ>>
>>> 	  some_timer_function() {
>>> 	    printk();
>>> 	  }
>>> 	<</IRQ>>
>>> 	printk_deferred_enter();
>>>
>>> does not happen, because write_seqlock_irqsave() does not disable IRQs?
>>
>> I don't see how zonelist_update_seq and printk here are connected
>> without the port lock or memory allocation. But there are two things
>> that are different on RT which probably answer your question:
>
> It is explained as the first deadlock scenario in commit 1007843a9190
> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
> We have to disable IRQs before making zonelist_update_seq.seqcount odd.

Since we must replace local_irq_save() + write_seqlock() with
write_seqlock_irqsave() in the CONFIG_PREEMPT_RT=y case, but must not
make that replacement in the CONFIG_PREEMPT_RT=n case, is the proper
fix something like below? (A userspace sketch of the livelock which the
CONFIG_PREEMPT_RT=n path avoids is appended at the end of this mail.)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..e3e9bd719dcc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5798,28 +5798,30 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 #define BOOT_PAGESET_HIGH	0
 #define BOOT_PAGESET_BATCH	1
 static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats);
 
 static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
 	 * Explicitly disable this CPU's interrupts before taking seqlock
 	 * to prevent any IRQ handler from calling into the page allocator
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
+#endif
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
 	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -5852,21 +5854,23 @@ static void __build_all_zonelists(void *data)
 	 * secondary cpus' numa_mem as they come on-line. During
 	 * node/memory hotplug, we'll fixup all on-line cpus.
 	 */
 	for_each_online_cpu(cpu)
 		set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
 
 	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
+#ifndef CONFIG_PREEMPT_RT
 	local_irq_restore(flags);
+#endif
 }
 
 static noinline void __init build_all_zonelists_init(void)
 {
 	int cpu;
 
 	__build_all_zonelists(NULL);
 
 	/*

By the way, given that the

	write_seqlock_irqsave(&zonelist_update_seq, flags);
	<<IRQ>>
	  some_timer_function() {
	    kmalloc(GFP_ATOMIC);
	  }
	<</IRQ>>
	printk_deferred_enter();

scenario in the CONFIG_PREEMPT_RT=y case is handled by executing
some_timer_function() in a dedicated kernel thread for IRQs, what
guarantees that the kernel thread for IRQs gives up the CPU and the
user thread which called write_seqlock() gets to run until
write_sequnlock() is called? How can the kernel figure out that
executing the user thread needs a higher priority than the kernel
thread?
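
Just for illustration, below is a minimal userspace sketch of the
livelock which the CONFIG_PREEMPT_RT=n path avoids by doing
local_irq_save() before write_seqlock(). This is not kernel code: a
signal handler stands in for the IRQ, an atomic counter stands in for
zonelist_update_seq.seqcount, and the retry loop stands in for
zonelist_iter_begin(). The names (seq, irq_reader) are made up for
this sketch.

#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;	/* stand-in for zonelist_update_seq.seqcount */

/*
 * Stand-in for an IRQ handler doing a GFP_ATOMIC allocation: like
 * zonelist_iter_begin(), it retries until it sees an even count.
 * Since it runs on top of the interrupted "writer", the writer can
 * never resume to make the count even, so this spins forever.
 */
static void irq_reader(int sig)
{
	(void)sig;
	while (atomic_load(&seq) & 1)
		;
}

int main(void)
{
	signal(SIGALRM, irq_reader);

	atomic_fetch_add(&seq, 1);	/* write_seqlock(): count becomes odd */
	raise(SIGALRM);			/* "IRQ" arrives inside the write section */
	atomic_fetch_add(&seq, 1);	/* write_sequnlock(): never reached */

	puts("not reached");		/* running this program hangs in irq_reader() */
	return 0;
}

Blocking the signal around the write section (sigprocmask(), the
userspace counterpart of local_irq_save()) would defer irq_reader()
until after the second increment, so it would observe an even count
and return immediately; that is exactly the ordering the
CONFIG_PREEMPT_RT=n branch of the diff above preserves.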