On 2023/06/22 16:18, Michal Hocko wrote: >>> It is explained as the first deadlock scenario in commit 1007843a9190 >>> ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock"). >>> We have to disable IRQ before making zonelist_update_seq.seqcount odd. >>> >> >> Since we must replace local_irq_save() + write_seqlock() with write_seqlock_irqsave() for >> CONFIG_PREEMPT_RT=y case but we must not replace local_irq_save() + write_seqlock() with >> write_seqlock_irqsave() for CONFIG_PREEMPT_RT=n case, the proper fix is something like below? > > Now, I am confused. Why write_seqlock_irqsave is not allowed for !RT? > Let me quote the changelog and he scenario 1: > write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd > // e.g. timer interrupt handler runs at this moment > some_timer_func() { > kmalloc(GFP_ATOMIC) { > __alloc_pages_slowpath() { > read_seqbegin(&zonelist_update_seq) { > // spins forever because zonelist_update_seq.seqcount is odd > } > } > } > } > // e.g. timer interrupt handler finishes > write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even > > This is clearly impossible with write_seqlock_irqsave as interrupts are > disabled before the lock is taken. Well, it seems that "I don't want to replace" rather than "we must not replace". I reread the thread but I couldn't find why nobody suggested write_seqlock_irqsave(). The reason I proposed the local_irq_save() => printk_deferred_enter() => write_seqlock() ordering implies a precaution in case write_seqlock() involves printk() (e.g. lockdep, KCSAN, soft-lockup warning), in addition to "local_irq_save() before printk_deferred_enter()" requirement. Maybe people in that thread were happy with preserving this precaution... You commented There shouldn't be any other locks (apart from hotplug) taken in that path IIRC. at https://lkml.kernel.org/ZCrYQj+2/uMtqNBm@xxxxxxxxxxxxxx . If __build_all_zonelists() is already serialized by hotplug lock, we don't need to call spin_lock(&zonelist_update_seq.lock) and we will be able to replace write_seqlock(&zonelist_update_seq) with write_seqcount_begin(&zonelist_update_seq.seqcount) like cpuset_change_task_nodemask() does?