This patch resolves problem in 4.4 & 4.9 PREEMPT_RT kernels that the following WARNING happens repeatedly due to broken context caused by running slab allocation with IRQ disabled by mistake. WARNING: CPU: * PID: ** at */kernel/cpu.c:197 unpin_current_cpu+0x60/0x70() The system is almost unresponsive and the boot stalls once it occurs. This repeated WARNING only happens while kernel is booting (before reaches the userland) with a quite low reproducibility: Only one time in around 1,000 ~ 10,000 reboots. [Problem details] On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables IRQ with local_irq_save(), allocate_slab() is responsible for re-enabling IRQ only under the specific conditions: (1) gfpflags_allow_blocking(flags) OR (2) system_state == SYSTEM_RUNNING The problem happens when (1) is false AND system_state == SYSTEM_BOOTING, caused by the following scenario: 1. Some kernel codes invokes the allocator without __GFP_DIRECT_RECLAIM bit (i.e. blocking not allowed) while SYSTEM_BOOTING 2. allocate_slab() calls the following functions with IRQ disabled 3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock) which might call schedule() and enable IRQ, if it failed to get pa_lock 4. The migrate_disable counter, which is not intended to be updated with IRQs disabled, is accidentally updated after schedule() then migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0) 5. The unpin_current_cpu() WARNING is raised eventually because the refcount counter is linked to the migrate_disable counter The behavior 2-5 above has been obsereved[1] using ftrace. The condition (2) above intends to make the memory allocator fully preemptible on PREEMPT_RT kernels[2], so the lock function in the step 3 above should work if SYSTEM_RUNNING but not if SYSTEM_BOOTING. [How this is resolved in newer RT kernels] A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3]. On top of this, v4.14-rt (6cec8467) changes the condition (2) above: - if (system_state == SYSTEM_RUNNING) + if (system_state > SYSTEM_BOOTING) This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING. Thus, the conditions that allocate_slab() enables IRQ are like: (2)system_state v4.9-rt or before v4.14-rt or later SYSTEM_BOOTING (1)==true (1)==true : : : v SYSTEM_SCHEDULING : < Problem Always v < occurs here | SYSTEM_RUNNING Always | | | v v [How this patch works] An simple option would be to backport the series[3], which is possible and has been verified[4]. However, that series pulls functional changes like SYSTEM_SCHEDULING and adjustments for it, early might_sleep() and smp_processor_id() supports, etc. Therefore, this patch uses an extra (but not mainline) flag "system_scheduling" provided by the prior patch instead of introducing SYSTEM_SCHEDULING, then uses the same condition as newer RT kernels in allocate_slab(). This patch also applies the fix in v5.4-rt (7adf5bc5) to care SYSTEM_SUSPEND in the condition check. [1] https://lore.kernel.org/all/TYCPR01MB11385E3CDF05544B63F7EF9C1E1622@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ [2] https://docs.kernel.org/locking/locktypes.html#raw-spinlock-t-on-rt [3] https://lore.kernel.org/all/20170516184231.564888231@xxxxxxxxxxxxx/T/ [4] https://lore.kernel.org/all/TYCPR01MB1138579CA7612B568BB880652E1272@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ Signed-off-by: Kazuhiro Hayashi <kazuhiro3.hayashi@xxxxxxxxxxxxx> Reviewed-by: Pavel Machek <pavel@xxxxxxx> --- mm/slub.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index fd23ff951395..6186a2586289 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1412,7 +1412,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) if (gfpflags_allow_blocking(flags)) enableirqs = true; #ifdef CONFIG_PREEMPT_RT_FULL - if (system_state == SYSTEM_RUNNING) + /* SYSTEM_SCHEDULING <= system_state < SYSTEM_SUSPEND in the mainline */ + if (system_scheduling && system_state < SYSTEM_SUSPEND) enableirqs = true; #endif if (enableirqs) -- 2.30.2