On Fri, 25 Feb 2022, Andrew Morton wrote:
> On Mon, 21 Feb 2022 11:17:49 +0000 cgel.zte@xxxxxxxxx wrote:
> > From: Guo Ziliang <guo.ziliang@xxxxxxxxxx>
> >
> > In our testing, a deadloop task was found.  Through sysrq printing,
> > the same stack was found every time, as follows:
> >
> >   __swap_duplicate+0x58/0x1a0
> >   swapcache_prepare+0x24/0x30
> >   __read_swap_cache_async+0xac/0x220
> >   read_swap_cache_async+0x58/0xa0
> >   swapin_readahead+0x24c/0x628
> >   do_swap_page+0x374/0x8a0
> >   __handle_mm_fault+0x598/0xd60
> >   handle_mm_fault+0x114/0x200
> >   do_page_fault+0x148/0x4d0
> >   do_translation_fault+0xb0/0xd4
> >   do_mem_abort+0x50/0xb0
> >
> > The reason for the deadloop is that swapcache_prepare() always returns
> > -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
> > task can never leave the loop.  We suspected that the task which clears
> > the SWAP_HAS_CACHE flag never gets a chance to run.  We tried lowering
> > the priority of the task stuck in the deadloop so that the task which
> > clears the SWAP_HAS_CACHE flag could run: the system then returned to
> > normal.
> >
> > In our testing, multiple real-time tasks are bound to the same core,
> > and the task in the deadloop is the highest-priority task on that
> > core, so the deadloop task cannot be preempted.
> >
> > Although cond_resched() is used in __read_swap_cache_async(), it is an
> > empty function on a preemptive kernel and so cannot release the CPU.
> > A high-priority task cannot release the CPU unless preempted by a
> > higher-priority task; but when this task is already the highest-priority
> > task on its core, no other task can be scheduled.  So we replace
> > cond_resched() with schedule_timeout_uninterruptible(1):
> > schedule_timeout_uninterruptible() calls set_current_state() first to
> > set the task state, so the task is removed from the run queue, giving
> > up the CPU and preventing it from running in kernel mode for too long.
> >
> > ...
> >
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> >  		 * in swap_map, but not yet added its page to swap cache.
> >  		 */
> > -		cond_resched();
> > +		schedule_timeout_uninterruptible(1);
> >  	}
> >  
> >  	/*
>
> Sigh.  I guess yes, we should do this, at least in a short-term,
> backportable-to-stable way.
>
> But busy-waiting while hoping that someone else will save us isn't an
> attractive design.  Hugh, have you ever thought about something more
> deterministic in there?

Not something more deterministic, no: I think that would entail heavier
locking, perhaps slowing down hotter paths, just to avoid this swap
oddity.

This loop was written long before there was a preemptive kernel: it was
appropriate then, and almost never needed more than one retry to
complete; but preemption changed the story without us realizing.

Sigh here too.  I commend the thread on it from July 2018:

https://lore.kernel.org/linux-mm/2018072514403228778860@xxxxxxxxxxxx/

There the 4.9-stable user proposed preempt_disable(), I agreed but found
the patch provided insufficient, and offered another 4.9 patch further
down the thread.  Your preference at the time was msleep(1).
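To make the race concrete, here is a much-condensed sketch of the retry
loop under discussion, with the proposed change applied.  It is
simplified from 5.17-era mm/swap_state.c -- allocation, refcounting and
most error handling are elided -- so it does not match the source line
for line:

	/* Condensed sketch of the retry loop in __read_swap_cache_async() */
	for (;;) {
		/* Did another task already add the page to the swap cache? */
		page = find_get_page(swap_address_space(entry),
				     swp_offset(entry));
		if (page)
			return page;

		/* ... allocate a page for the swapin here ... */

		/* Claim the entry: sets SWAP_HAS_CACHE in the swap map. */
		err = swapcache_prepare(entry);
		if (!err)
			break;		/* we won the race; add our page */
		if (err != -EEXIST)
			return NULL;	/* the swap entry itself is gone */

		/*
		 * -EEXIST: some other task has set SWAP_HAS_CACHE but has
		 * not yet added its page to the swap cache.  It needs CPU
		 * time to finish; cond_resched() here was a no-op on a
		 * preemptive kernel, so a highest-priority RT task spun
		 * forever, whereas sleeping for a jiffy lets the other
		 * task run.
		 */
		schedule_timeout_uninterruptible(1);
	}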
I was working on a similar patch for 4.18, but have not completed it yet
;) and don't remember how satisfied or not I was with that one; and
wonder if I'm any more likely to get it finished by 2026.  It's clear
that I put much more thought into it back then than just now.  Maybe
someone else would have a go: my 4.9 patch in that thread shows most of
it, but might need a lot of work to update to 5.17.

And it also gathered some temporary debug stats on how often this
happens: I'm not conscious of using RT at all, but was disturbed to see
how long an ordinary preemptive kernel was sometimes spinning there.  So
I think I agree with you more than Michal on that: RT just makes the bad
behaviour more obvious.

Hugh
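For reference, the spinning on an ordinary preemptive kernel follows
from how cond_resched() is defined.  The sketch below is condensed from
include/linux/sched.h and kernel/time/timer.c, and omits the
PREEMPT_DYNAMIC variants of recent kernels, so treat it as illustrative
rather than exact:

	#ifdef CONFIG_PREEMPTION
	/*
	 * Fully preemptive kernel: kernel code is preempted as soon as a
	 * higher-priority task becomes runnable, so cond_resched() does
	 * nothing.  That implicit preemption cannot help when the spinning
	 * task is already the highest-priority runnable task on its CPU.
	 */
	static inline int _cond_resched(void) { return 0; }
	#else
	int _cond_resched(void);	/* may call schedule() */
	#endif
	#define cond_resched() _cond_resched()

	/*
	 * By contrast, schedule_timeout_uninterruptible() unconditionally
	 * sets the task state and schedules away, taking the caller off
	 * the run queue for at least one jiffy: long enough for the task
	 * that will clear SWAP_HAS_CACHE to run, whatever its priority.
	 */
	signed long schedule_timeout_uninterruptible(signed long timeout)
	{
		__set_current_state(TASK_UNINTERRUPTIBLE);
		return schedule_timeout(timeout);
	}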