On Tue, May 09, 2023 at 12:00:30PM +0900, Sergey Senozhatsky wrote: > On (23/05/08 09:00), Nhat Pham wrote: > > > The deeper bug here is that zs_map_object() tries to add the page to > > > the LRU list while the shrinker has it isolated for reclaim. This is > > > way too sutble and error prone. Even if it worked now, it'll cause > > > corruption issues down the line. > > > > > > For example, Nhat is adding a secondary entry point to reclaim. > > > Reclaim expects that a page that's on the LRU is also on the fullness > > > list, so this would lead to a double remove_zspage() and BUG_ON(). > > > > > > This patch doesn't just fix the crash, it eliminates the deeper LRU > > > isolation issue and makes the code more robust and simple. > > > > I agree. IMO, less unnecessary concurrent interaction is always a > > win for developers' and maintainers' cognitive load. > > Thanks for all the explanations. > > > As a side benefit - this also gets rid of the inelegant check > > (mm == ZS_MM_WO). The fact that we had to include a > > a multi-paragraph explanation for a 3-line piece of code > > should have been a red flag. > > Minchan had some strong opinion on that, so we need to hear from him > before we decide how do we fix it. I'd be happy if he could validate the fix. But this fixes a crash, so the clock is ticking. I will also say, his was a design preference. One we agreed to only very reluctantly: https://lore.kernel.org/lkml/Y3f6habiVuV9LMcu@xxxxxxxxxx/ Now we have a crash that is a direct result of it, and which cost us (and apparently is still costing us) time and energy to resolve. Unless somebody surfaces a real technical problem with the fix, I'd say let's do it our way this time.