On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > switch the crash dump from piping to the traditional Linux
> > > save-to-files method. However, after trying this workaround, I
> > > still observed hardware watchdog resets during machine shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown,
> > > systemd calls a service that hot-removes memory, and if
> > > hot-removing fails for some reason, systemd kills that service
> > > after a timeout. However, systemd is never able to kill the
> > > service, and we get a hardware reset caused by the watchdog, or a
> > > hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be
> > > migrated this loop never terminates. However, this loop can be
> > > terminated via a signal from systemd after a timeout.
> > >
> > > __offline_pages()
> > >   do {
> > >       pfn = scan_movable_pages(pfn, end_pfn);
> > >             # Returns 0, meaning there is nothing available to
> > >             # migrate, no page is PageLRU(page)
> > >       ...
> > >       ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                   NULL, check_pages_isolated_cb);
> > >             # Returns -EBUSY, meaning there is at least one PFN
> > >             # that still has to be migrated.
> > >   } while (ret);
> > >
> > > Thread #2: css killer kthread
> > >   css_killed_work_fn
> > >     cgroup_mutex  <- grabs this mutex
> > >     mem_cgroup_css_offline
> > >       memcg_offline_kmem.part
> > >         memcg_deactivate_kmem_caches
> > >           get_online_mems
> > >             mem_hotplug_lock  <- waits for Thread #1 to get read access
> > >
> > > Thread #3: systemd
> > >   ksys_read
> > >     vfs_read
> > >       __vfs_read
> > >         seq_read
> > >           proc_single_show
> > >             proc_cgroup_show
> > >               mutex_lock  <- waits for cgroup_mutex owned by Thread #2
> > >
> > > Thus, thread #3 (systemd) is stuck and unable to deliver the
> > > timeout signal to thread #1.
> > >
> > > The proper fix for both problems is to avoid the cgroup_mutex ->
> > > mem_hotplug_lock ordering. That was recently fixed in mainline but
> > > is still present in all stable branches. Unfortunately, I do not
> > > see a simple way to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series, which
> > > is too big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from the hung task
> > timer) instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will
> > fix this, I also wonder if infinitely looping in the memory unplug
> > path with mem_hotplug_lock held is the right thing to do?
> > Earlier we had a few other exit possibilities in this path (like
> > max retries etc.) but those were removed by these commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
>
> Hi Bharata,
>
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
>
> What I found is that the reason our machines did not complete
> hot-remove within the given time is this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@xxxxxxxxxx
>
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix you are proposing, so
I might be seeing some other issue, which I will debug further.

Also, it looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

(I have included a rough sketch of that loop below for reference.)

Regards,
Bharata.
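
P.S. For reference, here is a condensed sketch of how I read that retry
loop on our (pre-c52e75935f8d) tree, with the drain_all_pages() call the
commit removed. This is simplified and abbreviated, not the verbatim
upstream code, so the exact error paths and the placement of the drain
and of the signal check are approximate:

    /* Condensed from mm/memory_hotplug.c, pre-c52e75935f8d; approximate. */
    static int __offline_pages(unsigned long start_pfn, unsigned long end_pfn)
    {
        struct zone *zone = page_zone(pfn_to_page(start_pfn));
        unsigned long pfn;
        int ret;

        mem_hotplug_begin();    /* write side of mem_hotplug_lock: held for
                                 * the whole loop below, so get_online_mems()
                                 * readers (Thread #2 above) keep waiting */
        ...
        do {
                for (pfn = start_pfn; pfn;) {
                        /* one way out of the loop: the signal systemd
                         * sends when the service times out */
                        if (signal_pending(current)) {
                                ret = -EINTR;
                                goto failed_removal;
                        }
                        cond_resched();
                        lru_add_drain_all();

                        pfn = scan_movable_pages(pfn, end_pfn);
                        if (pfn)        /* something left to migrate */
                                do_migrate_range(pfn, end_pfn);
                }

                /* the call that c52e75935f8d later removed */
                drain_all_pages(zone);

                /* check again: -EBUSY means some PFN is still not free */
                ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
        } while (ret);          /* no retry limit, no timeout */
        ...
        mem_hotplug_done();
        return 0;

    failed_removal:
        ...
        mem_hotplug_done();
        return ret;
    }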