The patch titled Subject: mm, hotplug: fix concurrent memory hot-add deadlock has been removed from the -mm tree. Its filename was mm-hotplug-fix-concurrent-memory-hot-add-deadlock.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: David Rientjes <rientjes@xxxxxxxxxx> Subject: mm, hotplug: fix concurrent memory hot-add deadlock There's a deadlock when concurrently hot-adding memory through the probe interface and switching a memory block from offline to online. When hot-adding memory via the probe interface, add_memory() first takes mem_hotplug_begin() and then device_lock() is later taken when registering the newly initialized memory block. This creates a lock dependency of (1) mem_hotplug.lock (2) dev->mutex. When switching a memory block from offline to online, dev->mutex is first grabbed in device_online() when the write(2) transitions an existing memory block from offline to online, and then online_pages() will take mem_hotplug_begin(). This creates a lock inversion between mem_hotplug.lock and dev->mutex. Vitaly reports that this deadlock can happen when kworker handling a probe event races with systemd-udevd switching a memory block's state. This patch requires the state transition to take mem_hotplug_begin() before dev->mutex. Hot-adding memory via the probe interface creates a memory block while holding mem_hotplug_begin(), there is no way to take dev->mutex first in this case. online_pages() and offline_pages() are only called when transitioning memory block state. We now require that mem_hotplug_begin() is taken before calling them -- this requires exporting the mem_hotplug_begin() and mem_hotplug_done() to generic code. In all hot-add and hot-remove cases, mem_hotplug_begin() is done prior to device_online(). This is all that is needed to avoid the deadlock. Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx> Reported-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> Tested-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> Cc: "Rafael J. Wysocki" <rjw@xxxxxxxxxxxxx> Cc: "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx> Cc: Tang Chen <tangchen@xxxxxxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Cc: Zhang Zhen <zhenzhang.zhang@xxxxxxxxxx> Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> Cc: Wang Nan <wangnan0@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- drivers/base/memory.c | 19 +++++++++++------ include/linux/memory_hotplug.h | 6 +++++ mm/memory_hotplug.c | 33 +++++++++++-------------------- 3 files changed, 30 insertions(+), 28 deletions(-) diff -puN drivers/base/memory.c~mm-hotplug-fix-concurrent-memory-hot-add-deadlock drivers/base/memory.c --- a/drivers/base/memory.c~mm-hotplug-fix-concurrent-memory-hot-add-deadlock +++ a/drivers/base/memory.c @@ -219,6 +219,7 @@ static bool pages_correctly_reserved(uns /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. + * Must already be protected by mem_hotplug_begin(). */ static int memory_block_action(unsigned long phys_index, unsigned long action, int online_type) @@ -286,6 +287,7 @@ static int memory_subsys_online(struct d if (mem->online_type < 0) mem->online_type = MMOP_ONLINE_KEEP; + /* Already under protection of mem_hotplug_begin() */ ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); /* clear online_type */ @@ -328,17 +330,19 @@ store_mem_state(struct device *dev, goto err; } + /* + * Memory hotplug needs to hold mem_hotplug_begin() for probe to find + * the correct memory block to online before doing device_online(dev), + * which will take dev->mutex. Take the lock early to prevent an + * inversion, memory_subsys_online() callbacks will be implemented by + * assuming it's already protected. + */ + mem_hotplug_begin(); + switch (online_type) { case MMOP_ONLINE_KERNEL: case MMOP_ONLINE_MOVABLE: case MMOP_ONLINE_KEEP: - /* - * mem->online_type is not protected so there can be a - * race here. However, when racing online, the first - * will succeed and the second will just return as the - * block will already be online. The online type - * could be either one, but that is expected. - */ mem->online_type = online_type; ret = device_online(&mem->dev); break; @@ -349,6 +353,7 @@ store_mem_state(struct device *dev, ret = -EINVAL; /* should never happen */ } + mem_hotplug_done(); err: unlock_device_hotplug(); diff -puN include/linux/memory_hotplug.h~mm-hotplug-fix-concurrent-memory-hot-add-deadlock include/linux/memory_hotplug.h --- a/include/linux/memory_hotplug.h~mm-hotplug-fix-concurrent-memory-hot-add-deadlock +++ a/include/linux/memory_hotplug.h @@ -192,6 +192,9 @@ extern void get_page_bootmem(unsigned lo void get_online_mems(void); void put_online_mems(void); +void mem_hotplug_begin(void); +void mem_hotplug_done(void); + #else /* ! CONFIG_MEMORY_HOTPLUG */ /* * Stub functions for when hotplug is off @@ -231,6 +234,9 @@ static inline int try_online_node(int ni static inline void get_online_mems(void) {} static inline void put_online_mems(void) {} +static inline void mem_hotplug_begin(void) {} +static inline void mem_hotplug_done(void) {} + #endif /* ! CONFIG_MEMORY_HOTPLUG */ #ifdef CONFIG_MEMORY_HOTREMOVE diff -puN mm/memory_hotplug.c~mm-hotplug-fix-concurrent-memory-hot-add-deadlock mm/memory_hotplug.c --- a/mm/memory_hotplug.c~mm-hotplug-fix-concurrent-memory-hot-add-deadlock +++ a/mm/memory_hotplug.c @@ -104,7 +104,7 @@ void put_online_mems(void) } -static void mem_hotplug_begin(void) +void mem_hotplug_begin(void) { mem_hotplug.active_writer = current; @@ -119,7 +119,7 @@ static void mem_hotplug_begin(void) } } -static void mem_hotplug_done(void) +void mem_hotplug_done(void) { mem_hotplug.active_writer = NULL; mutex_unlock(&mem_hotplug.lock); @@ -959,6 +959,7 @@ static void node_states_set_node(int nod } +/* Must be protected by mem_hotplug_begin() */ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type) { unsigned long flags; @@ -969,7 +970,6 @@ int __ref online_pages(unsigned long pfn int ret; struct memory_notify arg; - mem_hotplug_begin(); /* * This doesn't need a lock to do pfn_to_page(). * The section can't be removed here because of the @@ -977,21 +977,20 @@ int __ref online_pages(unsigned long pfn */ zone = page_zone(pfn_to_page(pfn)); - ret = -EINVAL; if ((zone_idx(zone) > ZONE_NORMAL || online_type == MMOP_ONLINE_MOVABLE) && !can_online_high_movable(zone)) - goto out; + return -EINVAL; if (online_type == MMOP_ONLINE_KERNEL && zone_idx(zone) == ZONE_MOVABLE) { if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) - goto out; + return -EINVAL; } if (online_type == MMOP_ONLINE_MOVABLE && zone_idx(zone) == ZONE_MOVABLE - 1) { if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages)) - goto out; + return -EINVAL; } /* Previous code may changed the zone of the pfn range */ @@ -1007,7 +1006,7 @@ int __ref online_pages(unsigned long pfn ret = notifier_to_errno(ret); if (ret) { memory_notify(MEM_CANCEL_ONLINE, &arg); - goto out; + return ret; } /* * If this zone is not populated, then it is not in zonelist. @@ -1031,7 +1030,7 @@ int __ref online_pages(unsigned long pfn (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1); memory_notify(MEM_CANCEL_ONLINE, &arg); - goto out; + return ret; } zone->present_pages += onlined_pages; @@ -1061,9 +1060,7 @@ int __ref online_pages(unsigned long pfn if (onlined_pages) memory_notify(MEM_ONLINE, &arg); -out: - mem_hotplug_done(); - return ret; + return 0; } #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ @@ -1688,21 +1685,18 @@ static int __ref __offline_pages(unsigne if (!test_pages_in_a_zone(start_pfn, end_pfn)) return -EINVAL; - mem_hotplug_begin(); - zone = page_zone(pfn_to_page(start_pfn)); node = zone_to_nid(zone); nr_pages = end_pfn - start_pfn; - ret = -EINVAL; if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages)) - goto out; + return -EINVAL; /* set above range as isolated */ ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, true); if (ret) - goto out; + return ret; arg.start_pfn = start_pfn; arg.nr_pages = nr_pages; @@ -1795,7 +1789,6 @@ repeat: writeback_set_ratelimit(); memory_notify(MEM_OFFLINE, &arg); - mem_hotplug_done(); return 0; failed_removal: @@ -1805,12 +1798,10 @@ failed_removal: memory_notify(MEM_CANCEL_OFFLINE, &arg); /* pushback to free area */ undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); - -out: - mem_hotplug_done(); return ret; } +/* Must be protected by mem_hotplug_begin() */ int offline_pages(unsigned long start_pfn, unsigned long nr_pages) { return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ); _ Patches currently in -mm which might be from rientjes@xxxxxxxxxx are origin.patch mm-mempool-do-not-allow-atomic-resizing.patch mm-hugetlb-abort-__get_user_pages-if-current-has-been-oom-killed.patch cxgb4-drop-__gfp_nofail-allocation.patch jbd2-revert-must-not-fail-allocation-loops-back-to-gfp_nofail.patch slab-infrastructure-for-bulk-object-allocation-and-freeing-v3.patch slub-bulk-alloc-extract-objects-from-the-per-cpu-slab.patch slub-bulk-allocation-from-per-cpu-partial-pages.patch slub-bulk-allocation-from-per-cpu-partial-pages-fix.patch mm-refactor-zone_movable_is_highmem.patch mm-memory-failurec-define-page-types-for-action_result-in-one-place.patch page-flags-define-behavior-slb-related-flags-on-compound-pages.patch allow-compaction-of-unevictable-pages.patch document-interaction-between-compaction-and-the-unevictable-lru.patch document-interaction-between-compaction-and-the-unevictable-lru-fix.patch mm-memcg-sync-allocation-and-memcg-charge-gfp-flags-for-thp.patch mm-memcg-sync-allocation-and-memcg-charge-gfp-flags-for-thp-fix-fix.patch mm-compaction-reset-compaction-scanner-positions.patch hugetlbfs-add-minimum-size-tracking-fields-to-subpool-structure.patch hugetlbfs-add-minimum-size-accounting-to-subpools.patch hugetlbfs-accept-subpool-min_size-mount-option-and-setup-accordingly.patch hugetlbfs-document-min_size-mount-option-and-cleanup.patch mm-vmalloc-fix-possible-exhaustion-of-vmalloc-space-caused-by-vm_map_ram-allocator.patch mm-vmalloc-occupy-newly-allocated-vmap-block-just-after-allocation.patch mm-vmalloc-get-rid-of-dirty-bitmap-inside-vmap_block-structure.patch mremap-should-return-enomem-when-__vm_enough_memory-fail.patch clean-up-goto-just-return-err_ptr.patch fs-jfs-remove-slab-object-constructor.patch mm-mempool-disallow-mempools-based-on-slab-caches-with-constructors.patch mm-mempool-poison-elements-backed-by-slab-allocator.patch mm-mempool-poison-elements-backed-by-page-allocator.patch mm-mempool-poison-elements-backed-by-page-allocator-fix.patch mm-mempool-poison-elements-backed-by-page-allocator-fix-fix.patch mm-mempool-poison-elements-backed-by-page-allocator-fix-fix-fix.patch thp-handle-errors-in-hugepage_init-properly.patch thp-do-not-adjust-zone-water-marks-if-khugepaged-is-not-started.patch mm-doc-cleanup-and-clarify-munmap-behavior-for-hugetlb-memory.patch mm-doc-cleanup-and-clarify-munmap-behavior-for-hugetlb-memory-fix.patch mm-selftests-test-return-value-of-munmap-for-map_hugetlb-memory.patch mm-dont-call-__page_cache_release-for-hugetlb.patch mm-hugetlb-introduce-pagehugeactive-flag.patch mm-hugetlb-introduce-pagehugeactive-flag-fix.patch mm-hugetlb-cleanup-using-pagehugeactive-flag.patch mm-hugetlb-cleanup-using-pagehugeactive-flag-fix.patch thp-cleanup-khugepaged-startup.patch mm-mempool-kasan-poison-mempool-elements.patch hung_task-change-hung_taskc-to-use-for_each_process_thread.patch mm-utilc-add-kstrimdup.patch linux-next.patch -- To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html