On 17.05.19 16:09, Pavel Tatashin wrote: >> >> I would think that ACPI hotplug would have a similar problem, but it does this: >> >> acpi_unbind_memory_blocks(info); >> __remove_memory(nid, info->start_addr, info->length); > > ACPI does have exactly the same problem, so this is not a bug for this > series, I will submit a new version of my series with comments > addressed, but without fix for this issue. > > I was able to reproduce this issue on the current mainline kernel. > Also, I been thinking more about how to fix it, and there is no easy > fix without a major hotplug redesign. Basically, we have to remove > sysfs memory entries before or after memory is hotplugged/hotremoved. > But, we also have to guarantee that hotplug/hotremove will succeed or > reinstate sysfs entries. > > Qemu script: > > qemu-system-x86_64 \ > -enable-kvm \ > -cpu host \ > -parallel none \ > -echr 1 \ > -serial none \ > -chardev stdio,id=console,signal=off,mux=on \ > -serial chardev:console \ > -mon chardev=console \ > -vga none \ > -display none \ > -kernel pmem/native/arch/x86/boot/bzImage \ > -m 8G,slots=1,maxmem=16G \ > -smp 8 \ > -fsdev local,id=virtfs1,path=/,security_model=none \ > -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \ > -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 > TERM=xterm ip=dhcp loglevel=7' > > Config is attached. > > Steps to reproduce: > # > # QEMU 4.0.0 monitor - type 'help' for more information > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > (qemu) > > # echo online_movable > /sys/devices/system/memory/memory79/state > [ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370 > [ 23.032591] Policy zone: Normal > # (qemu) device_del dimm1 > (qemu) [ 32.013950] Offlined Pages 32768 > [ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022 > [ 32.014843] Policy zone: Normal > [ 32.015733] > [ 32.015881] ====================================================== > [ 32.016390] WARNING: possible circular locking dependency detected > [ 32.016881] 5.1.0_pt_pmem #38 Not tainted > [ 32.017202] ------------------------------------------------------ > [ 32.017680] kworker/u16:4/380 is trying to acquire lock: > [ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at: > kernfs_remove_by_name_ns+0x3b/0x80 > [ 32.018745] > [ 32.018745] but task is already holding lock: > [ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at: > unregister_memory_section+0x1d/0xa0 > [ 32.019859] > [ 32.019859] which lock already depends on the new lock. > [ 32.019859] > [ 32.020499] > [ 32.020499] the existing dependency chain (in reverse order) is: > [ 32.021080] > [ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}: > [ 32.021522] __mutex_lock+0x8b/0x900 > [ 32.021843] hotplug_memory_register+0x26/0xa0 > [ 32.022231] __add_pages+0xe7/0x160 > [ 32.022545] add_pages+0xd/0x60 > [ 32.022835] add_memory_resource+0xc3/0x1d0 > [ 32.023207] __add_memory+0x57/0x80 > [ 32.023530] acpi_memory_device_add+0x13a/0x2d0 > [ 32.023928] acpi_bus_attach+0xf1/0x200 > [ 32.024272] acpi_bus_scan+0x3e/0x90 > [ 32.024597] acpi_device_hotplug+0x284/0x3e0 > [ 32.024972] acpi_hotplug_work_fn+0x15/0x20 > [ 32.025342] process_one_work+0x2a0/0x650 > [ 32.025755] worker_thread+0x34/0x3d0 > [ 32.026077] kthread+0x118/0x130 > [ 32.026442] ret_from_fork+0x3a/0x50 > [ 32.026766] > [ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}: > [ 32.027261] get_online_mems+0x39/0x80 > [ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0 > [ 32.028019] kmem_cache_create+0xd/0x10 > [ 32.028367] ptlock_cache_init+0x1b/0x23 > [ 32.028724] start_kernel+0x1d2/0x4b8 > [ 32.029060] secondary_startup_64+0xa4/0xb0 > [ 32.029447] > [ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}: > [ 32.030007] cpus_read_lock+0x39/0x80 > [ 32.030360] __offline_pages+0x32/0x790 > [ 32.030709] memory_subsys_offline+0x3a/0x60 > [ 32.031089] device_offline+0x7e/0xb0 > [ 32.031425] acpi_bus_offline+0xd8/0x140 > [ 32.031821] acpi_device_hotplug+0x1b2/0x3e0 > [ 32.032202] acpi_hotplug_work_fn+0x15/0x20 > [ 32.032576] process_one_work+0x2a0/0x650 > [ 32.032942] worker_thread+0x34/0x3d0 > [ 32.033283] kthread+0x118/0x130 > [ 32.033588] ret_from_fork+0x3a/0x50 > [ 32.033919] > [ 32.033919] -> #1 (&device->physical_node_lock){+.+.}: > [ 32.034450] __mutex_lock+0x8b/0x900 > [ 32.034784] acpi_get_first_physical_node+0x16/0x60 > [ 32.035217] acpi_companion_match+0x3b/0x60 > [ 32.035594] acpi_device_uevent_modalias+0x9/0x20 > [ 32.036012] platform_uevent+0xd/0x40 > [ 32.036352] dev_uevent+0x85/0x1c0 > [ 32.036674] kobject_uevent_env+0x1e2/0x640 > [ 32.037044] kobject_synth_uevent+0x2b7/0x324 > [ 32.037428] uevent_store+0x17/0x30 > [ 32.037752] kernfs_fop_write+0xeb/0x1a0 > [ 32.038112] vfs_write+0xb2/0x1b0 > [ 32.038417] ksys_write+0x57/0xd0 > [ 32.038721] do_syscall_64+0x4b/0x1a0 > [ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [ 32.039491] > [ 32.039491] -> #0 (kn->count#18){++++}: > [ 32.039913] lock_acquire+0xaa/0x180 > [ 32.040242] __kernfs_remove+0x244/0x2d0 > [ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80 > [ 32.040991] device_del+0x14a/0x370 > [ 32.041309] device_unregister+0x9/0x20 > [ 32.041653] unregister_memory_section+0x69/0xa0 > [ 32.042059] __remove_pages+0x112/0x460 > [ 32.042402] arch_remove_memory+0x6f/0xa0 > [ 32.042758] __remove_memory+0xab/0x130 > [ 32.043103] acpi_memory_device_remove+0x67/0xe0 > [ 32.043537] acpi_bus_trim+0x50/0x90 > [ 32.043889] acpi_device_hotplug+0x2fa/0x3e0 > [ 32.044300] acpi_hotplug_work_fn+0x15/0x20 > [ 32.044686] process_one_work+0x2a0/0x650 > [ 32.045044] worker_thread+0x34/0x3d0 > [ 32.045381] kthread+0x118/0x130 > [ 32.045679] ret_from_fork+0x3a/0x50 > [ 32.046005] > [ 32.046005] other info that might help us debug this: > [ 32.046005] > [ 32.046636] Chain exists of: > [ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex > [ 32.046636] > [ 32.047514] Possible unsafe locking scenario: > [ 32.047514] > [ 32.047976] CPU0 CPU1 > [ 32.048337] ---- ---- > [ 32.048697] lock(mem_sysfs_mutex); > [ 32.048983] lock(mem_hotplug_lock.rw_sem); > [ 32.049519] lock(mem_sysfs_mutex); > [ 32.050004] lock(kn->count#18); > [ 32.050270] > [ 32.050270] *** DEADLOCK *** > [ 32.050270] > [ 32.050736] 7 locks held by kworker/u16:4/380: > [ 32.051087] #0: 00000000a22fe78e > ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650 > [ 32.051830] #1: 00000000944f2dca > ((work_completion)(&hpw->work)){+.+.}, at: > process_one_work+0x21e/0x650 > [ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at: > acpi_device_hotplug+0x2e/0x3e0 > [ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at: > acpi_device_hotplug+0x3c/0x3e0 > [ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++}, > at: __remove_memory+0x3b/0x130 > [ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++}, > at: percpu_down_write+0x1d/0x110 > [ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at: > unregister_memory_section+0x1d/0xa0 > [ 32.056016] > [ 32.056016] stack backtrace: > [ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38 > [ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.12.0-20181126_142135-anatol 04/01/2014 > [ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn > [ 32.058144] Call Trace: > [ 32.058344] dump_stack+0x67/0x90 > [ 32.058604] print_circular_bug.cold.60+0x15c/0x195 > [ 32.058989] __lock_acquire+0x17de/0x1d30 > [ 32.059308] ? find_held_lock+0x2d/0x90 > [ 32.059611] ? __kernfs_remove+0x199/0x2d0 > [ 32.059937] lock_acquire+0xaa/0x180 > [ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80 > [ 32.060596] __kernfs_remove+0x244/0x2d0 > [ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80 > [ 32.061283] ? kernfs_name_hash+0xd/0x80 > [ 32.061596] ? kernfs_find_ns+0x68/0xf0 > [ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80 > [ 32.062266] device_del+0x14a/0x370 > [ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0 > [ 32.062973] device_unregister+0x9/0x20 > [ 32.063285] unregister_memory_section+0x69/0xa0 > [ 32.063651] __remove_pages+0x112/0x460 > [ 32.063949] arch_remove_memory+0x6f/0xa0 > [ 32.064271] __remove_memory+0xab/0x130 > [ 32.064579] ? walk_memory_range+0xa1/0xe0 > [ 32.064907] acpi_memory_device_remove+0x67/0xe0 > [ 32.065274] acpi_bus_trim+0x50/0x90 > [ 32.065560] acpi_device_hotplug+0x2fa/0x3e0 > [ 32.065900] acpi_hotplug_work_fn+0x15/0x20 > [ 32.066249] process_one_work+0x2a0/0x650 > [ 32.066591] worker_thread+0x34/0x3d0 > [ 32.066925] ? process_one_work+0x650/0x650 > [ 32.067275] kthread+0x118/0x130 > [ 32.067542] ? kthread_create_on_node+0x60/0x60 > [ 32.067909] ret_from_fork+0x3a/0x50 > >> >> I wonder if that ordering prevents going too deep into the >> device_unregister() call stack that you highlighted below. >> >> >>> >>> Here is the problem: >>> >>> When we offline pages we have the following call stack: >>> >>> # echo offline > /sys/devices/system/memory/memory8/state >>> ksys_write >>> vfs_write >>> __vfs_write >>> kernfs_fop_write >>> kernfs_get_active >>> lock_acquire kn->count#122 (lock for >>> "memory8/state" kn) >>> sysfs_kf_write >>> dev_attr_store >>> state_store >>> device_offline >>> memory_subsys_offline >>> memory_block_action >>> offline_pages >>> __offline_pages >>> percpu_down_write >>> down_write >>> lock_acquire mem_hotplug_lock.rw_sem >>> >>> When we unbind dax0.0 we have the following stack: >>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind >>> drv_attr_store >>> unbind_store >>> device_driver_detach >>> device_release_driver_internal >>> dev_dax_kmem_remove >>> remove_memory device_hotplug_lock >>> try_remove_memory mem_hotplug_lock.rw_sem >>> arch_remove_memory >>> __remove_pages >>> __remove_section >>> unregister_memory_section >>> remove_memory_section mem_sysfs_mutex >>> unregister_memory >>> device_unregister >>> device_del >>> device_remove_attrs >>> sysfs_remove_groups >>> sysfs_remove_group >>> remove_files >>> kernfs_remove_by_name >>> kernfs_remove_by_name_ns >>> __kernfs_remove kn->count#122 >>> >>> So, lockdep found the ordering issue with the above two stacks: >>> >>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem >>> 2. mem_hotplug_lock.rw_sem -> kn->count#122 I once documented locking behavior in Documentation/core-api/memory-hotplug.rst Both, device_online() and __remove_memory() always have to be called holding the device_hotplug_lock(), to avoid such races. # echo offline > /sys/devices/system/memory/memory8/state and # echo 0 > /sys/devices/system/memory/memory8/online either end up in: -> online_store() -- lock_device_hotplug_sysfs(); -- device_offline(dev) -> state_store() -- lock_device_hotplug_sysfs(); -- device_online(&mem->dev); So the device_hotplug_lock prohibits the race you describe. BUT There is a possible race between the device_hotplug_lock and the kn->count#122. However, that race can never trigger, as userspace properly backs off in case it cannot get grip of the device_hotplug_lock. (lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock)) -- Thanks, David / dhildenb