On Thursday, March 09, 2017 11:15:47 PM Rafael J. Wysocki wrote:
> On Thursday, March 09, 2017 10:10:31 AM Dan Williams wrote:
> > On Thu, Mar 9, 2017 at 5:39 AM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> > > On Thursday, March 09, 2017 02:06:15 PM Heiko Carstens wrote:
> > >> Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
> > >> introduced new functions get/put_online_mems() and
> > >> mem_hotplug_begin/end() in order to allow similar semantics for memory
> > >> hotplug like for cpu hotplug.
> > >>
> > >> The corresponding functions for cpu hotplug are get/put_online_cpus()
> > >> and cpu_hotplug_begin/done() for cpu hotplug.
> > >>
> > >> The commit however missed to introduce functions that would serialize
> > >> memory hotplug operations like they are done for cpu hotplug with
> > >> cpu_maps_update_begin/done().
> > >>
> > >> This basically leaves mem_hotplug.active_writer unprotected and allows
> > >> concurrent writers to modify it, which may lead to problems as
> > >> outlined by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash,
> > >> use mem_hotplug_{begin, done}").
> > >>
> > >> That commit was extended again with commit b5d24fda9c3d ("mm,
> > >> devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
> > >> done}") which serializes memory hotplug operations for some call
> > >> sites by using the device_hotplug lock.
> > >>
> > >> In addition with commit 3fc21924100b ("mm: validate device_hotplug is
> > >> held for memory hotplug") a sanity check was added to
> > >> mem_hotplug_begin() to verify that the device_hotplug lock is held.
> > >
> > > Admittedly, I haven't looked at all of the code paths involved in detail yet,
> > > but there's one concern regarding lock/unlock_device_hotplug().
> > >
> > > The actual main purpose of it is to ensure safe removal of devices in cases
> > > when they cannot be removed separately, like when a whole CPU package
> > > (including possibly an entire NUMA node with memory and all) is removed.
> > >
> > > One of the code paths doing that is acpi_scan_hot_remove() which first
> > > tries to offline devices slated for removal and then finally removes them.
> > >
> > > The reason why this needs to be done in two stages is because the offlining
> > > can fail, in which case we will fail the entire operation, while the final
> > > removal step is, well, final (meaning that the devices are gone after it no
> > > matter what).
> > >
> > > This is done under device_hotplug_lock, so that the devices that were taken
> > > offline in stage 1 cannot be brought back online before stage 2 is carried
> > > out entirely, which surely would be bad if it happened.
> > >
> > > Now, I'm not sure if removing lock/unlock_device_hotplug() from the code in
> > > question actually affects this mechanism, but in case it does, it is one
> > > thing to double check before going ahead with this patch.
> > >
> >
> > I *think* we're ok in this case because unplugging the CPU package
> > that contains a persistent memory device will trigger
> > devm_memremap_pages() to call arch_remove_memory(). Removing a pmem
> > device can't fail. It may be held off while pages are pinned for DMA
> > memory, but it will eventually complete.
>
> What about the offlining, though? Is it guaranteed that no memory from those
> ranges will go back online after the acpi_scan_try_to_offline() call in
> acpi_scan_hot_remove()?

My point is that after the acpi_evaluate_ej0() in acpi_scan_hot_remove() the
hardware is physically gone, so if anything is still doing DMA to that memory
at that point, then the user is going to be unhappy.

Thanks,
Rafael
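
For reference, the two-stage scheme discussed above can be modelled in
userspace roughly as below. This is only a sketch under stated assumptions,
not the kernel code: struct toy_dev, toy_hotplug_lock, toy_online() and
toy_hot_remove() are hypothetical names, a pthread mutex stands in for
device_hotplug_lock, and the rollback on failure is an assumption of the
sketch that presumes every device was online on entry.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hot-removable device; not a kernel type. */
struct toy_dev {
	bool online;      /* device is currently in use */
	bool can_offline; /* false models a device that refuses to go offline */
	bool present;     /* false once the device has been removed */
};

/* Plays the role of device_hotplug_lock in this userspace model. */
static pthread_mutex_t toy_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Onlining takes the same lock, so it cannot run between stage 1 and
 * stage 2 of toy_hot_remove(). Returns 0 on success, -1 if the device
 * is already gone.
 */
static int toy_online(struct toy_dev *d)
{
	int ret = -1;

	pthread_mutex_lock(&toy_hotplug_lock);
	if (d->present) {
		d->online = true;
		ret = 0;
	}
	pthread_mutex_unlock(&toy_hotplug_lock);
	return ret;
}

/* Returns 0 on success, -1 if stage 1 failed and nothing was removed. */
static int toy_hot_remove(struct toy_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	pthread_mutex_lock(&toy_hotplug_lock);

	/* Stage 1: try to offline everything; this step may fail. */
	for (i = 0; i < n; i++) {
		if (!devs[i].can_offline) {
			ret = -1;
			break;
		}
		devs[i].online = false;
	}

	if (ret) {
		/* Undo stage 1 (assumes all devices were online on entry). */
		while (i-- > 0)
			devs[i].online = true;
	} else {
		/* Stage 2: final removal; there is no failure path here. */
		for (i = 0; i < n; i++)
			devs[i].present = false;
	}

	pthread_mutex_unlock(&toy_hotplug_lock);
	return ret;
}
```

Because both stages run inside one critical section, a concurrent
toy_online() either completes before stage 1 starts or finds the device
already removed; it can never bring a device back online in between.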