On Thu, 2022-08-04 at 10:35 +1200, Kai Huang wrote: > On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote: > > On 8/2/22 19:37, Kai Huang wrote: > > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote: > > > > Also, if I understand correctly above, your suggestion is we want to prevent any > > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs > > > > during boot). This looks contradicts to the requirement of being able to allow > > > > moving memory from core-mm to driver. When we offline the memory, we cannot > > > > know whether the memory will be used by driver, or later hot-removed. > > > Hi Dave, > > > > > > The high level flow of device hot-removal is: > > > > > > acpi_scan_hot_remove() > > > -> acpi_scan_try_to_offline() > > > -> acpi_bus_offline() > > > -> device_offline() > > > -> memory_subsys_offline() > > > -> acpi_bus_trim() > > > -> acpi_memory_device_remove() > > > > > > > > > And memory_subsys_offline() can also be triggered via /sysfs: > > > > > > echo 0 > /sys/devices/system/memory/memory30/online > > > > > > After the memory block is offline, my understanding is kernel can theoretically > > > move it to, i.e. ZONE_DEVICE via memremap_pages(). > > > > > > As you can see memory_subsys_offline() is the entry point of memory device > > > offline (before it the code is generic for all ACPI device), and it cannot > > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems > > > we are unable to refuse to offline memory in memory_subsys_offline() when it is > > > called from ACPI event. > > > > > > Any comments? > > > > I suggest refactoring the code in a way that makes it possible to > > distinguish the two cases. > > > > It's not like you have some binary kernel. You have the source code for > > the whole thing and can propose changes *ANYWHERE* you need. Even better: > > > > $ grep -A2 ^ACPI\$ MAINTAINERS > > ACPI > > M: "Rafael J. Wysocki" <rafael@xxxxxxxxxx> > > R: Len Brown <lenb@xxxxxxxxxx> > > > > The maintainer of ACPI works for our employer. Plus, he's a nice > > helpful guy that you can go ask how you might refactor this or > > approaches you might take. Have you talked to Rafael about this issue? > > Rafael once also suggested to set hotplug.enabled to 0 as your code shows below, > but we just got the TDX architecture behaviour of memory hotplug clarified from > Intel TDX guys recently. > > > Also, from a two-minute grepping session, I noticed this: > > > > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data, > > > void **ret_p) > > > { > > ... > > > if (device->handler && !device->handler->hotplug.enabled) { > > > *ret_p = &device->dev; > > > return AE_SUPPORT; > > > } > > > > It looks to me like if you simply set: > > > > memory_device_handler->hotplug.enabled = false; > > > > you'll get most of the behavior you want. ACPI memory hotplug would not > > work and the changes would be confined to the ACPI world. The > > "lower-level" bus-based hotplug would be unaffected. > > > > Now, I don't know what kind of locking would be needed to muck with a > > global structure like that. But, it's a start. > > This has two problems: > > 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory > hotplug, as it disables ACPI memory hotplug for all. But this is fine as we > want to reject non-CMR memory hotplug anyway. We just need to explain clearly > in changelog. > > 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug > actually happens. Instead, we can only print out "hotplug is disabled due to > TDX is enabled by BIOS." when we set hotplug.enable to false. > > Assuming above is OK, I'll explore this option. I'll also do some research to > see if it's still possible to speak out "BIOS bug" in this approach but it's not > a mandatory requirement to me now. > > Also, if print out "BIOS bug" for CMR memory hotplug isn't mandatory, then we > can just detect TDX during kernel boot, and disable hotplug when TDX is enabled > by BIOS, but don't need to use "winner-take-all" approach. The former is > clearer and easier to implement. I'll go with the former approach if I don't > hear objection from you. > > And ACPI CPU hotplug can also use the same way. > > Please let me know any comments. Thanks! > One more reason why "winner-take-all" approach doesn't work: If we allow ACPI memory hotplug to happen but choose to disable it in the handler using "winner-take-all", then at the beginning the ACPI code will actually create a /sysfs entry for hotplug.enabled to allow userspace to change it: /sys/firmware/acpi/hotplug/memory/enabled Which means even we set hotplug.enabled to false at some point, userspace can turn it on again. The only way is to not create this /sysfs entry at the beginning. With "winner-take-all" approach, I don't think we should avoid creating the /sysfs entry. Nor we should introduce arch-specific hook to, i.e. prevent /sysfs entry being changed by userspace. So instead of "winner-take-all" approach, I'll introduce a new kernel command line to allow user to choose between ACPI CPU/memory hotplug vs TDX. This command line should not impact the "software" CPU/memory hotplug even when user choose to use TDX. In this case, this is similar to "winner-take-all" anyway.