On 01.08.19 08:13, Michal Hocko wrote:
> On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
>> On 31.07.19 16:37, Michal Hocko wrote:
>>> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
>>> [...]
>>>>> Thinking about it some more, I believe that we can reasonably provide
>>>>> both APIs controllable by a command line parameter for backwards
>>>>> compatibility. It is the hotplug code to control sysfs APIs. E.g.
>>>>> create one sysfs entry per add_memory_resource for the new semantic.
>>>>
>>>> Yeah, but the real question is: who needs it. I can only think about
>>>> some DIMM scenarios (some, not all). I would be interested in more use
>>>> cases. Of course, to provide and maintain two APIs we need a good reason.
>>>
>>> Well, my 3TB machine that has 7 movable nodes could really go with less
>>> than
>>> $ find /sys/devices/system/memory -name "memory*" | wc -l
>>> 1729
>>
>> The question is if it would be sufficient to increase the memory block
>> size even further for these kinds of systems (e.g., via a boot parameter
>> - I think we have that on uv systems) instead of having blocks of
>> different sizes. Say, 128GB blocks because you're not going to hotplug
>> 128MB DIMMs into such a system - at least that's my guess ;)
>
> The system has
> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]
>
> hotpluggable memory. I would love to have those 7 memory blocks to work
> with. Any smaller grained split is just not helping as the platform will
> not be able to hotremove it anyway.
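
As a sanity check on the SRAT ranges quoted above, here is a minimal
Python sketch (plain arithmetic, not kernel code) that derives the size
of each hotpluggable range and the number of 128GB blocks needed to
cover them. The names SRAT_RANGES, BLOCK_SIZE, node_sizes_gib and
blocks_needed are made up for illustration, "GB" is read as GiB, and the
3TB total is the round figure from the mail, not a value taken from the
SRAT output:

```python
# Hotpluggable ranges copied from the quoted SRAT lines (end is inclusive).
SRAT_RANGES = {
    1: (0x10000000000, 0x17fffffffff),
    2: (0x80000000000, 0x87fffffffff),
    3: (0x90000000000, 0x97fffffffff),
    4: (0x100000000000, 0x107fffffffff),
    5: (0x110000000000, 0x117fffffffff),
    6: (0x180000000000, 0x183fffffffff),
    7: (0x190000000000, 0x191fffffffff),
}

GIB = 1 << 30
BLOCK_SIZE = 128 * GIB  # hypothetical memory block size under discussion


def node_sizes_gib(ranges):
    """Return {node: size in GiB} for inclusive [start, end] ranges."""
    return {node: (end - start + 1) // GIB
            for node, (start, end) in ranges.items()}


def blocks_needed(ranges, block_size):
    """Count blocks of block_size needed to cover each range (rounded up)."""
    return sum(-(-(end - start + 1) // block_size)
               for start, end in ranges.values())


sizes = node_sizes_gib(SRAT_RANGES)
for node, size in sorted(sizes.items()):
    print(f"node {node}: {size} GiB")

# Blocks for the listed hotpluggable ranges only.
print("blocks for listed ranges:", blocks_needed(SRAT_RANGES, BLOCK_SIZE))
# Blocks if the whole (assumed) 3TB of RAM used 128GB blocks.
print("blocks for 3TB total:", (3 * 1024 * GIB) // BLOCK_SIZE)
```

The listed ranges come out as 512GB for nodes 1-5, 256GB for node 6 and
128GB for node 7, i.e. 23 blocks of 128GB; dividing the full 3TB by
128GB gives 24.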

So the smallest granularity in your system is indeed 128GB (btw, nice
system, I wish I had something like that), and the biggest one is 512GB.

Using a memory block size of 128GB would imply 24 memory blocks on a 3TB
system - which is tolerable IMHO. Performance-wise in particular, there
shouldn't be a real difference to 7 blocks. Hotunplug triggered via ACPI
will take care of offlining the right DIMMs.

Of course, 7 blocks would be nicer, but as discussed, that is not
possible with the current ABI.

What we could do right now is finally make
"cat /sys/devices/system/memory/memory99/phys_device" indicate on x86-64
to which DIMM an added memory range belongs (if applicable). For now,
it's only used on s390x. We could store for each memory block the
"phys_index" - a.k.a. the section number of the lowest memory block of
an add_memory() range.

This would at least allow user space to identify all memory blocks that
logically belong together (DIMM) without ABI changes.

-- 
Thanks,

David / dhildenb
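
To illustrate how user space might consume the phys_device idea, here is
a minimal Python sketch that groups memory blocks by the value of their
phys_device attribute. It assumes the proposed semantics, i.e. that all
blocks added by one add_memory() call report the same value; this is not
what x86-64 does today. The function name group_blocks_by_phys_device
and the root parameter are hypothetical, and the attribute value is
treated as an opaque string:

```python
import os
import re
from collections import defaultdict

SYSFS_MEMORY = "/sys/devices/system/memory"


def group_blocks_by_phys_device(root=SYSFS_MEMORY):
    """Map each phys_device value to the sorted memory block ids reporting it.

    Assumption: blocks that logically belong together (e.g. one DIMM)
    report the same phys_device value, as proposed in the mail.
    """
    groups = defaultdict(list)
    for name in os.listdir(root):
        # Only directories named memory<N> are memory blocks.
        match = re.fullmatch(r"memory(\d+)", name)
        if not match:
            continue
        path = os.path.join(root, name, "phys_device")
        try:
            with open(path) as f:
                value = f.read().strip()
        except OSError:
            continue  # attribute missing or unreadable, skip this block
        groups[value].append(int(match.group(1)))
    return {dev: sorted(ids) for dev, ids in groups.items()}
```

Calling group_blocks_by_phys_device() without arguments reads the real
sysfs tree; passing a different root makes it possible to try the
grouping against a mock directory layout.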