On 01.08.19 10:27, Michal Hocko wrote:
> On Thu 01-08-19 09:00:45, David Hildenbrand wrote:
>> On 01.08.19 08:13, Michal Hocko wrote:
>>> On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
>>>> On 31.07.19 16:37, Michal Hocko wrote:
>>>>> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
>>>>> [...]
>>>>>>> Thinking about it some more, I believe that we can reasonably provide
>>>>>>> both APIs, controllable by a command line parameter for backwards
>>>>>>> compatibility. It is the hotplug code that controls the sysfs APIs,
>>>>>>> e.g., create one sysfs entry per add_memory_resource for the new
>>>>>>> semantic.
>>>>>>
>>>>>> Yeah, but the real question is: who needs it? I can only think of
>>>>>> some DIMM scenarios (some, not all). I would be interested in more
>>>>>> use cases. Of course, to provide and maintain two APIs we need a good
>>>>>> reason.
>>>>>
>>>>> Well, my 3TB machine that has 7 movable nodes could really go with
>>>>> less than
>>>>> $ find /sys/devices/system/memory -name "memory*" | wc -l
>>>>> 1729
>>>>
>>>> The question is if it would be sufficient to increase the memory block
>>>> size even further for these kinds of systems (e.g., via a boot
>>>> parameter - I think we have that on UV systems) instead of having
>>>> blocks of different sizes. Say, 128GB blocks, because you're not going
>>>> to hotplug 128MB DIMMs into such a system - at least that's my guess ;)
>>>
>>> The system has
>>> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]
>>> hotpluggable memory. I would love to have just those 7 memory blocks to
>>> work with. Any smaller-grained split is just not helping, as the
>>> platform will not be able to hotremove it anyway.
>>
>> So the smallest granularity in your system is indeed 128GB (btw, nice
>> system, I wish I had something like that), the biggest one 512GB.
>>
>> Using a memory block size of 128GB would imply 24 memory blocks on a
>> 3TB system - which is tolerable IMHO. Especially performance-wise,
>> there shouldn't be a real difference compared to 7 blocks. Hotunplug
>> triggered via ACPI will take care of offlining the right DIMMs.
>
> The problem with a fixed-size memblock is that you might not know how
> much memory you will have until much later after boot. For example, it
> should be quite reasonable to expect that this particular machine would
> boot with node 0 only and have additional boards with memory added
> during runtime. How big should the memblock be then? And I believe that
> the virtualization use case is similar in that regard: you get memory
> on demand.

Well, via a kernel parameter you could make it configurable (just as on
UV systems). Not optimal, but it would work in many scenarios. I see
virtualization environments rather moving away from inflexible huge
DIMMs towards hotplugging smaller granularities (Hyper-V balloon, Xen
balloon, virtio-mem).

>> Of course, 7 blocks would be nicer, but as discussed, that is not
>> possible with the current ABI.
>
> As I've said, if we want to move forward we have to change the API we
> have right now.

With backward compatible option of course.
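FWIW, "backward compatible" here means not breaking the per-block sysfs
interface user space relies on today - a minimal sketch of that
interface (the block number 512 is made up for illustration; the actual
numbers and block size depend on the system):

$ # all blocks share one size, exposed in hex bytes (8000000 == 128MB)
$ cat /sys/devices/system/memory/block_size_bytes
8000000
$ # each block is a separate device that can be offlined individually
$ cat /sys/devices/system/memory/memory512/state
online
$ echo offline > /sys/devices/system/memory/memory512/state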
> I am not convinced a new API is really worth it yet.

-- 
Thanks,

David / dhildenb