On Wed 31-07-19 15:12:12, David Hildenbrand wrote:
> On 31.07.19 14:43, Michal Hocko wrote:
> > On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
> >> Each memory block spans the same number of sections/pages/bytes. The
> >> size is determined before the first memory block is created. No need
> >> to store what we can easily calculate - and the calculations even
> >> look simpler now.
> >
> > While this cleanup helps a bit, I am not sure it is really worth
> > bothering with. I guess we can agree when I say that the memblock
> > interface is suboptimal (to put it mildly). Shouldn't we strive for
> > making it a real hotplug API in the future? What do I mean by that?
> > Why should any memblock be fixed in size? Shouldn't we use
> > hotpluggable units instead (aka a pfn range that userspace can work
> > with sensibly)? Do we know of any existing userspace that would depend
> > on the current single-section resp. 2GB-sized memblocks?
>
> Short story: It is already ABI (e.g.,
> /sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
> since we had memory block devices.
>
> I suspect that it is mainly used manually. But I might be wrong.

Any pointer to real userspace depending on it? Most usecases I am aware
of rely on udev events and either online or offline the memory in the
handler.

I know we have documented this as an ABI, and it is really _sad_ that
this ABI didn't go through the normal scrutiny any user-visible
interface should go through, but these are sins of the past...

> Long story:
>
> How would you want to number memory blocks? At least no longer by phys
> index. For now, memory blocks are ordered and numbered by their block
> id.

memory_${mem_section_nr_of_start_pfn}

> Admins might want to online parts of a DIMM MOVABLE/NORMAL, to more
> reliably use huge pages but still have enough space for kernel memory
> (e.g., page tables). They might like that a DIMM is actually a set of
> memory blocks instead of one big chunk.

They might. Do they though? There are many theoretical usecases, but
let's face it, there is a cost attached to the current state. E.g., the
number of memblock directories is already quite large on machines with
a lot of memory, even though they use large blocks. That already has
negative implications (e.g., the number of events you get, any
iteration over /sys, etc.). Also, 2G memblocks are quite arbitrary and
they already limit the above usecase somewhat, right?

> IOW: You can consider it a restriction to add e.g. DIMMs only in one
> bigger chunk.
>
> >
> > All that being said, I do not oppose the patch, but can we start
> > thinking about the underlying memblock limitations rather than micro
> > cleanups?
>
> I am pro cleaning up what we have right now, not expecting it to
> eventually change at some point in the future. (btw, I highly doubt it
> will change)

I do agree, but relying on a fixed memblock size doesn't really go along
with variable memblock sizes if we ever go there. But as I've said, I am
not really against the patch.
-- 
Michal Hocko
SUSE Labs
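
[For readers following the thread: below is a minimal C sketch of how a
userspace consumer - for instance a udev event handler - might drive the
sysfs ABI discussed above. Only the sysfs paths and state strings come
from the documented memory hotplug interface; the block id (32) and the
helper program itself are illustrative assumptions, not code from this
thread.]

/*
 * Sketch: read the fixed memory block size and online one block as
 * MOVABLE via the memory block device ABI. Block id 32 is made up.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long block_size;
	FILE *f;

	/* block_size_bytes is a hex value, e.g. 8000000 for 128M blocks */
	f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
	if (!f || fscanf(f, "%llx", &block_size) != 1) {
		perror("block_size_bytes");
		return 1;
	}
	fclose(f);
	printf("memory block size: %llu bytes\n", block_size);

	/*
	 * Online block 32 as MOVABLE; documented states include "online",
	 * "online_kernel", "online_movable" and "offline".
	 */
	f = fopen("/sys/devices/system/memory/memory32/state", "w");
	if (!f || fputs("online_movable", f) == EOF) {
		perror("memory32/state");
		return 1;
	}
	fclose(f);
	return 0;
}

This typically needs root, and writing "online_movable" only succeeds
while the target block is still offline.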