On 31.07.19 15:25, Michal Hocko wrote:
> On Wed 31-07-19 15:12:12, David Hildenbrand wrote:
>> On 31.07.19 14:43, Michal Hocko wrote:
>>> On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
>>>> Each memory block spans the same amount of sections/pages/bytes. The size
>>>> is determined before the first memory block is created. No need to store
>>>> what we can easily calculate - and the calculations even look simpler now.
>>>
>>> While this cleanup helps a bit, I am not sure it is really worth
>>> bothering. I guess we can agree when I say that the memblock interface
>>> is suboptimal (to put it mildly). Shouldn't we strive for making it
>>> a real hotplug API in the future? What do I mean by that? Why should
>>> any memblock be fixed in size? Shouldn't we use hotpluggable units
>>> instead (aka pfn ranges that userspace can work with sensibly)? Do we
>>> know of any existing userspace that would depend on the current single
>>> section resp. 2GB sized memblocks?
>>
>> Short story: It is already ABI (e.g.,
>> /sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
>> since we have had memory block devices.
>>
>> I suspect that it is mainly used manually. But I might be wrong.
>
> Any pointer to the real userspace depending on it? Most usecases I am
> aware of rely on udev events and either online or offline the memory
> in the handler.

Yes, that's also what I know of - onlining and triggering kexec().

On s390x, admins online sub-increments to selectively add memory to a
VM - but we could still emulate that use case by adding memory in the
current granularity in the kernel.
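Just to illustrate what "no need to store what we can easily calculate" means in practice: with a fixed block size, a block's id and physical span follow from simple arithmetic. A rough Python sketch (the 128 MiB block size is an assumption - the real value is arch/config dependent and exposed via /sys/devices/system/memory/block_size_bytes):

```python
# Sketch of the fixed-size memory block arithmetic discussed above.
# BLOCK_SIZE is an assumed value (128 MiB, a common x86-64 default);
# real userspace would read /sys/devices/system/memory/block_size_bytes.

BLOCK_SIZE = 128 << 20  # assumed memory block size in bytes

def block_id(phys_addr: int) -> int:
    """Memory block id covering a physical address (blocks are fixed size)."""
    return phys_addr // BLOCK_SIZE

def block_span(bid: int) -> tuple[int, int]:
    """Physical [start, end) range covered by block 'memory<bid>'."""
    start = bid * BLOCK_SIZE
    return start, start + BLOCK_SIZE

# A 4 GiB DIMM plugged in at physical address 4 GiB spans 32 blocks:
start, size = 4 << 30, 4 << 30
blocks = [block_id(a) for a in range(start, start + size, BLOCK_SIZE)]
assert len(blocks) == 32 and blocks[0] == 32
```

This is also why a DIMM shows up as a set of memory block devices rather than one big chunk.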
See https://books.google.de/books?id=afq4CgAAQBAJ&pg=PA117&lpg=PA117&dq=/sys/devices/system/memory/block_size_bytes&source=bl&ots=iYk_vW5O4G&sig=ACfU3U0s-O-SOVaQO-7HpKO5Hj866w9Pxw&hl=de&sa=X&ved=2ahUKEwjOjPqIot_jAhVPfZoKHcxpAqcQ6AEwB3oECAgQAQ#v=onepage&q=%2Fsys%2Fdevices%2Fsystem%2Fmemory%2Fblock_size_bytes&f=false

> I know we have documented this as an ABI and it is really _sad_ that
> this ABI didn't get the normal scrutiny any user visible interface
> should go through, but these are sins of the past...

A quick google search indicates that Kata containers query the block
size: https://github.com/kata-containers/runtime/issues/796

Powerpc userspace queries it:
https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ

I can imagine that ppc dynamic memory onlines only pieces of added
memory - DIMMs AFAIK (haven't looked at the details).

There might be more users.

>> Long story:
>>
>> How would you want to number memory blocks? At least no longer by phys
>> index. For now, memory blocks are ordered and numbered by their block id.
>
> memory_${mem_section_nr_of_start_pfn}

Fair enough, although this could break some scripts where people
manually offline/online specific blocks. (but who knows what
people/scripts do :( )

>> Admins might want to online parts of a DIMM MOVABLE/NORMAL, to more
>> reliably use huge pages but still have enough space for kernel memory
>> (e.g., page tables). They might like that a DIMM is actually a set of
>> memory blocks instead of one big chunk.
>
> They might. Do they though? There are many theoretical usecases but
> let's face it, there is a cost to the current state. E.g. the
> number of memblock directories is already quite large on machines with a
> lot of memory even though they use large blocks. That has negative
> implications already (e.g. the number of events you get, any iteration
> over /sys etc.).
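To make the naming question concrete, here is a small sketch contrasting the current consecutive block-id naming with the suggested memory_${mem_section_nr_of_start_pfn} scheme. The constants are assumptions for illustration only (128 MiB sections and 2 GiB blocks, as on large x86-64 machines); real values depend on the architecture and configuration:

```python
# Sketch contrasting the two sysfs naming schemes discussed above.
# SECTION_SIZE and BLOCK_SIZE are assumed values (x86-64-like);
# they are NOT fixed by the kernel ABI.

SECTION_SIZE = 128 << 20          # assumed sparsemem section size
BLOCK_SIZE = 2 << 30              # assumed memory block size
SECTIONS_PER_BLOCK = BLOCK_SIZE // SECTION_SIZE  # 16 with these values

def name_by_block_id(bid: int) -> str:
    """Current scheme: blocks numbered consecutively by block id."""
    return f"memory{bid}"

def name_by_start_section(bid: int) -> str:
    """Suggested scheme: named after the section number of the start pfn."""
    return f"memory_{bid * SECTIONS_PER_BLOCK}"

assert name_by_block_id(3) == "memory3"
assert name_by_start_section(3) == "memory_48"
```

The second scheme encodes the block's location, so the names would no longer be consecutive - which is exactly why scripts that hardcode block names could break.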
> Also 2G memblocks are quite arbitrary and they already limit the above
> usecase somewhat, right?

I mean, there are other theoretical issues: Onlining a very big DIMM in
one shot might trigger an OOM, while slowly adding/onlining it currently
works. Who knows if that is relevant in practice.

Also, it would break the current use case of memtrace, which removes
memory in a granularity in which it wasn't added. But luckily, memtrace
is an exception :)

>> IOW: You can consider it a restriction to add e.g. DIMMs only in one
>> bigger chunk.
>>
>>> All that being said, I do not oppose the patch, but can we start
>>> thinking about the underlying memblock limitations rather than micro
>>> cleanups?
>>
>> I am pro cleaning up what we have right now, not expecting it to
>> eventually change at some point in the future. (btw, I highly doubt it
>> will change)
>
> I do agree, but having the memblock size fixed doesn't really go along
> with variable memblock sizes if we ever go there. But as I've said, I am
> not really against the patch.

Fair enough, for now I am not convinced that we will actually see
variable-sized memory blocks in the near future.

Thanks for the discussion (I was thinking about the same concept a while
back when trying to find out if there could be an easy way to identify
which memory blocks belong to a single DIMM you want to eventually
unplug, and therefore online them all to the MOVABLE zone).

--
Thanks,

David / dhildenb