On Wed, Nov 24, 2021 at 01:55:16PM +0900, David Stevens wrote:
> > >> And if you look at things from the context of a specific userspace
> > >> process, there will be other processes running and using memory. So
> > >> while that statement is true with respect to this change, that is
> > >> also true without this change. The specific details might be changed
> > >> by the proposed parameter, but it wouldn't be introducing any
> > >> fundamentally new behavior to Linux.
> > >
> > > Please note that the hyper-v balloon just recently switched to using
> > > adjust_managed_page_count() for proper accounting reasons:
> > >
> > > commit d1df458cbfdb0c3384c03c7fbcb1689bc02a746c
> > > Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> > > Date:   Wed Dec 2 17:12:45 2020 +0100
> > >
> > >     hv_balloon: do adjust_managed_page_count() when ballooning/un-ballooning
> > >
> > >     Unlike virtio_balloon/virtio_mem/xen balloon drivers, Hyper-V balloon
> > >     driver does not adjust managed pages count when ballooning/un-ballooning
> > >     and this leads to incorrect stats being reported, e.g. unexpected 'free'
> > >     output.
> > >
> > >     Note, the calculation in post_status() seems to remain correct: ballooned
> > >     out pages are never 'available' and we manually add
> > >     dm->num_pages_ballooned to 'commited'.
>
> I saw this commit, but it wasn't entirely clear to me what problem it was
> addressing. Is it the issue Michael pointed out on v1 of my patch set,
> where memory in the balloon shouldn't be included in the free stat
> reported to the device? This version of my patch should address that
> specific issue. The managed page count is Linux-kernel-specific metadata,
> so there's no fundamental reason that it needs to line up exactly with
> anything reported via the virtio-balloon API.
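For context, the accounting pattern that commit adopts (and that virtio-balloon
already applies when deflate-on-oom is not negotiated) boils down to roughly
the sketch below. adjust_managed_page_count() is the existing in-kernel helper
behind the "managed pages" total that feeds MemTotal; the two wrapper
functions are made-up names for illustration, not actual driver code.

/*
 * Rough sketch, not actual hv_balloon or virtio-balloon code: move pages
 * out of (and back into) the managed-pages total as they are ballooned.
 */
#include <linux/mm.h>

/* Hypothetical helper: called once per page handed to the host. */
static void example_balloon_page_inflated(struct page *page)
{
	/* Page is gone from the guest's point of view: drop it from the
	 * managed total, so MemTotal (and friends) shrink accordingly. */
	adjust_managed_page_count(page, -1);
}

/* Hypothetical helper: called once per page returned to the guest. */
static void example_balloon_page_deflated(struct page *page)
{
	/* Page is usable again: add it back to the managed total. */
	adjust_managed_page_count(page, 1);
}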
> > >> That approach would require a lot of changes to userspace - probably
> > >> nearly everywhere that uses _SC_PHYS_PAGES or get_phys_pages, or
> > >> anywhere that parses /proc/meminfo. Actually properly using
> > >> "logically offline pages" would require an additional API for
> > >> monitoring changes to the value, and updating to that sort of
> > >> listener API would not be a localized change, especially since most
> > >> programs do not account for memory hotplug and just use the amount of
> > >> physical memory during initialization. Realistically, nearly all of
> > >> the callers would simply add together "logically offline pages" and
> > >> MemTotal.
> > >
> > > I'd appreciate a more generic approach for user space to figure out
> > > the "initial memory size" in a virtualized environment than adding
> > > some module parameter to virtio-balloon -- if that makes sense.
> > >
> > > MemTotal as is expresses how much memory the buddy currently manages:
> > > for example, it excludes early allocations during boot, actually
> > > unplugged memory, and logically unplugged memory. Adjusting that value
> > > makes perfect sense for virtio-balloon without deflate-on-oom.
>
> That's a definition of how MemTotal is implemented, but it's not really a
> specification of the MemTotal API. The closest thing to a real
> specification I can find is "Total usable RAM (i.e., physical RAM minus a
> few reserved bits and the kernel binary code)", from the proc man pages. I
> think there is quite a bit of leeway in changing how exactly MemTotal is
> implemented without violating the (quite vague) specification or changing
> any observable semantics of the API. In particular, leaving the pages in
> the balloon as part of MemTotal is essentially indistinguishable from
> simply having a non-OOM-killable process locking an equivalent amount of
> memory. So this proposal isn't really introducing any fundamentally new
> behavior to the Linux kernel.
>
> > >> It's also not clear to me what utility the extra information would
> > >> provide to userspace. If userspace wants to know how much memory is
> > >> available, it should use MemAvailable. If userspace wants a rough
> > >> estimate of the maximum amount of memory in the system, it would add
> > >> together MemTotal and "logically offline pages". The value of
> > >> MemTotal with a no-deflate-on-oom virtio-balloon has a vague meaning
> > >> that lies somewhere between the "maximum amount of memory" and the
> > >> "current amount of memory". I don't really see any situations where
> > >> it should clearly be used over either MemAvailable or MemTotal +
> > >> "logically offline pages".
> > >
> > > The issue is that any application that relies on MemTotal in a
> > > virtualized environment is most probably already suboptimal in some
> > > cases. You can rely on it, and then later someone unplugs memory
> > > (inflates the balloon) or plugs memory (deflates the balloon). Even
> > > MemAvailable is suboptimal, because what about two applications that
> > > rely on that information at the same time?
> >
> > BTW, the general issue here is that "we don't know what the hypervisor
> > will do".
>
> I do agree that this is a significant problem. I would expand on it a bit
> more, to "since we don't know what the hypervisor will do, we don't know
> how to treat memory in the balloon". The proposed module parameter is more
> or less a mechanism to allow the system administrator to tell the
> virtio_balloon driver how the hypervisor behaves.

Now that you put it that way, it looks more like this should be a feature
bit, not a module parameter.

> And if the hypervisor will give memory back to the guest when the guest
> needs it, then I don't think it's necessary to logically unplug the
> memory.

Ideally we would also pair this with sending a signal to the device that
memory is needed.

> It might be a bit cleaner to explicitly address this in the virtio_balloon
> protocol. We could add a min_num_pages field to the balloon config, with
> semantics along the lines of "The host will respond to memory pressure in
> the guest by deflating the balloon down to min_num_pages, unless it would
> cause system instability in the host". Given that feature, I think it
> would be reasonable to only consider min_num_pages as logically unplugged.

Okay. I think I would do it a bit differently though: make num_pages be the
min_num_pages, and add an extra_num_pages field for memory that is nice to
have but OK to drop. As long as we are here, can we add a page_shift field,
please, so that more than 2^44 bytes can be requested?
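Concretely, the config space would then end up looking roughly like the
sketch below. This is only an illustration of the proposal being discussed,
not an agreed-on or existing layout; num_pages and actual are fields the
balloon config already has, while extra_num_pages and page_shift are the
additions suggested above.

/*
 * Rough sketch of the config-space layout under discussion; not part of
 * the virtio spec or of the uapi header.
 */
#include <linux/types.h>

struct example_virtio_balloon_config {
	/* Pages the host needs back; the guest would treat these as
	 * logically unplugged (the "min_num_pages" role). */
	__le32 num_pages;
	/* Pages actually in the balloon right now. */
	__le32 actual;
	/* Additional pages that are nice to have but OK for the guest to
	 * reclaim under memory pressure. */
	__le32 extra_num_pages;
	/* log2 of the balloon page size, so that targets larger than
	 * 2^44 bytes (2^32 pages * 4 KiB) can be expressed. */
	__le32 page_shift;
};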
> > Maybe "MemMax" actually could make sense, where we expose the maximum
> > "MemTotal" we had so far since we were up and running. So the semantics
> > wouldn't be "maximum possible", because we don't know that, but instead
> > "maximum we had".
>
> Rather than add a new API, I think it is much better to make existing APIs
> behave closer to how they behave in a non-virtualized environment. It is
> true that we could go through and fix a limited number of special user
> space applications, but sysconf(_SC_PHYS_PAGES) and /proc/meminfo are not
> special APIs. Fixing every application that uses them is not feasible,
> especially when taking into account systems with closed-source
> applications (e.g. Android). Also, while MemMax is well defined, it has
> the same issues you brought up earlier: specifically, applications don't
> know whether the hypervisor will actually ever provide MemMax again, and
> they don't know whether MemMax is really the maximum amount of memory that
> could be available in the future. It's not clear to me that it's
> significantly better or more useful to userspace than simply changing how
> MemTotal is implemented.
>
> -David

Agree on trying to avoid changing applications and limiting changes to the
device and the guest kernel; this has a lot of value.

-- 
MST
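For reference, the user-space pattern David describes typically looks like
the minimal sketch below (not taken from any particular application). On
glibc/Linux, sysconf(_SC_PHYS_PAGES) reflects the same managed-memory total
that backs MemTotal, which is why a sizing decision made this way never
notices a later balloon inflate or deflate.

/*
 * Minimal user-space sketch (not from any real application) of the common
 * pattern: size something once at startup from reported physical memory.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long pages = sysconf(_SC_PHYS_PAGES);   /* reflects MemTotal */
	long page_size = sysconf(_SC_PAGE_SIZE);
	long long total = (long long)pages * page_size;

	/* One-shot sizing decision; a balloon inflate/deflate that later
	 * changes MemTotal is never noticed. */
	long long cache_budget = total / 8;

	printf("total: %lld bytes, cache budget: %lld bytes\n",
	       total, cache_budget);
	return 0;
}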