On Wed, Jul 17, 2024 at 1:04 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Jul 16, 2024 at 06:44:11AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jul 16, 2024 at 03:48:17PM +0200, Michal Hocko wrote:
> > ...
> > > > This behavior is particularly useful for work scheduling systems that
> > > > need to track memory usage of worker processes/cgroups per-work-item.
> > > > Since memory can't be squeezed like CPU can (the OOM-killer has
> > > > opinions), these systems need to track the peak memory usage to compute
> > > > system/container fullness when binpacking workitems.
> >
> > Swap still has bad reps but there's nothing drastically worse about it than
> > page cache. ie. If you're under memory pressure, you get thrashing one way
> > or another. If there's no swap, the system is just memlocking anon memory
> > even when they are a lot colder than page cache, so I'm skeptical that no
> > swap + mostly anon + kernel OOM kills is a good strategy in general
> > especially given that the system behavior is not very predictable under OOM
> > conditions.
> >
> > > As mentioned down the email thread, I consider usefulness of peak value
> > > rather limited. It is misleading when memory is reclaimed. But
> > > fundamentally I do not oppose to unifying the write behavior to reset
> > > values.
> >
> > The removal of resets was intentional. The problem was that it wasn't clear
> > who owned those counters and there's no way of telling who reset what when.
> > It was easy to accidentally end up with multiple entities that think they
> > can get timed measurement by resetting.
> >
> > So, in general, I don't think this is a great idea. There are shortcomings
> > to how memory.peak behaves in that its meaningfulness quickly declines over
> > time. This is expected and the rationale behind adding memory.peak, IIRC,
> > was that it was difficult to tell the memory usage of a short-lived cgroup.
> >
> > If we want to allow peak measurement of time periods, I wonder whether we
> > could do something similar to pressure triggers - ie. let users register
> > watchers so that each user can define their own watch periods. This is more
> > involved but more useful and less error-inducing than adding reset to a
> > single counter.
> >
> > Johannes, what do you think?
>
> I'm also not a fan of the ability to reset globally.
>
> I seem to remember a scheme we discussed some time ago to do local
> state tracking without having the overhead in the page counter
> fastpath. The new data that needs to be tracked is a pc->local_peak
> (in the page_counter) and an fd->peak (in the watcher's file state).
>
> 1. Usage peak is tracked in pc->watermark, and now also in pc->local_peak.
>
> 2. Somebody opens the memory.peak. Initialize fd->peak = -1.
>
> 3. If they write, set fd->peak = pc->local_peak = usage.
>
> 4. Usage grows.
>
> 5. They read(). A conventional reader has fd->peak == -1, so we return
> pc->watermark. If the fd has been written to, return max(fd->peak,
> pc->local_peak).
>
> 6. Usage drops.
>
> 7. New watcher opens and writes. Bring up all existing watchers'
> fd->peak (that aren't -1) to pc->local_peak *iff* latter is bigger.
> Then set the new fd->peak = pc->local_peak = current usage as in 3.
>
> 8. See 5. again for read() from each watcher.
>
> This way all fd's can arbitrarily start tracking new local peaks with
> write(). The operation in the charging fast path is cheap. The write()
> is O(existing_watchers), which seems reasonable. It's fully backward
> compatible with conventional open() + read() users.
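To make sure I'm reading that right, here's a rough userspace model of the
scheme. The page_counter/watermark/local_peak/fd->peak names come straight
from your description; the function names, the fixed watcher array, and the
lack of locking are just scaffolding for the sketch, not proposed kernel code:

/*
 * Userspace model of the per-fd peak-tracking scheme described above.
 * Illustration only; names mirror the email, not actual kernel APIs.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_WATCHERS 8

struct page_counter {
	uint64_t usage;
	uint64_t watermark;   /* global, never-reset peak */
	uint64_t local_peak;  /* shared floor for fd-local peaks */
};

struct peak_fd {
	int64_t peak;         /* -1 until the watcher write()s */
	int in_use;
};

static struct peak_fd watchers[MAX_WATCHERS];

/* Charging fast path: only one extra compare-and-store for local_peak. */
static void charge(struct page_counter *pc, uint64_t nr)
{
	pc->usage += nr;
	if (pc->usage > pc->watermark)
		pc->watermark = pc->usage;
	if (pc->usage > pc->local_peak)
		pc->local_peak = pc->usage;
}

static void uncharge(struct page_counter *pc, uint64_t nr)
{
	pc->usage -= nr;
}

/* Step 2: open() starts out as a conventional reader. */
static struct peak_fd *peak_open(void)
{
	for (int i = 0; i < MAX_WATCHERS; i++) {
		if (!watchers[i].in_use) {
			watchers[i].in_use = 1;
			watchers[i].peak = -1;
			return &watchers[i];
		}
	}
	return NULL;
}

/* Steps 3 and 7: write() starts a new local peak window for this fd. */
static void peak_write(struct page_counter *pc, struct peak_fd *fd)
{
	/* Preserve the peaks other watchers have already accumulated. */
	for (int i = 0; i < MAX_WATCHERS; i++) {
		struct peak_fd *w = &watchers[i];

		if (w->in_use && w->peak != -1 &&
		    (uint64_t)w->peak < pc->local_peak)
			w->peak = (int64_t)pc->local_peak;
	}
	fd->peak = (int64_t)pc->usage;
	pc->local_peak = pc->usage;
}

/* Step 5: read() is unchanged for conventional (never-written) users. */
static uint64_t peak_read(struct page_counter *pc, struct peak_fd *fd)
{
	if (fd->peak == -1)
		return pc->watermark;
	return (uint64_t)fd->peak > pc->local_peak ?
	       (uint64_t)fd->peak : pc->local_peak;
}

/* Walk through steps 2-8 with two watchers. */
int main(void)
{
	struct page_counter pc = {0};
	struct peak_fd *a = peak_open(), *b = peak_open();

	charge(&pc, 100);
	peak_write(&pc, a);     /* a's window starts at usage == 100 */
	charge(&pc, 50);        /* peak inside a's window: 150 */
	uncharge(&pc, 120);
	peak_write(&pc, b);     /* b's window starts at usage == 30 */
	charge(&pc, 10);

	printf("watermark: %llu\n", (unsigned long long)pc.watermark);      /* 150 */
	printf("a's peak:  %llu\n", (unsigned long long)peak_read(&pc, a)); /* 150 */
	printf("b's peak:  %llu\n", (unsigned long long)peak_read(&pc, b)); /* 40 */
	return 0;
}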
That scheme seems viable, but it's a lot more work to implement and
maintain than a simple global reset.

Since that scheme maintains a separate pc->local_peak, it's not mutually
exclusive with implementing a global reset now, as long as we reserve a
way to distinguish the different kinds of writes.

As discussed on other sub-threads, this might be too niche to be worth
the significant complexity of avoiding a global reset, especially since
users would likely be moving from cgroups v1, which does have a global
reset.

Thanks,

--
David Finkel
Senior Principal Software Engineer, Core Services