On Tue, Jul 30, 2024 at 07:13:03PM -0400, David Finkel wrote:
> Other mechanisms for querying the peak memory usage of either a process
> or a v1 memory cgroup allow for resetting the high watermark. Restore
> parity with those mechanisms, but with a less racy API.
>
> For example:
>  - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
>    the high watermark.
>  - Writing "5" to the clear_refs pseudo-file in a process's proc
>    directory resets the peak RSS.
>
> This change is an evolution of a previous patch, which mostly copied
> the cgroup v1 behavior. However, there were concerns about
> races/ownership issues with a global reset, so instead this change
> makes the reset file-descriptor-local.
>
> Writing any non-empty string to the memory.peak and memory.swap.peak
> pseudo-files resets the high watermark to the current usage for
> subsequent reads through that same FD.
>
> Notably, following Johannes's suggestion, this implementation moves the
> O(FDs that have written) behavior onto the FD write(2) path. Instead,
> on the page-allocation path, we simply add one additional watermark to
> conditionally bump per hierarchy level in the page counter.
>
> Additionally, this takes Longman's suggestion of nesting the
> page-charging-path checks for the two watermarks to reduce the number
> of common-case comparisons.
>
> This behavior is particularly useful for work scheduling systems that
> need to track the memory usage of worker processes/cgroups per work
> item. Since memory can't be squeezed like CPU can (the OOM killer has
> opinions), these systems need to track the peak memory usage to compute
> system/container fullness when binpacking work items.
>
> Most notably, Vimeo's use-case involves a system that does global
> binpacking across many Kubernetes pods/containers, and while we can use
> PSI for some local decisions about overload, we strive to avoid packing
> workloads too tightly in the first place. To facilitate this, we track
> the peak memory usage. However, since we run with long-lived workers
> (to amortize startup costs), we need a way to track the high watermark
> while a work item is executing. Polling runs the risk of missing short
> spikes that last for timescales below the polling interval, and peak
> memory tracking at the cgroup level is otherwise perfect for this
> use-case.
>
> As this data is used to ensure that binpacked work ends up with
> sufficient headroom, this use-case mostly avoids the inaccuracies
> surrounding reclaimable memory.
>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Suggested-by: Waiman Long <longman@xxxxxxxxxx>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Reviewed-by: Michal Koutný <mkoutny@xxxxxxxx>
> Signed-off-by: David Finkel <davidf@xxxxxxxxx>

Reviewed-by: Roman Gushchin <roman.gushchin@xxxxxxxxx>

Thanks!
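
For reference, userspace usage of the FD-local reset would look roughly
like the sketch below. This is only an illustration of the interface
described above: the cgroup path is made up, and error handling is
mostly omitted.

    /* Sketch of per-work-item peak tracking via memory.peak. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
    	char buf[64];
    	ssize_t n;
    	/* Hypothetical worker cgroup path. */
    	int fd = open("/sys/fs/cgroup/worker/memory.peak", O_RDWR);

    	if (fd < 0)
    		return 1;

    	/* Any non-empty write resets the watermark for this FD only. */
    	write(fd, "reset\n", 6);

    	/* ... run the work item ... */

    	/* Reads through the same FD report the peak since the reset. */
    	lseek(fd, 0, SEEK_SET);
    	n = read(fd, buf, sizeof(buf) - 1);
    	if (n > 0) {
    		buf[n] = '\0';
    		printf("peak since reset: %s", buf);
    	}

    	close(fd);
    	return 0;
    }

Other FDs open against the same memory.peak file (including ones that
never wrote) keep their own view, which is what makes this less racy
than the global v1-style reset.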