On Sat, Aug 12, 2023 at 1:35 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Fri 11-08-23 19:48:14, Shakeel Butt wrote:
> > On Fri, Aug 11, 2023 at 7:36 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Aug 11, 2023 at 7:29 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Aug 11, 2023 at 7:12 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > [...]
> > > > >
> > > > > I am worried that writing to a stat for flushing then reading will
> > > > > increase the staleness window which we are trying to reduce here.
> > > > > Would it be acceptable to add a separate interface to explicitly read
> > > > > flushed stats without having to write first? If the distinction
> > > > > disappears in the future we can just short-circuit both interfaces.
> > > >
> > > > What is the acceptable staleness time window for your case? It is hard
> > > > to imagine that a write+read will always be worse than just a read.
> > > > Even the proposed patch can have an unintended and larger than
> > > > expected staleness window due to some processing on
> > > > return-to-userspace or some scheduling delay.
> > >
> > > Maybe I am worrying too much, we can just go for writing to
> > > memory.stat for explicit stats refresh.
> > >
> > > Do we still want to go with the mutex approach Michal suggested for
> > > do_flush_stats() to support either waiting for ongoing flushes
> > > (mutex_lock) or skipping (mutex_trylock)?
> >
> > I would say keep that as a separate patch.
>
> Separate patches would be better but please make the mutex conversion
> first. We really do not want to have any busy waiting depending on a
> sleep exported to the userspace. That is just no-go.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

+tj@xxxxxxxxxx

That makes sense (rough sketch of the mutex conversion at the bottom
of this email).

Taking a step back though, and considering there have been other
complaints about unified flushing causing expensive reads from
memory.stat [1], I am wondering if we should tackle the fundamental
problem.

We have a single global rstat lock for flushing, which protects the
global per-cgroup counters as far as I understand. A single lock means
a lot of contention, which is why we implemented unified flushing on
the memcg side in the first place, where we only let one flusher
operate and everyone else skip, but that flusher needs to flush the
entire tree. This can be unnecessarily expensive (see [1]), and to
avoid how expensive it is we sacrifice accuracy (which is what this
patch is about).

I am exploring breaking down that lock into per-cgroup locks, where a
flusher acquires locks in a top-down fashion (hand-wavy sketch also at
the bottom of this email). This allows for some concurrency in
flushing, and makes unified flushing unnecessary. If we retire unified
flushing we fix both accuracy and expensive reads at the same time,
while not sacrificing performance for concurrent in-kernel flushers.

What do you think? I am prototyping something now and running some
tests, it seems promising and simple-ish (unless I am missing a big
correctness issue).

[1] https://lore.kernel.org/lkml/CABWYdi3YNwtPDwwJWmCO-ER50iP7CfbXkCep5TKb-9QzY-a40A@xxxxxxxxxxxxxx/
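
For reference, this is roughly the do_flush_stats() mutex conversion I
understood from the discussion. Untested sketch on top of the current
mm/memcontrol.c, ignoring the atomic flushing variants for brevity;
the mutex and the _wait() name are made up here, not a final patch:

static DEFINE_MUTEX(stats_flush_mutex);

/* In-kernel flushers: skip if another flush is already in progress. */
static void do_flush_stats(void)
{
	if (!mutex_trylock(&stats_flush_mutex))
		return;

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_threshold, 0);
	mutex_unlock(&stats_flush_mutex);
}

/*
 * Userspace-triggered flush (e.g. a write to memory.stat): sleep on
 * the mutex and wait for any ongoing flush to finish instead of
 * skipping, so the reader never returns while the flush it depends on
 * is still running. No busy waiting anywhere.
 */
static void do_flush_stats_wait(void)
{
	mutex_lock(&stats_flush_mutex);
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_threshold, 0);
	mutex_unlock(&stats_flush_mutex);
}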
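
And this is the hand-wavy shape of the per-cgroup locking idea, only
to show the top-down acquisition order. The per-cgroup flush_lock,
flush_one() and the child iterator are all made up, the real thing
would live in the rstat code and would not recurse like this:

/*
 * One flush lock per cgroup, taken top-down: lock the subtree root,
 * flush its per-cpu deltas, then walk down, taking each child's lock
 * while still holding the ancestors' locks, so a child's deltas
 * propagate into a parent whose counters are stable. (Propagation out
 * of the subtree root into *its* parent still needs care, that may be
 * the correctness issue I am missing.) Flushers of disjoint subtrees
 * never contend, so reading memory.stat of a small cgroup no longer
 * pays for flushing the entire tree.
 */
static void flush_subtree_locked(struct cgroup *cgrp)
{
	struct cgroup *child;

	flush_one(cgrp);	/* propagate cgrp's per-cpu deltas upward */

	for_each_child(child, cgrp) {
		mutex_lock(&child->flush_lock);
		flush_subtree_locked(child);
		mutex_unlock(&child->flush_lock);
	}
}

static void flush_subtree(struct cgroup *cgrp)
{
	mutex_lock(&cgrp->flush_lock);
	flush_subtree_locked(cgrp);
	mutex_unlock(&cgrp->flush_lock);
}

The nice property is that readers of sibling cgroups can flush
concurrently, and an in-kernel flusher of a small subtree no longer
has to choose between flushing the whole tree and skipping to read
stale numbers.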