On Thu, Mar 23, 2023 at 11:08 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > >  		return;
> > > > >
> > > > >  	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > > -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > > +	cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > > >
> > > > How is it safe to call this with may_sleep=true when it's holding the
> > > > stats_flush_lock?
> > >
> > > stats_flush_lock is always acquired with trylock; it is only used today
> > > so that we can skip flushing if another CPU is already doing a flush
> > > (which is not 100% correct, as that CPU may not have finished flushing
> > > yet, but that's orthogonal here). So I think it should be safe to sleep,
> > > as no one can be blocked waiting for this spinlock.
> >
> > I see. It still cannot sleep while the lock is held, though, because
> > preemption is disabled. Make sure you have all lock debugging on while
> > testing this.
>
> Thanks for pointing this out, will do.
>
> > > Perhaps it would be better semantically to replace the spinlock with
> > > an atomic test-and-set, instead of having a lock that can only be used
> > > with trylock?
> >
> > It could be helpful to clarify what stats_flush_lock is protecting
> > first. Keep in mind that locks should protect data, not code paths.
> >
> > Right now it's doing multiple things:
> >
> > 1. It protects updates to stats_flush_threshold
> > 2. It protects updates to flush_next_time
> > 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
> >
> > However,
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> > is lockless. If the reader races with a flush, you could see this:
> >
> >    CPU A                           CPU B
> >    -----                           -----
> >                                    if (time_after(jiffies, flush_next_time))
> >    spin_trylock()
> >    flush_next_time = now + delay
> >    flush()
> >    spin_unlock()
> >                                    spin_trylock()
> >                                    flush_next_time = now + delay
> >                                    flush()
> >                                    spin_unlock()
> >
> > which means we can already get flushes at a higher frequency than
> > FLUSH_TIME during races. But it isn't really a problem.
> >
> > The reader could also see garbled partial updates, so it needs at
> > least READ_ONCE and WRITE_ONCE protection.
> >
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> > factors is currently broken because of the race in 2. But the race
> > is actually harmless; all we might get is the occasional earlier
> > flush. If there is no delta, the flush won't do much. And if there
> > is, the flush is justified.
> >
> > In summary, it seems to me the lock can be ditched altogether. All the
> > code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
>
> Thanks a lot for this analysis. I agree that the lock can be removed
> with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
> of the lock that we are missing here.
>
> I think one other purpose of the lock is avoiding a thundering herd
> problem on cgroup_rstat_lock, particularly from reclaim context, as
> mentioned by the log of commit aa48e47e3906 ("memcg: infrastructure
> to flush memcg stats").
>
> While testing, I did notice that removing this lock indeed causes a
> thundering herd problem if we have a lot of concurrent reclaimers. The
> trylock makes sure we abort immediately if someone else is flushing --
> which is not ideal, because that flusher might have just started, and
> we may end up reading stale data anyway.
>
> This is why I suggested replacing the lock with an atomic and doing
> something like this if we want to maintain the current behavior:
>
> static void __mem_cgroup_flush_stats(void)
> {
>         ...
>         if (atomic_xchg(&ongoing_flush, 1))
>                 return;
>         ...
>         atomic_set(&ongoing_flush, 0);
> }
>
> Alternatively, if we want to change the behavior and wait for the
> concurrent flusher to finish flushing, we can maybe spin until
> ongoing_flush goes back to 0 and then return:
>
> static void __mem_cgroup_flush_stats(void)
> {
>         ...
>         if (atomic_xchg(&ongoing_flush, 1)) {
>                 /* wait until the ongoing flusher finishes to get updated stats */
>                 while (atomic_read(&ongoing_flush))
>                         ;
>                 return;
>         }
>         /* flush the stats ourselves */
>         ...
>         atomic_set(&ongoing_flush, 0);
> }
>
> WDYT?

I would go with your first approach, i.e., no spinning.
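To make the outcome concrete, here is a minimal sketch (kernel-style C, not compile-tested) of the direction the thread converges on: drop stats_flush_lock entirely, guard the flush with atomic_xchg() as in the first (non-spinning) variant above, and wrap flush_next_time accesses in READ_ONCE()/WRITE_ONCE() per the earlier analysis. It assumes the surrounding mm/memcontrol.c definitions of the era (flush_next_time, FLUSH_TIME, stats_flush_threshold, root_mem_cgroup, mem_cgroup_flush_stats_delayed); ongoing_flush is the hypothetical name from the discussion, not an existing upstream symbol, and the may_sleep plumbing from the patch under review is left out for brevity:

/* Hypothetical replacement for stats_flush_lock; name taken from the thread. */
static atomic_t ongoing_flush = ATOMIC_INIT(0);

static void __mem_cgroup_flush_stats(void)
{
	/*
	 * Only one flusher at a time. Everyone else bails out immediately
	 * instead of piling up on cgroup_rstat_lock -- the thundering herd
	 * the trylock used to prevent.
	 */
	if (atomic_xchg(&ongoing_flush, 1))
		return;

	/* Pairs with the READ_ONCE() in the lockless reader below. */
	WRITE_ONCE(flush_next_time, jiffies_64 + 2 * FLUSH_TIME);

	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_threshold, 0);

	atomic_set(&ongoing_flush, 0);
}

void mem_cgroup_flush_stats(void)
{
	/* Skip the flush entirely if the accumulated error is small. */
	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
		__mem_cgroup_flush_stats();
}

/* Lockless ratelimited reader of flush_next_time. */
void mem_cgroup_flush_stats_delayed(void)
{
	if (time_after64(get_jiffies_64(), READ_ONCE(flush_next_time)))
		mem_cgroup_flush_stats();
}

As with the existing trylock, a caller that loses the xchg race returns before the ongoing flush finishes and may read slightly stale stats; that is the trade-off the "no spinning" choice accepts.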