Hello,

On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> Yes, it is the same test (10K contending readers). The kernel change
> is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> so that the concurrent mem_cgroup_user_flush_stats() requests directly
> contend on cgroup_rstat_lock in cgroup_rstat_flush().

I don't think it'd be a good idea to twist rstat and other kernel
internal code to accommodate 10k parallel readers. If we want to support
that, let's explicitly support that by implementing better batching in
the read path. The only guarantee you need is that there has been at
least one flush since the read attempt started, so we can do something
like the following in the read path (a rough sketch is appended at the
end of this message):

1. Grab a waiter lock. Remember the current timestamp.

2. Try-lock the flush mutex. If obtained, drop the waiter lock, flush,
   regrab the waiter lock, update the latest flush time to my start
   time, and wake up the waiters on the waitqueue (maybe do custom
   wakeups based on start time?).

3. Release the waiter lock and sleep on the waitqueue.

4. When woken up, regrab the waiter lock and check whether the latest
   flush timestamp is later than my start time. If so, return the
   latest result; if not, go back to #2.

Maybe the above isn't the best way to do it, but you get the general
idea. When you have that many concurrent readers, most of them won't
need to actually flush.

Thanks.

--
tejun
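
To make the steps above concrete, here is a minimal userspace sketch of
the batching idea, with pthread mutexes and a condition variable
standing in for the kernel's locks and waitqueue. All of the names here
(read_path_flush(), do_flush(), latest_flush, clock_seq) are made up
for illustration, and do_flush() is only a stub for the real
cgroup_rstat_flush(); this is a sketch of the scheme, not the actual
rstat code.

/*
 * Userspace sketch of the read-path flush batching described above.
 * pthread mutexes and a condition variable stand in for the kernel's
 * locks and waitqueue; all names are hypothetical.
 */
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t waiter_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;

static uint64_t latest_flush;	/* start time of last completed flush */
static uint64_t clock_seq;	/* monotonic stand-in for a timestamp */

static void do_flush(void)
{
	/* stand-in for the actual (expensive) flush */
}

static void read_path_flush(void)
{
	uint64_t my_start;

	/* 1. Grab the waiter lock and remember the current timestamp. */
	pthread_mutex_lock(&waiter_lock);
	my_start = ++clock_seq;

	for (;;) {
		/*
		 * 4. A flush that started at or after my start time
		 * covers us; return the latest result without flushing.
		 */
		if (latest_flush >= my_start)
			break;

		/* 2. Try-lock the flush mutex. */
		if (pthread_mutex_trylock(&flush_mutex) == 0) {
			/* Drop the waiter lock and flush. */
			pthread_mutex_unlock(&waiter_lock);
			do_flush();

			/*
			 * Regrab the waiter lock, update the latest
			 * flush time to my start time, and wake up the
			 * waiters on the waitqueue.
			 */
			pthread_mutex_lock(&waiter_lock);
			latest_flush = my_start;
			pthread_mutex_unlock(&flush_mutex);
			pthread_cond_broadcast(&flush_done);
			break;
		}

		/* 3. Someone else is flushing; sleep on the waitqueue. */
		pthread_cond_wait(&flush_done, &waiter_lock);
		/* Woken up: loop back and re-check the timestamp. */
	}

	pthread_mutex_unlock(&waiter_lock);
}

Note that the flusher records its own start time rather than the flush
completion time: any waiter that started at or before the flusher is
guaranteed that a full flush began after its own start, which is exactly
the "at least one flush since the read attempt started" guarantee. With
10k concurrent readers, at most one holds the flush mutex at a time, and
a single flush satisfies every waiter that started before it.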