On Mon, Sep 11, 2023 at 1:21 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
>
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers. If we want to support that, let's
> explicitly support that by implementing better batching in the read path.
> The only guarantee you need is that there has been at least one flush since
> the read attempt started, so we can do sth like the following in the read
> path:
>
> 1. Grab a waiter lock. Remember the current timestamp.
>
> 2. Try lock flush mutex. If obtained, drop the waiter lock, flush. Regrab
>    the waiter lock, update the latest flush time to my start time, wake up
>    waiters on the waitqueue (maybe do custom wakeups based on start time?).
>
> 3. Release the waiter lock and sleep on the waitqueue.
>
> 4. When woken up, regrab the waiter lock, compare whether the latest flush
>    timestamp is later than my start time, if so, return the latest result.
>    If not, go back to #2.
>
> Maybe the above isn't the best way to do it but you get the general idea.
> When you have that many concurrent readers, most of them won't need to
> actually flush.

I am testing something conceptually similar to this, but it doesn't
depend on timestamps. I replaced the mutex with a semaphore, and added
fallback logic to unified flushing with a timeout:

static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
{
	/* Allow only one userspace reader to do a targeted flush at a time. */
	static DEFINE_SEMAPHORE(user_flush_sem, 1);

	if (atomic_read(&stats_flush_order) <= STATS_FLUSH_THRESHOLD)
		return;

	/* down_timeout() returns 0 on success, -ETIME after waiting 1ms. */
	if (!down_timeout(&user_flush_sem, msecs_to_jiffies(1))) {
		do_stats_flush(memcg);
		up(&user_flush_sem);
	} else {
		do_unified_stats_flush(true);
	}
}

In do_unified_stats_flush(), I added a wait argument. If it is set, the
caller waits for any ongoing flusher to finish before returning (but it
never attempts to flush itself, so there is no contention on the
underlying rstat lock). I implemented this using completions.

I am running some tests now, but this should bound userspace read
latency to 1ms + the unified flush time: we first attempt to flush only
our own subtree, and if we cannot acquire the semaphore within 1ms, we
fall back to unified flushing.

Another benefit I am seeing here is that I tried switching in-kernel
flushers to also use the completion in do_unified_stats_flush().
Instead of skipping entirely when someone else is flushing, they just
wait for that flusher to finish (without being serialized on the
semaphore or contending on the rstat lock). I see no regressions in my
parallel reclaim benchmark. This should make sure no one ever skips a
flush, while still avoiding too much serialization/contention. I
suspect this will make reclaim heuristics (and other in-kernel
flushers) slightly more accurate.

I will run Wei's benchmark next to see how userspace read latency is
affected.
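For reference, a minimal sketch of what the completion-based
do_unified_stats_flush() could look like. The names
stats_unified_flush_ongoing and stats_unified_flush_done are
hypothetical, and the ordering around reinit_completion() and
complete_all() is simplified (a real implementation would need to close
the races between a finishing flusher and the next flusher
reinitializing the completion):

static atomic_t stats_unified_flush_ongoing;
static DECLARE_COMPLETION(stats_unified_flush_done);

static void do_unified_stats_flush(bool wait)
{
	/*
	 * Only one unified flusher at a time. Everyone else either skips,
	 * or, if @wait is set, sleeps until the ongoing flush finishes,
	 * without ever touching the underlying rstat lock.
	 */
	if (atomic_read(&stats_unified_flush_ongoing) ||
	    atomic_xchg(&stats_unified_flush_ongoing, 1)) {
		if (wait)
			wait_for_completion_interruptible(&stats_unified_flush_done);
		return;
	}

	reinit_completion(&stats_unified_flush_done);

	do_stats_flush(root_mem_cgroup);

	atomic_set(&stats_unified_flush_ongoing, 0);
	/* Wake up all waiters (userspace readers and in-kernel flushers). */
	complete_all(&stats_unified_flush_done);
}

The atomic_read() before the atomic_xchg() just avoids writing the
shared cacheline when a flush is already in flight.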
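For comparison, the flush-timestamp/waitqueue scheme from the quoted
message might look roughly like the below (again, all names are made up
and the timestamp handling is only a sketch):

static DEFINE_SPINLOCK(stats_waiter_lock);
static DEFINE_MUTEX(stats_flush_mutex);
static DECLARE_WAIT_QUEUE_HEAD(stats_flush_waitq);
static u64 stats_last_flush_time;

static void mem_cgroup_user_flush_stats_batched(void)
{
	u64 start = ktime_get_ns();
	u64 flush_start;

	spin_lock(&stats_waiter_lock);
	/* Loop until some flush that started after us has finished (step 4). */
	while (stats_last_flush_time < start) {
		if (!mutex_trylock(&stats_flush_mutex)) {
			/* Someone else is flushing; sleep on the waitqueue
			 * until a fresh enough flush completes (step 3). */
			spin_unlock(&stats_waiter_lock);
			wait_event(stats_flush_waitq,
				   READ_ONCE(stats_last_flush_time) >= start);
			spin_lock(&stats_waiter_lock);
			continue;
		}

		/* We won the flush mutex; do the flush ourselves (step 2). */
		spin_unlock(&stats_waiter_lock);
		flush_start = ktime_get_ns();
		do_stats_flush(root_mem_cgroup);
		spin_lock(&stats_waiter_lock);
		if (stats_last_flush_time < flush_start)
			stats_last_flush_time = flush_start;
		mutex_unlock(&stats_flush_mutex);
		wake_up_all(&stats_flush_waitq);
	}
	spin_unlock(&stats_waiter_lock);
}

With 10K parallel readers, most of them end up in the wait_event() path
and are satisfied by a single flush, which is the batching effect
described in the quoted message.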