On Mon, Jul 8, 2024 at 8:26 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
>
> On 28/06/2024 11.39, Jesper Dangaard Brouer wrote:
> >
> >
> > On 28/06/2024 01.34, Shakeel Butt wrote:
> >> On Thu, Jun 27, 2024 at 11:18:56PM GMT, Jesper Dangaard Brouer wrote:
> >>> Avoid lock contention on the global cgroup rstat lock caused by kswapd
> >>> starting on all NUMA nodes simultaneously. At Cloudflare, we observed
> >>> massive issues due to kswapd and the specific mem_cgroup_flush_stats()
> >>> call inlined in shrink_node, which takes the rstat lock.
> >>>
> [...]
> >>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> >>> @@ -312,6 +315,45 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> >>>       spin_unlock_irq(&cgroup_rstat_lock);
> >>>   }
> >>> +#define MAX_WAIT	msecs_to_jiffies(100)
> >>> +/* Trylock helper that also checks for on ongoing flusher */
> >>> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
> >>> +{
> >>> +	bool locked = __cgroup_rstat_trylock(cgrp, -1);
> >>> +	if (!locked) {
> >>> +		struct cgroup *cgrp_ongoing;
> >>> +
> >>> +		/* Lock is contended, lets check if ongoing flusher is already
> >>> +		 * taking care of this, if we are a descendant.
> >>> +		 */
> >>> +		cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> >>> +		if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) {
> >>
> >> I wonder if READ_ONCE() and cgroup_is_descendant() needs to happen
> >> within in rcu section. On a preemptable kernel, let's say we got
> >> preempted in between them, the flusher was unrelated and got freed
> >> before we get the CPU. In that case we are accessing freed memory.
> >>
> >
> > I have to think about this some more.
> >
>
> I don't think this is necessary. We are now waiting (for completion) and
> not skipping flush, because as part of take down function
> cgroup_rstat_exit() is called, which will call cgroup_rstat_flush().
>
>
>  void cgroup_rstat_exit(struct cgroup *cgrp)
>  {
>  	int cpu;
>  	cgroup_rstat_flush(cgrp);
>

Sorry for the late response; I was traveling for a bit. I will take a
look at your most recent version shortly. But I do have a comment here.

I don't see how this addresses Shakeel's concern. IIUC, if the cgroup
was freed after READ_ONCE() (and cgroup_rstat_flush() was called), then
cgroup_is_descendant() will access freed memory. We are not holding the
lock here, so we are not preventing cgroup_rstat_flush() from being
called for the freed cgroup, right?
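
Just to illustrate what Shakeel's RCU suggestion could look like, here is
a rough, untested sketch reusing the names from your patch. Note the big
assumption: this only helps if the cgroup's memory is not released until
an RCU grace period has elapsed after it can no longer appear in
cgrp_rstat_ongoing_flusher; the wait-for-completion and lock fallback
parts are only sketched:

static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
{
	bool locked = __cgroup_rstat_trylock(cgrp, -1);

	if (!locked) {
		struct cgroup *cgrp_ongoing;
		bool ongoing_covers_us = false;

		/*
		 * Keep cgrp_ongoing's memory valid across the check so it
		 * cannot be freed between READ_ONCE() and
		 * cgroup_is_descendant(). This relies on the free path
		 * waiting an RCU grace period after the cgroup can no
		 * longer show up as the ongoing flusher.
		 */
		rcu_read_lock();
		cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
		if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing))
			ongoing_covers_us = true;
		rcu_read_unlock();

		if (ongoing_covers_us) {
			/* wait for the ongoing flusher to complete (elided) */
			return false;
		}

		/* Not covered by the ongoing flusher: take the lock. */
		__cgroup_rstat_lock(cgrp, -1);
		locked = true;
	}

	return locked;
}

If the free path does not give that grace-period guarantee, then I think
the descendant check would have to happen under the lock instead.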