On Mon, Aug 5, 2024 at 7:24 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
>
> On 02/08/2024 18.10, Yosry Ahmed wrote:
> > On Fri, Aug 2, 2024 at 4:43 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
> >>
> >>
> >> On 30/07/2024 20.54, Yosry Ahmed wrote:
> >>> [..]
> >>>>
> >>>> Well... I'm still not convinced that it makes sense to have level >= 2
> >>>> be the ongoing flusher.
> >>>>
> >>>> E.g. if a level 2 cgroup becomes ongoing flusher, and kswapd starts 12
> >>>> NUMA flushes at the same time, then the code will have these 12 kswapd
> >>>> threads spin on the lock, until ongoing flusher finishes. That is likely
> >>>> what happened above (for a level 1). These 12 spinning (root) flushers
> >>>> will not recheck ongoing_flusher and will all flush the root
> >>>> (unnecessarily 11 times).
> >>>
> >>> Hmm, regardless of whether or not the level-2 cgroup becomes the
> >>> ongoing flusher, the kswapd threads will all spin on the lock anyway
> >>> since none of them can be the ongoing flusher until the level-2 cgroup
> >>> finishes. Right?
> >>>
> >>> Is the scenario you have in mind that the level-2 cgroup starts
> >>> flushing at the same time as kswapd, so there is a race on who gets to
> >>> be the ongoing flusher? In this case as well, whoever gets the lock
> >>> will be the ongoing flusher anyway.
> >>>
> >>> Not allowing whoever is holding the lock to be the ongoing flusher
> >>> based on level is only useful when we can have multiple ongoing
> >>> flushers (with lock yielding). Right?
> >>>
> >>> Perhaps I am missing something here.
> >>>
> >>>>
> >>>> So, I don't think it is a good idea to have anything other than the root
> >>>> as the ongoing flusher.
> >>>>
> >>>> Can you explain/convince me why having sub-cgroups as ongoing flusher is
> >>>> an advantage?
> >>>
> >>> I just don't see the benefit of the special casing here, as I mentioned
> >>> above. If I missed something, please let me know.
> >>>
> >>
> >> I do think you missed something. Let me try to explain this in another
> >> way. (I hope my frustration doesn't shine through.)
> >>
> >> The main purpose of the patch is/was to stop the thundering herd of
> >> kswapd threads flushing (root-cgrp) at exactly the same time, leading to
> >> lock contention. This happens all-the-time/constantly in production.
> >>
> >> The first versions (where ongoing was limited to root/level=0) solved
> >> this 100%. The patches that generalized this so that all levels can
> >> become the ongoing flusher no longer solve the problem!
> >>
> >> I hope it is clear what fails. E.g. when a level>0 cgroup becomes ongoing
> >> flusher, and 12 kswapd threads simultaneously do a level=0/root-cgrp
> >> flush, then we have 12 CPU cores spinning on the rstat lock. (These 12
> >> kswapd threads will all go through completing the flush, as they do not
> >> discover/recheck that an ongoing flush at their own level just completed.)
> >
> > I think we may be speaking past one another, let me try to clarify :)
> >
> > I agree with your assessment, all I am saying is that this restriction
> > is only needed because of lock yielding, and can be removed after that
> > IIUC.
> >
> > The problem when we allow non-root ongoing flushers now is that when
> > the kswapd threads are woken up and the first one of them gets the lock
> > and does the flush, it may find that the ongoing_flusher is already
> > set by another non-root flusher that yielded the lock.
> > In this case, the following kswapd flushers will spin on the lock
> > instead of waiting for the first kswapd to finish.
> >
> > If we remove lock yielding, then the above scenario cannot happen.
>
> I think this is where we disagree/talk past each other. Looking at the
> code, I do believe the situation *also* occurs without any lock
> yielding involved. Yes, the situation is far worse when we have lock
> yielding, but it also happens in the default case.
>
> > When the lock/mutex is held by a flusher, it is guaranteed that
> > ongoing_flusher is NULL and can be set by the flusher. In this case,
> > we should allow any cgroup to be the ongoing_flusher because there can
> > only be one anyway.
> >
>
> With the current patch proposal [V8 or V9], and
> assuming we have no lock yielding:
>
> Do we agree that 12 kswapd threads will be waiting on the lock, when a
> level>0 cgroup was the ongoing flusher when they were started?
> Then the level>0 cgroup finishes being the ongoing flusher.
> Then kswapd0 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd1 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd2 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd3 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd4 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd5 gets the lock, observes NULL as ongoing, and becomes ongoing.
> Then kswapd6 gets the lock, observes NULL as ongoing, and becomes ongoing.
> [etc]

How does preventing the level>0 cgroup from becoming the ongoing_flusher
change the above scenario? kswapd threads will still observe NULL as
ongoing and spin on the lock, right? (Still assuming no lock yielding)

>
> Please, let me know if I misunderstood my own code, and you believe this
> scenario cannot happen.
>
> When the above happens, the patch didn't solve the kswapd thundering
> herd issue that we observe in production.
>
> The point/problem is that once kswapd is waiting on the lock, the code
> doesn't re-check the ongoing flusher, so every kswapd thread will be
> spinning and every kswapd thread will need to go through the flush.
> When a kswapd thread gets the lock, it will observe ongoing as NULL,
> so it cannot detect that another level=0 flusher just was the ongoing
> one.
>
> --Jesper
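
To make the missing re-check concrete, below is a minimal userspace
sketch (pthreads) of one way to add it. This is not the actual V8/V9
kernel patch code; flush_gen, do_root_flush() and kswapd_flush() are
made-up names for illustration. Each flusher snapshots a flush
generation before contending on the lock and, after acquiring it, skips
the flush if the generation advanced, i.e. exactly the check the 12
spinning kswapd threads lack in the scenario above:

/*
 * Hypothetical userspace sketch (NOT the actual V8/V9 kernel patch) of
 * re-checking for a completed equivalent flush after waiting on the lock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long flush_gen;	/* bumped after each completed root flush */

static void do_root_flush(long id)
{
	/* stand-in for the expensive root-cgrp rstat flush */
	printf("kswapd%ld: flushing root\n", id);
}

static void *kswapd_flush(void *arg)
{
	long id = (long)arg;
	/* snapshot the generation BEFORE contending on the lock */
	unsigned long seen = __atomic_load_n(&flush_gen, __ATOMIC_ACQUIRE);

	pthread_mutex_lock(&flush_lock);
	/*
	 * Re-check after the wait: if another flusher completed a root
	 * flush while we were queued on the lock, our work is already
	 * done. This is the check the 12 spinning kswapd threads lack
	 * in the scenario described above.
	 */
	if (__atomic_load_n(&flush_gen, __ATOMIC_RELAXED) == seen) {
		do_root_flush(id);
		__atomic_store_n(&flush_gen, seen + 1, __ATOMIC_RELEASE);
	} else {
		printf("kswapd%ld: skipping, flush already done\n", id);
	}
	pthread_mutex_unlock(&flush_lock);
	return NULL;
}

int main(void)
{
	pthread_t tid[12];

	/* 12 concurrent "kswapd" flushers; only the first does real work */
	for (long i = 0; i < 12; i++)
		pthread_create(&tid[i], NULL, kswapd_flush, (void *)i);
	for (int i = 0; i < 12; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Whether this is expressed as a generation counter or as re-reading
ongoing_flusher after the lock wait, the key point is the same: the
check must be repeated after waiting on the lock, not done only before
queuing on it.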