On Thu, Mar 11, 2021 at 12:52 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Hi, Butt,
>
> Shakeel Butt <shakeelb@xxxxxxxxxx> writes:
>
> > On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> From: Huang Ying <ying.huang@xxxxxxxxx>
> >>
> >> In shrink_node(), to determine whether to enable cache trim mode,
> >> the LRU size is obtained via lruvec_page_state(). That gets the
> >> value from a per-CPU counter (mem_cgroup_per_node->lruvec_stat[]).
> >> The error in the per-CPU counter, from CPU-local batching and from
> >> descendant memory cgroups, may cause some issues. We ran into this
> >> in the 0-Day performance test.
> >>
> >> 0-Day uses the RAM file system as the root file system, so the
> >> number of reclaimable file pages is very small. In the swap
> >> testing, the inactive file LRU list soon becomes almost empty, but
> >> the size of the inactive file LRU list obtained from the per-CPU
> >> counter may keep a much larger value (say, 33, 50, etc.). This
> >> enables cache trim mode, although in fact nothing can be scanned.
> >> The following pattern repeats for a long time in the test:
> >>
> >>   priority  inactive_file_size  cache_trim_mode
> >>   12        33                  0
> >>   11        33                  0
> >>   ...
> >>   6         33                  0
> >>   5         33                  1
> >>   ...
> >>   1         33                  1
> >>
> >> That is, cache_trim_mode is wrongly enabled when the scan priority
> >> decreases to 5, and the situation is not recovered from for a long
> >> time.
> >>
> >> It's hard to get a more accurate size of the inactive file list
> >> without much more overhead. It's also hard to estimate the error
> >> of the per-CPU counter, because there may be many descendant
> >> memory cgroups. But if, after the actual scanning, nothing could
> >> be scanned with cache trim mode enabled, then enabling it was
> >> wrong, and we can retry with cache trim mode disabled. This patch
> >> implements that policy.
> >
> > Instead of playing with the already complicated heuristics, we
> > should improve the accuracy of the lruvec stats. Johannes already
> > fixed the memcg stats using the rstat infrastructure, and Tejun has
> > suggestions on how to use the rstat infrastructure efficiently for
> > lruvec stats at
> > https://lore.kernel.org/linux-mm/YCFgr300eRiEZwpL@xxxxxxxxxxxxxxx/.
>
> Thanks for the information! It would be better if we could improve
> the accuracy of the lruvec stats without much overhead, but that may
> not be an easy task.
>
> If my understanding is correct, what Tejun suggested is to add a
> fast read interface to rstat to be used in the hot path, with
> accuracy similar to that of a traditional per-CPU counter. But if we
> can regularly update the lruvec rstat with something like
> vmstat_update(), that should be OK for the issue described in this
> patch.
>

This is also my understanding. Tejun, please correct us if we have
misunderstood you.

BTW Johannes was working on an rstat-based lruvec stats patch.
Johannes, are you planning to work on the optimization Tejun
suggested?
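
For reference, a rough sketch of the "retry with cache trim mode
disabled" policy described above, as I read it. This is not the
actual patch: the wrapper function and the exact progress check are
hypothetical, although sc->cache_trim_mode, sc->nr.file_taken and
shrink_node_memcgs() do exist in current mm/vmscan.c.

/*
 * Illustrative sketch only, not the actual patch: a simplified view
 * of retrying node reclaim with cache trim mode disabled when the
 * file LRUs turned out to be empty despite the per-CPU estimate.
 */
static void shrink_node_cache_trim_retry(pg_data_t *pgdat,
					 struct scan_control *sc)
{
	unsigned long file_taken = sc->nr.file_taken;

	shrink_node_memcgs(pgdat, sc);

	/*
	 * Cache trim mode was enabled based on a possibly stale
	 * per-CPU estimate of the inactive file LRU size.  If the
	 * file LRUs yielded nothing, that estimate was wrong:
	 * disable cache trim mode and retry once so anonymous pages
	 * can be reclaimed instead.
	 */
	if (sc->cache_trim_mode && sc->nr.file_taken == file_taken) {
		sc->cache_trim_mode = 0;
		shrink_node_memcgs(pgdat, sc);
	}
}

Presumably the real change hooks into shrink_node()'s existing retry
loop rather than adding a wrapper like this, but the idea should be
the same.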