On 9/15/21 07:32, David Rientjes wrote:
> On Mon, 13 Sep 2021, Vlastimil Babka wrote:
>
>> While this is no longer a problem in kmemcg context thanks to the accounting
>> rewrite in 5.9, the memory waste is still not ideal and it's questionable
>> whether it makes sense to perform free object count based control when object
>> counts can easily become so much inaccurate. So this patch converts the
>> accounting to be based on number of pages only (which is precise) and removes
>> the page->pobjects field completely. This is also ultimately simpler.
>>
>
> Thanks for the very detailed explanation, this is very timely for us.
>
> I'm wondering if we should be concerned about the memory waste even being
> possible, though, now that we have the kmemcg accounting change?
>
> IIUC, because we're accounting objects and not pages, then it *seems* like
> we could have a high number of pages but very few objects charged per
> page so this memory waste could go unconstrained from any kmemcg
> limitation.

The main problem before 5.9 was that there were separate kmem caches per
memcg, each with its own percpu partial lists, so the memory used was
determined by caches x cpus x memcgs. Now the caches are shared, so it's just
caches x cpus. What you're saying would also be true, but it's a much smaller
issue than it was before 5.9.

>> To retain the existing set_cpu_partial() heuristic, first calculate the target
>> number of objects as previously, but then convert it to target number of pages
>> by assuming the pages will be half-filled on average. This assumption might
>> obviously also be inaccurate in practice, but cannot degrade to actual number of
>> pages being equal to the target number of objects.
>>
>
> I think that's a fair heuristic.
>
>> We could also skip the intermediate step with target number of objects and
>> rewrite the heuristic in terms of pages. However we still have the sysfs file
>> cpu_partial which uses number of objects and could break existing users if it
>> suddenly becomes number of pages, so this patch doesn't do that.
>>
>> In practice, after this patch the heuristics limit the size of percpu partial
>> list up to 2 pages. In case of a reported regression (which would mean some
>> workload has benefited from the previous imprecise object based counting), we
>> can tune the heuristics to get a better compromise within the new scheme, while
>> still avoid the unexpectedly long percpu partial lists.
>>
>
> Curious if you've tried netperf TCP_RR with this change? This benchmark
> was the most significantly improved benchmark that I recall with the
> introduction of per-cpu partial slabs for SLUB. If there are any
> regressions to be introduced by such an approach, I'm willing to bet that
> it would be surfaced with that benchmark.

I'll try, thanks for the tip.
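
For reference, the objects-to-pages conversion described in the quoted patch
text can be sketched roughly as below. This is a standalone userspace
illustration, not the code from the patch: target_pages() and its parameters
are hypothetical names, and the only thing taken from the description is the
assumption that pages on the percpu partial list are half-filled on average.

```c
#include <assert.h>

/*
 * Hypothetical helper (not the kernel's code): convert a target number
 * of free objects into a target number of pages, assuming each page on
 * the percpu partial list has half of its objects free on average.
 *
 * nr_objects    - target number of free objects (object-based heuristic)
 * objs_per_page - how many objects fit into one slab page
 */
static unsigned int target_pages(unsigned int nr_objects,
				 unsigned int objs_per_page)
{
	assert(objs_per_page > 0);

	/*
	 * A half-filled page provides objs_per_page / 2 objects, so we
	 * need ceil(nr_objects / (objs_per_page / 2)) pages, which is
	 * ceil(2 * nr_objects / objs_per_page) in integer arithmetic.
	 */
	return (nr_objects * 2 + objs_per_page - 1) / objs_per_page;
}
```

With numbers picked purely for illustration, a target of 30 objects and 32
objects per page gives target_pages(30, 32) == 2. Unlike the old free object
count, the resulting limit is compared against an exact page count, so it
cannot drift the way the object estimate could.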