Hello Vlastimil. I know it's too late to review, but I want to give my
opinion here.

On Tue, Oct 12, 2021 at 03:46:50PM +0200, Vlastimil Babka wrote:
> With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial
> slabs that can be promoted to cpu slab when the previous one is depleted,
> without accessing the shared partial list. A slab can be added to this list
> by 1) refill of an empty list from get_partial_node() - once we really have
> to access the shared partial list, we acquire multiple slabs to amortize
> the cost of locking, and 2) first free to a previously full slab - instead
> of putting the slab on a shared partial list, we can more cheaply freeze it
> and put it on the per-cpu list.
>

Right. This is how we put slabs on the percpu partial list.

> To control how large a percpu partial list can grow for a kmem cache,
> set_cpu_partial() calculates a target number of free objects on each cpu's
> percpu partial list, and this can also be set by the sysfs file cpu_partial.
>
> However, the tracking of the actual number of objects is imprecise, in
> order to limit overhead from cpu X freeing an object to a slab on the
> percpu partial list of cpu Y.
>
> Basically, the percpu partial slabs form a single linked list, and when we
> add a new slab to the list with current head "oldpage", we set in the
> struct page of the slab we're adding:
>
> page->pages = oldpage->pages + 1; // this is precise
> page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
> page->next = oldpage;
>
> Thus the real number of free objects in the slab (objects - inuse) is only
> determined at the moment of adding the slab to the percpu partial list, and
> further freeing doesn't update the pobjects counter nor propagate it to the
> current list head. As Jann reports [1], this can easily lead to large
> inaccuracies, where the target number of objects (up to 30 by default) can
> translate to the same number of (empty) slab pages on the list. In case 2)
> above, we put a slab with 1 free object on the list, thus only increasing
> page->pobjects by 1, even if there are subsequent frees on the same slab.
> Jann has noticed this in practice and so did we [2] when investigating a
> significant increase of kmemcg usage after switching from SLAB to SLUB.

Indeed. SLUB could grow the percpu partial list far too much before this
patch, especially in case 2): when each slab contributes only one object to
pobjects, the "number of objects" effectively becomes the number of slabs,
which is much too imprecise.

> While this is no longer a problem in kmemcg context thanks to the
> accounting rewrite in 5.9, the memory waste is still not ideal and it's
> questionable whether it makes sense to perform free object count based
> control when object counts can so easily become inaccurate. So this patch
> converts the accounting to be based on number of pages only (which is
> precise) and removes the page->pobjects field completely. This is also
> ultimately simpler.
>
> To retain the existing set_cpu_partial() heuristic, first calculate the
> target number of objects as previously, but then convert it to a target
> number of pages by assuming the pages will be half-filled on average. This
> assumption might obviously also be inaccurate in practice, but cannot
> degrade to the actual number of pages being equal to the target number of
> objects.

I have to agree that the half-filled assumption works pretty well, and I
believe the too-long-partial-list problem is gone: after this patch we are
clearly controlling the list's length.
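To make the half-filled conversion concrete, here is a small userspace
sketch of how I read the new heuristic. It is only my own illustration,
not the patch itself, and the round-up is my assumption:

#include <stdio.h>

/*
 * Toy model of the conversion: given a target number of free objects
 * (computed by set_cpu_partial() from the object size, up to 30 by
 * default per the commit message above) and the number of objects a
 * full slab holds, derive the page limit under the assumption that
 * each slab on the percpu partial list is half-filled.
 */
static unsigned int objects_to_pages(unsigned int nr_objects,
				     unsigned int objs_per_slab)
{
	/*
	 * Each page contributes objs_per_slab / 2 free objects on
	 * average, so we need 2 * nr_objects / objs_per_slab pages,
	 * rounded up.
	 */
	return (nr_objects * 2 + objs_per_slab - 1) / objs_per_slab;
}

int main(void)
{
	/* e.g. a 30-object target and 32 objects per full slab */
	unsigned int nr_objects = 30, objs_per_slab = 32;

	printf("target objects = %u -> page limit = %u\n",
	       nr_objects, objects_to_pages(nr_objects, objs_per_slab));
	return 0;
}

With these example numbers a 30-object target becomes a 2-page limit, and
however the per-slab free counts later change, the limit remains an exact
page count.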
But my one concern is that the actual number of objects held on the percpu
partial list can decrease when we cannot allocate high-order pages. For
example, if oo_order(s->oo) is 3 but we can only allocate order-2 pages,
each slab holds only half as many objects, so the list's effective object
capacity can be halved in the worst case, because the accounting logic
assumes every slab on the percpu partial list has order oo_order(s->oo).
I think keeping object based accounting, while assuming every slab is
half-filled, would be more consistent when the system is highly fragmented.

Thoughts?

--
Thank you, You are awesome!
Hyeonggon :-)
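P.S. A toy calculation of the effect I mean; the numbers (64 objects per
order-3 slab, a 4-page cpu_partial limit) are made up just for illustration:

#include <stdio.h>

int main(void)
{
	unsigned int objs_order3 = 64;	/* objects per slab at oo_order(s->oo) == 3 */
	unsigned int objs_order2 = 32;	/* objects per slab after order-2 fallback */
	unsigned int page_limit = 4;	/* page-based cpu_partial limit */

	/*
	 * Free objects the heuristic expects (half-filled, full-order
	 * slabs) vs. what the same page limit holds after the fallback.
	 */
	printf("expected: %u, with order-2 fallback: %u\n",
	       page_limit * objs_order3 / 2, page_limit * objs_order2 / 2);
	return 0;
}

So the same page limit covers only half as many free objects once the
allocator has to fall back to order-2 pages.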