Re: [PATCH] mm/list_lru: allocate on first insert instead of allocation

On Sun, Mar 2, 2025 at 2:33 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 2/28/25 12:38, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@xxxxxxxxxxx>
> >
> > It is observed that every execution of the memcg_slab_post_alloc_hook
> > function and the zswap_store function also executes xa_load once,
> > through the following paths, which adds unnecessary overhead. This
> > patch optimizes that code: xa_load is executed only once, when a new
> > mlru is inserted into the list_lru, and is not repeated for subsequent
> > slab requests of the same type.
> >
> > __memcg_slab_post_alloc_hook
> > ->memcg_list_lru_alloc
> > ->->memcg_list_lru_allocated
> > ->->->xa_load
> >
> > zswap_store
> > ->memcg_list_lru_alloc
> > ->->memcg_list_lru_allocated
> > ->->->xa_load
>
> How do you know it's xa_load itself that's the issue?
>
> I think you might be able to eliminate some call overhead easily:
> - move list_lru_memcg_aware() and memcg_list_lru_allocated() to list_lru.h
> - make memcg_list_lru_alloc() also a static inline in list_lru.h, so it does
> the list_lru_memcg_aware() and memcg_list_lru_allocated() checks inline (can
> be even likely()) and then call __memcg_list_lru_alloc() which is renamed
> from the current memcg_list_lru_alloc() but the checks moved away.
>
> The result is that callers of memcg_list_lru_alloc() will (in the likely
> case) only perform a direct call to xa_load() in xarray.c and not an
> additional call through memcg_list_lru_alloc() in list_lru.c.
>

Hi all,

I think Jingxiang's test with a different number of cgroups showed that
the xa_load here is indeed an overhead. And the patch actually removes
more than that: the objcg lookup and the cgroup ref pinning go away,
along with an extra "lru" argument.

Still, I'm hoping there is another way to avoid all of this instead of
just removing part of the overhead. E.g. maybe change the API of
list_lru_add: add a variant that uses GFP_NOWAIT for atomic contexts,
and on failure let the caller try again after calling an allocation
helper in an unlocked context, as sketched below. There aren't too many
users of list_lru_add, so doing the refactor from that side seems
cleaner, though I'm not sure it's doable.
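
To make that idea a bit more concrete, something along these lines
(purely a sketch; list_lru_add_nowait() and list_lru_prealloc_memcg()
are made-up names, not existing API):

	/* Hypothetical variants, sketching the idea above. */
	int list_lru_add_nowait(struct list_lru *lru, struct list_head *item,
				int nid, struct mem_cgroup *memcg);
	int list_lru_prealloc_memcg(struct list_lru *lru,
				    struct mem_cgroup *memcg, gfp_t gfp);

	/* How a caller that holds a spinlock could use them. */
	static int example_add(struct list_lru *lru, struct list_head *item,
			       int nid, struct mem_cgroup *memcg,
			       spinlock_t *lock)
	{
		int err;

		spin_lock(lock);
		err = list_lru_add_nowait(lru, item, nid, memcg);
		spin_unlock(lock);
		if (err != -ENOMEM)
			return err;

		/*
		 * First insert for this memcg: the GFP_NOWAIT allocation of
		 * the per-memcg lru failed, so allocate in a sleepable
		 * context and retry the add.
		 */
		err = list_lru_prealloc_memcg(lru, memcg, GFP_KERNEL);
		if (err)
			return err;

		spin_lock(lock);
		err = list_lru_add_nowait(lru, item, nid, memcg);
		spin_unlock(lock);
		return err;
	}

That would keep the sleepable allocation out of the hot path entirely,
but as said above I'm not sure every list_lru_add caller can tolerate
dropping its lock and retrying like this.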




