Re: [PATCH 00/40] Memory allocation profiling

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Wed, 3 May 2023 08:09:28 -0700

On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > Memory allocation profiling infrastructure provides a low overhead
> > mechanism to make all kernel allocations in the system visible. It can be
> > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > identify memory regressions.
> >
> > To keep the overhead to the minimum, we record only allocation sizes for
> > every allocation in the codebase. With that information, if users are
> > interested in more detailed context for a specific allocation, they can
> > enable in-depth context tracking, which includes capturing the pid, tgid,
> > task name, allocation size, timestamp and call stack for every allocation
> > at the specified code location.
> [...]
> > Implementation utilizes a more generic concept of code tagging, introduced
> > as part of this patchset. Code tag is a structure identifying a specific
> > location in the source code which is generated at compile time and can be
> > embedded in an application-specific structure. A number of applications
> > for code tagging have been presented in the original RFC [1].
> > Code tagging uses the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > To profile memory allocations, we instrument page, slab and percpu
> > allocators to record total memory allocated in the associated code tag at
> > every allocation in the codebase. Every time an allocation is performed by
> > an instrumented allocator, the code tag at that location increments its
> > counter by allocation size. Every time the memory is freed the counter is
> > decremented. To decrement the counter upon freeing, allocated object needs
> > a reference to its code tag. Page allocators use page_ext to record this
> > reference while slab allocators use memcg_data (renamed into more generic
> > slabobj_ext) of the slab page.
> [...]
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@xxxxxxxxxx/
> [...]
> >  70 files changed, 2765 insertions(+), 554 deletions(-)
>
> Sorry for cutting the cover considerably but I believe I have quoted the
> most important/interesting parts here. The approach is not fundamentally
> different from the previous version [1] and there was a significant
> discussion around this approach. The cover letter doesn't summarize nor
> deal with concerns expressed previous AFAICS. So let me bring those up
> back.

Thanks for summarizing!

> At least those I find the most important:
> - This is a big change and it adds a significant maintenance burden
>   because each allocation entry point needs to be handled specifically.
>   The cost will grow with the intended coverage especially there when
>   allocation is hidden in a library code.

Do you mean with more allocations in the codebase more codetags will
be generated? Is that the concern? Or maybe as you commented in
another patch that context capturing feature does not limit how many
stacks will be captured?

> - It has been brought up that this is duplicating functionality already
>   available via existing tracing infrastructure. You should make it very
>   clear why that is not suitable for the job

I experimented with using tracing with _RET_IP_ to implement this
accounting. The major issue is the _RET_IP_ to codetag lookup runtime
overhead which is orders of magnitude higher than proposed code
tagging approach. With code tagging proposal, that link is resolved at
compile time. Since we want this mechanism deployed in production, we
want to keep the overhead to the absolute minimum.
You asked me before how much overhead would be tolerable and the
answer will always be "as small as possible". This is especially true
for slab allocators which are ridiculously fast and regressing them
would be very noticable (due to the frequent use).

There is another issue, which I think can be solved in a smart way but
will either affect performance or would require more memory. With the
tracing approach we don't know beforehand how many individual
allocation sites exist, so we have to allocate code tags (or similar
structures for counting) at runtime vs compile time. We can be smart
about it and allocate in batches or even preallocate more than we need
beforehand but, as I said, it will require some kind of compromise.

I understand that code tagging creates additional maintenance burdens
but I hope it also produces enough benefits that people will want
this. The cost is also hopefully amortized when additional
applications like the ones we presented in RFC [1] are built using the
same framework.

> - We already have page_owner infrastructure that provides allocation
>   tracking data. Why it cannot be used/extended?

1. The overhead.
2. Covers only page allocators.

I didn't think about extending the page_owner approach to slab
allocators but I suspect it would not be trivial. I don't see
attaching an owner to every slab object to be a scalable solution. The
overhead would again be of concern here.

I should point out that there was one important technical concern
about lack of a kill switch for this feature, which was an issue for
distributions that can't disable the CONFIG flag. In this series we
addressed that concern.

[1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@xxxxxxxxxx/

Thanks,
Suren.

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs