Re: [RFC PATCH 00/30] Code tagging framework and applications

Mel Gorman <mgorman@xxxxxxx> · Thu, 1 Sep 2022 12:05:01 +0100

On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > ===========================
> > > > > Code tagging framework
> > > > > ===========================
> > > > > Code tag is a structure identifying a specific location in the source code
> > > > > which is generated at compile time and can be embedded in an application-
> > > > > specific structure. Several applications of code tagging are included in
> > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > latency tracking and improved error code reporting.
> > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > creates a proper library for it.
> > > > 
> > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > set and I don't think I get it.
> > > > 
> > > > What does this provide that ftrace/kprobes don't already allow?
> > > 
> > > You're kidding, right?
> > 
> > It's a valid question. From the description, the main addition that would
> > be hard to do with ftrace or probes is catching where an error code is
> > returned. A secondary addition would be catching all historical state and
> > not just state since the tracing started.
> 
> Catching all historical state is pretty important in the case of memory
> allocation accounting, don't you think?
> 

Not always. If the intent is to catch a memory leak that gets worse over
time, early boot should be sufficient. Sure, there might be drivers that leak
memory allocated at init but if it's not a growing leak, it doesn't matter.

> Also, ftrace can drop events. Not really ideal if under system load your memory
> accounting numbers start to drift.
> 

As pointed out elsewhere, attaching to the tracepoint and recording relevant
state is an option other than trying to parse a raw ftrace feed. For memory
leaks, there are already tracepoints for page allocation and free that could
be used to track allocations that are not freed at a given point in time.
There is also the kernel memory leak detector although I never had reason
to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
and it sounds like it would be expensive.

> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? Would a distribution ever enable this? If it's enabled, any
> > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > this would carry the cost without ever necessarily consuming the data.
> 
> The whole point of this is to be cheap enough to enable in production -
> especially the latency tracing infrastructure. There's a lot of value to
> always-on system visibility infrastructure, so that when a live machine starts
> to do something wonky the data is already there.
> 

Sure, there is value but nothing stops the tracepoints being attached as
a boot-time service where interested. For latencies, there are already
bpf examples for tracing individual function latency over time, e.g.
https://github.com/iovisor/bcc/blob/master/tools/funclatency.py, although
I haven't used it recently.

Live parsing of ftrace is possible, albeit expensive.
https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
tracks counts of high-order allocations and dumps a report on interrupt as
an example of live parsing ftrace and only recording interesting state. It's
not tracking state you are interested in but it demonstrates it is possible
to rely on ftrace alone and monitor from userspace. It's bit-rotted but
can be fixed with

diff --git a/monitors/watch-highorder.pl b/monitors/watch-highorder.pl
index 8c80ae79e556..fd0d477427df 100755
--- a/monitors/watch-highorder.pl
+++ b/monitors/watch-highorder.pl
@@ -52,7 +52,7 @@ my $regex_pagealloc;
 
 # Static regex used. Specified like this for readability and for use with /o
 #                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
-my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9. ]*):\s*([a-zA-Z_]*):\s*(.*)';
 my $regex_statname = '[-0-9]*\s\((.*)\).*';
 my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
 
@@ -73,6 +73,7 @@ sub generate_traceevent_regex {
 				$regex =~ s/%p/\([0-9a-f]*\)/g;
 				$regex =~ s/%d/\([-0-9]*\)/g;
 				$regex =~ s/%lu/\([0-9]*\)/g;
+				$regex =~ s/%lx/\([0-9a-zA-Z]*\)/g;
 				$regex =~ s/%s/\([A-Z_|]*\)/g;
 				$regex =~ s/\(REC->gfp_flags\).*/REC->gfp_flags/;
 				$regex =~ s/\",.*//;

Example output

3 instances order=2 normal gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO
 => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
 => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
 => kmalloc_large_node+0x3f/0x80 <ffffffffb1d1cd3f>
 => __kmalloc_node+0x321/0x420 <ffffffffb1d22351>
 => kvmalloc_node+0x46/0xe0 <ffffffffb1ca4906>
 => ttm_sg_tt_init+0x88/0xb0 [ttm] <ffffffffc03a02c8>
 => amdgpu_ttm_tt_create+0x4f/0x80 [amdgpu] <ffffffffc04cff0f>
 => ttm_tt_create+0x59/0x90 [ttm] <ffffffffc03a03b9>
 => ttm_bo_handle_move_mem+0x7e/0x1c0 [ttm] <ffffffffc03a0d9e>
 => ttm_bo_validate+0xc5/0x140 [ttm] <ffffffffc03a2095>
 => ttm_bo_init_reserved+0x17b/0x200 [ttm] <ffffffffc03a228b>
 => amdgpu_bo_create+0x1a3/0x470 [amdgpu] <ffffffffc04d36c3>
 => amdgpu_bo_create_user+0x34/0x60 [amdgpu] <ffffffffc04d39c4>
 => amdgpu_gem_create_ioctl+0x131/0x3a0 [amdgpu] <ffffffffc04d94f1>
 => drm_ioctl_kernel+0xb5/0x140 <ffffffffb21652c5>
 => drm_ioctl+0x224/0x3e0 <ffffffffb2165574>
 => amdgpu_drm_ioctl+0x49/0x80 [amdgpu] <ffffffffc04bd2d9>
 => __x64_sys_ioctl+0x8a/0xc0 <ffffffffb1d7c2da>
 => do_syscall_64+0x5c/0x90 <ffffffffb253016c>
 => entry_SYSCALL_64_after_hwframe+0x63/0xcd <ffffffffb260009b>

3 instances order=1 normal gfp_flags=GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
 => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
 => __folio_alloc+0x17/0x50 <ffffffffb1cef1a7>
 => vma_alloc_folio+0x8f/0x350 <ffffffffb1d11e4f>
 => __handle_mm_fault+0xa1e/0x1120 <ffffffffb1cc80ee>
 => handle_mm_fault+0xb2/0x280 <ffffffffb1cc88a2>
 => do_user_addr_fault+0x1b9/0x690 <ffffffffb1a89949>
 => exc_page_fault+0x67/0x150 <ffffffffb2534627>
 => asm_exc_page_fault+0x22/0x30 <ffffffffb2600b62>
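
The first hunk of the patch above can be checked in isolation. What follows
is a minimal Python sketch, not part of watch-highorder.pl (which is Perl).
The two patterns are copied from the diff. The sample trace line is made up
to resemble current trace output, where a flags column of dots sits between
the CPU and the timestamp, and the names OLD_REGEX, NEW_REGEX, SAMPLE_LINE
and parse_event are hypothetical.

```python
import re

# Patterns copied from the diff: the only change is the extra space in the
# character class of the third (timestamp) group.
OLD_REGEX = r"\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)"
NEW_REGEX = r"\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9. ]*):\s*([a-zA-Z_]*):\s*(.*)"

# Assumption: an invented line shaped like recent trace output, with a
# flags column (".....") in front of the timestamp.
SAMPLE_LINE = (
    "            Xorg-1234    [003] .....  1234.567890: mm_page_alloc: "
    "page=000000001234abcd pfn=0x1a2b3 order=2 migratetype=0 "
    "gfp_flags=GFP_KERNEL"
)


def parse_event(pattern, line):
    """Return the captured fields as a tuple, or None if the line does not match."""
    match = re.match(pattern, line)
    return match.groups() if match else None


old_fields = parse_event(OLD_REGEX, SAMPLE_LINE)
new_fields = parse_event(NEW_REGEX, SAMPLE_LINE)
```

With this sample the old pattern fails to match because of the whitespace
between the flags column and the timestamp, while the patched pattern
captures both in the third group. Neither pattern would match a flags
column that contains letters.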

watch-highorder.pl is not tracking leaks because that is not what I was
interested in at the time, but the same method could do so by recording
the PFNs that were allocated but not freed, along with their call sites.
These days, this approach may be a bit unexpected but it was originally
written 13 years ago. It could have been done with systemtap back then
but my recollection was that it was difficult to keep systemtap working
with rc kernels.
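
A rough sketch of that leak-tracking idea, in Python rather than Perl and
not taken from mmtests: remember the stack for each allocated PFN and forget
it again when the PFN is freed. It assumes mm_page_alloc and mm_page_free
events that print "pfn=0x<hex>" and are followed by " => " stack lines, as
in the example output above. The function name, the sample lines and the
simplified event regex are all hypothetical.

```python
import re

# Simplified event matcher (assumption): task-pid, [cpu], flags/timestamp,
# then one of the two page events and its details.
EVENT_RE = re.compile(
    r"\s*\S+\s+\[\d+\]\s+[0-9. ]*:\s*(mm_page_alloc|mm_page_free):\s*(.*)"
)
PFN_RE = re.compile(r"pfn=0x([0-9a-f]+)")


def outstanding_allocations(lines):
    """Map each PFN that was allocated but not freed to its recorded stack."""
    live = {}       # pfn -> list of stack frames seen after the alloc event
    current = None  # pfn whose stack lines are currently being collected
    for line in lines:
        if line.lstrip().startswith("=>"):
            # Stack line: attach it to the most recent allocation, if any.
            if current is not None:
                live[current].append(line.split("=>", 1)[1].strip())
            continue
        current = None
        event = EVENT_RE.match(line)
        if not event:
            continue
        pfn_field = PFN_RE.search(event.group(2))
        if not pfn_field:
            continue
        pfn = int(pfn_field.group(1), 16)
        if event.group(1) == "mm_page_alloc":
            live[pfn] = []
            current = pfn
        else:
            # mm_page_free: the page is no longer outstanding.
            live.pop(pfn, None)
    return live


# Invented sample: two allocations, the first of which is freed again.
SAMPLE = [
    " Xorg-1234 [003] ..... 100.000001: mm_page_alloc: page=00000000aaaaaaaa pfn=0x1000 order=2 migratetype=0 gfp_flags=GFP_KERNEL",
    " => __alloc_pages+0x188/0x250",
    " => kmalloc_large_node+0x3f/0x80",
    " Xorg-1234 [003] ..... 100.000002: mm_page_alloc: page=00000000bbbbbbbb pfn=0x2000 order=1 migratetype=0 gfp_flags=GFP_KERNEL",
    " => __alloc_pages+0x188/0x250",
    " => __folio_alloc+0x17/0x50",
    " Xorg-1234 [003] ..... 100.000003: mm_page_free: page=00000000aaaaaaaa pfn=0x1000 order=2",
]

leaked = outstanding_allocations(SAMPLE)
```

Only state since tracing started is visible this way, and a dropped free
event would show up as a false leak, so it is a starting point rather than
an accounting mechanism.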

> What we've built here is _far_ cheaper than anything that could be done
> with ftrace.
> 
> > It might be an ease-of-use thing. Gathering the information from traces
> > is tricky and would need combining multiple different elements and that
> > is development effort but not impossible.
> > 
> > Whatever, asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> 
> I think perhaps some of the expectation should be on the "ftrace for
> everything!" people to explain a: how their alternative could be even built and
> b: how it would compare in terms of performance and ease of use.
> 

Ease of use is a fair criticism as there is effort required to develop
the state tracking of in-kernel events, be it from live parsing ftrace,
attaching to tracepoints with systemtap/bpf/whatever and the like. The
disadvantage of an in-kernel implementation is three-fold. First,
it doesn't work with older kernels without backports. Second, if something
slightly different is needed then it's a kernel rebuild. Third, if the
option is not enabled in the deployed kernel config then you are relying
on the end user being willing to deploy a custom kernel. The initial
investment in doing memory leak tracking or latency tracking by attaching
to tracepoints is significant but it works with older kernels up to a point
and is less sensitive to the kernel config options selected as features
like ftrace are commonly enabled.

-- 
Mel Gorman
SUSE Labs