[ Take-two because I had an extra comma at the end of my Cc list for sending via quilt ] While debugging the rtla and timerlat issue[1], I started using trace_printk() to see what was happening as the task_struct was randomly being corrupted. But the reproducer to this bug would cause a kernel crash. I wanted to try the new persistent ring buffer (which has not been merged yet), to debug this. As I was using trace_printk(), I needed the trace_printk() to write into this buffer. trace_printk() currently only writes into the top trace buffer, and I didn't want to use the trace_array_printk() for this, as it seemed too clunky. Instead, I wrote up this patch series that allows an instance created on the kernel command line to be used as the trace_printk destination buffer. To accomplish this, I needed a way to add an option to the trace_instance command line. I created "flags" to modify the trace instance after it is created. If the trace_instance has a "^<flag>" in its name, it is parsed as a flag. reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk@trace The "^traceprintk" would tell the kernel to use the boot_map instance as the location of the trace_printk(). Now when the kernel crashed, the boot instance would have the trace_printk() output. Note, when I first did this, the kernel crashed when reading the boot buffer, as the trace_printk() would default into using the trace_bprintk() which saves the pointer to the format string and the raw arguments in the buffer. On reading it, it would try to read the pointer where the format string was located at in the previous kernel and crashed. To solve this, I added a flag to the trace_array to denote that it is a boot trace buffer, and if it is set, then the binary trace_printks will call the version that does the formatting before adding it to the buffer. This may be slightly slower, but it's safer. I also noticed that using trace_printk() in the boot buffer, it would write to the buffer on the boot after the crash. This caused events to be listed in the buffer to be mixed between two kernels, and it was really confusing to read. I added another boot time flag to have the tracing instance boot with tracing disabled. trace_instance=boot_map^traceoff^traceprintk@trace With this, and enabling the printk console trace event in the boot buffer, I got nice traces like this: # trace-cmd start -B boot_map -e printk # <run test to crash kernel> [CRASH / REBOOT ] # trace-cmd show -B boot_map [..] <idle>-0 [004] dNh1. 45.277163: timerlat_irq: exit timer no restart <...>-910 [007] ....1 45.277201: wait_next_period.isra.0: hrtimer start ffff88823c7b5b28 <...>-907 [004] ....1 45.277206: wait_next_period.isra.0: hrtimer start ffff88823c635b28 <...>-905 [003] ....1 45.277240: timerlat_fd_release: OSN_VAR KTHREAD [3] timerlatu/3:905 ffff88810d5f4500 <...>-920 [003] ..... 45.277316: stop_kthread: kill timerlatu/3:905 ffff88810d5f4500 <...>-920 [003] ..... 45.277348: stop_kthread: OSN_VAR KTHREAD [4] timerlatu/4:907 ffff8881104ae780 <...>-920 [003] ..... 45.277357: stop_kthread: kill timerlatu/4:907 ffff8881104ae780 <...>-920 [003] ..... 45.277358: stop_kthread: OSN_VAR KTHREAD [5] timerlatu/5:908 ffff8881104aa280 <...>-920 [003] ..... 45.277365: stop_kthread: kill timerlatu/5:908 ffff8881104aa280 <...>-920 [003] ..... 45.277367: stop_kthread: OSN_VAR KTHREAD [6] timerlatu/6:909 ffff8881104ab3c0 <...>-920 [003] ..... 45.277375: stop_kthread: kill timerlatu/6:909 ffff8881104ab3c0 <...>-920 [003] ..... 45.277377: stop_kthread: OSN_VAR KTHREAD [7] timerlatu/7:910 ffff8881104a9140 <...>-920 [003] ..... 45.277386: stop_kthread: kill timerlatu/7:910 ffff8881104a9140 <...>-909 [006] d.h1. 45.277556: timerlat_irq: enter timer <...>-909 [006] d.h1. 45.277561: timerlat_irq: exit timer no restart <...>-908 [005] d.h1. 45.277919: timerlat_irq: enter timer <...>-908 [005] d.h1. 45.277923: timerlat_irq: exit timer no restart <...>-907 [004] d.h1. 45.278080: timerlat_irq: enter timer <...>-910 [007] d.h1. 45.278081: timerlat_irq: enter timer <...>-907 [004] d.h1. 45.278083: timerlat_irq: exit timer no restart <...>-910 [007] d.h1. 45.278084: timerlat_irq: exit timer no restart <...>-1 [002] d..1. 45.278665: console: ================================================================== <...>-1 [002] d..1. 45.280870: console: BUG: KASAN: slab-use-after-free in proc_pid_lookup+0xbf/0x210 <...>-1 [002] d..1. 45.282876: console: Write of size 4 at addr ffff88810d5f4528 by task systemd/1 <...>-1 [002] d..1. 45.284781: console: <...>-1 [002] d..1. 45.285295: console: CPU: 2 UID: 0 PID: 1 Comm: systemd Not tainted 6.11.0-rc3-test-00027-g014f473a3416-dirty #124 e44bf2780799707baf299f82ac321c3be7495d33 <...>-1 [002] d..1. 45.289265: console: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <...>-1 [002] d..1. 45.292001: console: Call Trace: <...>-1 [002] d..1. 45.292764: console: <TASK> <...>-1 [002] d..1. 45.293441: console: dump_stack_lvl+0x53/0x70 <...>-1 [002] d..1. 45.294575: console: print_report+0xc6/0x640 <...>-1 [002] d..1. 45.295672: console: ? __pfx__raw_spin_lock_irqsave+0x10/0x10 <...>-1 [002] d..1. 45.297182: console: ? proc_pid_lookup+0xbf/0x210 <...>-1 [002] d..1. 45.298407: console: kasan_report+0xc6/0x100 <...>-1 [002] d..1. 45.299496: console: ? proc_pid_lookup+0xbf/0x210 <...>-1 [002] d..1. 45.300707: console: kasan_check_range+0xf4/0x1a0 <...>-1 [002] d..1. 45.301931: console: proc_pid_lookup+0xbf/0x210 [..] And I was able to easily debug the situation. Now that I have an easy way to make the trace_printk() write to any instance, I decided to add a run time option to the instances to tell them to be the destination to the trace_printk(). That is, you don't need to add a kernel command line to make the trace_printk() write to an instance. All you need to do is set the trace_printk_dest option in the instance (note, the trace_printk option was already taken at the top level that can be used to disable trace_printk). Only one instance is allowed to be the trace_printk destination, so setting this flag in one instance will clear it in another. Also, an instance must always be the destination of the trace_printk() function, so clearing the flag in any instance will automatically cause the top level instance to be set. Trying to clear the flag of the top level instance will return -EINVAL. echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest With all these fun new ways to debug the kernel, I figured it's about time to start adding it to the Documentation directory. I created a debugging.rst file in Documentation/trace, that is the start of adding techniques in using tracing to debug your kernel. [1] https://lore.kernel.org/all/20240821160316.02c03c44@xxxxxxxxxxxxxxxxxx/ These patches are based on top of: git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git for-next Steven Rostedt (5): tracing: Add "traceoff" flag to boot time tracing instances tracing: Allow trace_printk() to go to other instance buffers tracing: Have trace_printk not use binary prints if boot buffer tracing: Add option to set an instance to be the trace_printk destination tracing/Documentation: Start a document on how to debug with tracing ---- Documentation/admin-guide/kernel-parameters.txt | 23 ++++ Documentation/trace/debugging.rst | 159 ++++++++++++++++++++++++ Documentation/trace/ftrace.rst | 12 ++ kernel/trace/trace.c | 127 ++++++++++++++++--- kernel/trace/trace.h | 4 +- kernel/trace/trace_output.c | 5 +- 6 files changed, 313 insertions(+), 17 deletions(-) create mode 100644 Documentation/trace/debugging.rst