On Fri, 24 Mar 2023 15:30:17 -0700 Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote: > As part of the discussions for user_events aligned with user space > tracers, it was determined that user programs should register a aligned > value to set or clear a bit when an event becomes enabled. Currently a > shared page is being used that requires mmap(). Remove the shared page > implementation and move to a user registered address implementation. > > In this new model during the event registration from user programs 3 new > values are specified. The first is the address to update when the event > is either enabled or disabled. The second is the bit to set/clear to > reflect the event being enabled. The third is the size of the value at > the specified address. > > This allows for a local 32/64-bit value in user programs to support > both kernel and user tracers. As an example, setting bit 31 for kernel > tracers when the event becomes enabled allows for user tracers to use > the other bits for ref counts or other flags. The kernel side updates > the bit atomically, user programs need to also update these values > atomically. > > User provided addresses must be aligned on a natural boundary, this > allows for single page checking and prevents odd behaviors such as a > enable value straddling 2 pages instead of a single page. > > When page faults are encountered they are done asyncly via a workqueue. > If the page faults back in, the write update is attempted again. If the > page cannot fault-in, then we log and wait until the next time the event > is enabled/disabled. This is to prevent possible infinite loops resulting > from bad user processes unmapping or changing protection values after > registering the address. This series looks good to me. Acked-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx> for the series. Thank you! > > Change history > > V9: > Updated documentation to include __packed attributed in example payload. > > Removed boot argument for max events, since sysctl can now be set via > kernel boot parameters. > > V8: > Rebase to 6.2-rc8. > > V7: > Rebase to 6.2-rc4. > > Added flags to register ioctl, validates it's 0 for now. Future patches > will enable other types of formats/options as needed. > > V6: > Rebase to 6.2-rc2. > > Fixed small typos, code style. > > Changed from synchronize_rcu() to queue_rcu_work() to allow an rcu > delay asyncly when mm is being removed and in an appropriate context > for mmdrop(). > > V5: > GFP_NOWAIT is still needed in user_event_enabler_dup(), due to rcu lock. > > V4: > Rebase to 6.1-rc7. > > Moved user_events_fork() out of task signal lock and dropped use of > GFP_NOWAIT. All allocations are now GFP_KERNEL or GFP_KERNEL_ACCOUNT. > > Added boot parameter user_events_max= to limit global events. > > Added sysctl value kernel.user_events_max to limit global events. > > Added cgroup tracking of memory allocated for events. > > V3: > Rebase to 6.1-rc6. > > Removed RFC tag on series. > > Updated documentation to reflect ABI changes. > > Added self-test for ABI specific clone/fork cases. > > Moved user_event_mm removal into do_exit() to ensure RSS task accounting > is done properly in async fault paths. Also lets us remove the delayed > mmdrop(), saving memory in each user_event_mm struct. > > Fixed timing window where task exits, but write could be in-progress. > During exit we now take mmap_write_lock to ensure we drain writes. > > V2: > Rebase to 6.1-rc5. > > Added various comments based on feedback. > > Added enable_size to register struct, allows 32/64 bit addresses > as long as the enable_bit fits and the address is naturally aligned. > > Changed user_event_enabler_write to accept a new flag indicating if a > fault fixup should be done or not. This allows user_event_enabler_create > to return back failures to the user ioctl reg call and retry to fault > in data. > > Added tracking fork/exec/exit of tasks to have the user_event_mm lifetime > tied more to the task than the file. This came with extra requirements > around when you can lock, such as softirq cases, as well as a RCU > pattern to ensure fork/exec/exit take minimal lock times. > > Changed enablers to use a single word-aligned value for saving the bit > to set and any flags, such as faulting asyncly or being freed. This was > required to ensure atomic bit set/test for fork cases where taking the > event_mutex is not a good scalability decision. > > Added unregister IOCTL, since file lifetime no longer limits the enable > time for any events (the mm does). > > Updated sample code to reflect the new remote write based ABI. > > Updated self-test code to reflect the new remote write based ABI. > > Beau Belgrave (11): > tracing/user_events: Split header into uapi and kernel > tracing/user_events: Track fork/exec/exit for mm lifetime > tracing/user_events: Use remote writes for event enablement > tracing/user_events: Fixup enable faults asyncly > tracing/user_events: Add ioctl for disabling addresses > tracing/user_events: Update self-tests to write ABI > tracing/user_events: Add ABI self-test > tracing/user_events: Use write ABI in example > tracing/user_events: Update documentation for ABI > tracing/user_events: Charge event allocs to cgroups > tracing/user_events: Limit global user_event count > > Documentation/trace/user_events.rst | 179 ++-- > fs/exec.c | 2 + > include/linux/sched.h | 5 + > include/linux/user_events.h | 101 ++- > include/uapi/linux/user_events.h | 81 ++ > kernel/exit.c | 2 + > kernel/fork.c | 2 + > kernel/trace/Kconfig | 5 +- > kernel/trace/trace_events_user.c | 851 +++++++++++++++--- > samples/user_events/example.c | 47 +- > tools/testing/selftests/user_events/Makefile | 2 +- > .../testing/selftests/user_events/abi_test.c | 226 +++++ > .../testing/selftests/user_events/dyn_test.c | 2 +- > .../selftests/user_events/ftrace_test.c | 162 ++-- > .../testing/selftests/user_events/perf_test.c | 39 +- > 15 files changed, 1306 insertions(+), 400 deletions(-) > create mode 100644 include/uapi/linux/user_events.h > create mode 100644 tools/testing/selftests/user_events/abi_test.c > > > base-commit: ceaa837f96adb69c0df0397937cd74991d5d821a > -- > 2.25.1 > -- Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>