On Tue, 29 Mar 2022 14:54:21 -0700 Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote: > On Tue, Mar 29, 2022 at 12:17:14PM -0400, Mathieu Desnoyers wrote: > > ----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@xxxxxxxxxxxxxxxxxxx wrote: > > > > >> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers > > >> mathieu.desnoyers@xxxxxxxxxxxx wrote: > > [..] > > > >> I also don't get what user-space is expected to do differently if FTRACE vs > > >> PERF is enabled, considering that it gates a writev() to a file descriptor > > >> associated with /sys/kernel/debug/tracing/user_events_data. > > >> > > > > > > The intention wasn't for user-space to check the byte other than non-zero. > > > User processes could do that, but it's more so administration tools can see > > > where the events are registered if they cannot be closed and the state of the > > > machine. > > > > Administration tools could simply use a seqfile interface and parse text. I > > don't expect those to be performance-sensitive at all. So we should not design > > the overhead-sensitive mmap'd ABI (enable flags) for administration tools. > > > > > Maybe you have a slicker way to do this, but it seems to check a bit in the > > > page would involve at least a byte read followed by a mask or shift? > > > > I suspect the minimum cost in terms of instruction would be to have the mask > > prepared in advance (when the event is registered), and therefore the instructions > > required on the user-space fast-path would be a load followed by a mask (&). > > > > > That seems more expensive than checking a byte? > > > > It is one single extra ALU instruction on the fast-path, indeed. But we have to > > be careful about optimizing too much for a non-representative microbenchmark > > and missing the big picture, which includes L1/L2 cache lines and TLB load overhead > > when tracing a real-life workloads. > > > > On modern x86 CPUs, an ALU instruction takes around 0.3ns, which is a few > > orders of magnitude faster than a cache miss. Using bit-wise rather than > > byte-wise indexing increases functional density of those enable flags by a > > factor 8, which lessens the amount of cpu cache required by that same factor. > > > > So as soon as the workload running on the system fills the CPU L1 or L2 caches, > > when the number of user events registered scales up, and when the checks for > > events enabled are done frequently enough, evicting fewer cache lines becomes > > more important than the extra instruction required to apply the mask on the > > fast-path. > > > > Ok, makes sense. This will give us more events as well, something I hear > some complaints about. > > > > > > >> I would have rather thought that tracers implemented in user-space could > > >> register > > >> themselves, and then there could be one > > >> /sys/kernel/debug/tracing/user_events_status > > >> per tracer. Considering that all kernel tracers use the same ABI to write an > > >> event, > > >> and then dispatch this event internally within the kernel to each registered > > >> tracer, I would expect to have a single memory mapping for all those (e.g. a > > >> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file). I'm not sure you understand the meaning of per-event flag, but as far as I know the reason of those per-event flags may be same reason of SDT semaphore. The event which is NOT traced from kernel(via perf or via ftrace) needs to be skipped because preparing the user-event arguments (e.g. getting a variable "name" in user-space language runtime) may take a long time. > > >> Then eventually if we have other user-space tracers such as lttng-ust with its > > >> their own user-space code performing tracing in a shared memory ring buffer, it > > >> would make sense to allow it to register its own > > >> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own > > >> indexes. > > >> > > > > > > I don't follow that. The intention is to get user processes to participate with > > > trace_events and the built-in tooling. When would a user-space tracer be used > > > instead of perf/ftrace? > > > > > > It seems like a feature request? > > > > It can very well be out of scope for the user events, and I'm fine with that. > > I was merely considering how the user events could be leveraged by tracers > > implemented purely in user-space. But if the stated goal of this feature is > > really to call into kernel tracers through a writev(), then I suspect that > > supporting purely user-space tracers is indeed out of scope. > > > > That was the goal with this ABI, are there maybe ways we can change the > ABI to accomodate this later without shutting that out? Beau, would you have any plan to make a tracer-tool application? I thought this ABI is for adding "logger" like interface for perf/ftrace (or BPF, that will be a long way). > > > > > > >> If this facility is ever used by lttng-ust to enable user-space tracing, I would > > >> not > > >> want to take the overhead of calling writev for the sake of kernel tracers if > > >> those are disabled. I think this is totally different design with the lttng-ust, which collects event data in user-space and kernel-space separately. Users will use them properly according to the situation and purpose. > > > If they were disabled the byte wouldn't be set, right? So no writev overhead. > > > > > > Seems I'm missing something. > > > > I was wondering whether we could leverage the user events bit-enable ABI for > > tracers which are not implemented within the Linux kernel (e.g. LTTng-UST). > > But I suspect it might be better for me to re-implement this in user-space > > over shared memory. > > > > Perhaps. Sorry, I actually doubt it, since it relays on the user space program which correctly works and write the event correct way (this is why I had asked you to verify the event format at writev). Or perhaps we can verify it when merging the shared pages... > > > > > >> So perhaps in the short-term there is no need to implement the user-space tracer > > >> registration ABI, but I would have expected a simple bitmap for > > >> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the > > >> bytewise index, because as far as the kernel tracers are concerned, providing > > >> the bit to tell userspace instrumentation exactly which tracers are internally > > >> enabled within the kernel does not appear to be of any use other than increasing > > >> the footprint on the actively used cpu cache lines. I agreed with this. > > >> * user_events_data page faults > > >> > > >> If my understanding is correct, when the user-space program's memory containing > > >> the payload passed to writev() to a user_events_data file descriptor is kicked > > >> out from the page cache between fault_in_iov_iter_readable and its use by the > > >> tracers due to high memory pressure, the writev() will fail with -EFAULT and > > >> the data will be discarded unless user-space somehow handles this error (which > > >> is not handled in the samples/user_events/sample.c example program). It is good > > >> that the memory is faulted in immediately before calling the tracers, but > > >> considering that it is not mlock'd, should we make more effort to ensure the > > >> tracers are able to handle page faults ? I think this should be handled by user-space implementation (retry?). > > >> Integration of the work done by Michael Jeanson and myself on faultable > > >> tracepoint > > >> would allow the tracepoint probes to take page faults. Then, further > > >> modifications > > >> in the kernel tracers would be needed to handle those page faults. Interesting. > > > Is this something that can be done later or does it require ABI changes? > > > > > > I would love to never miss data due to page faults. > > > > This is internal to the user events, tracepoint, and kernel tracers > > implementation. I don't expect this to modify the user events ABI. > > > > The only thing that might require some thinking ABI-wise is how the ring > > buffers are exposed to user-space consumers, because we would want to > > allow taking page faults between space reservation and commit. > > > > The LTTng ring buffer has been supporting this out of the box for years > > now, but this may be trickier for other kernel tracers, for which allowing > > preemption between reserve and commit has never been a requirement until > > now. > > > > I'll let Steven comment on that one. > > > > > > >> > > >> * user_reg name_args and write_index vs purely user-space tracers > > >> > > >> That part of the user event registration (event layout and ID allocation) > > >> appears > > >> to be intrinsically tied to the kernel tracers and the expected event layout. > > >> This > > >> seems fine as long as the only users we consider are the kernel tracers, but it > > >> appears to be less relevant for purely user-space tracers. Actually, tying the > > >> mmap'd event enable mechanism with the event ID and description makes me wonder > > >> whether it might be better to have LTTng-UST implement its own shared-memory > > >> based > > >> "fast-event-enabling" mechanism rather than use this user-event ABI. The other > > >> advantage of doing all of this in user-space would be to allow many instances > > >> of this bitmap to exist on a given system, e.g. one per container in a > > >> multi-container > > >> system, rather than requiring this to be a global kernel-wide singleton, and to > > >> use > > >> it from a non-privileged user. > > >> > > > > > > We have some conversation going about using namespaces/cgroups to isolation > > > containers with bitmaps/status pages. The main thing I personally want to be > > > able to do is from the root namespace see all the events in the descendents > > > easily via perf, eBPF or ftrace. > > > > > > Link: > > > https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@xxxxxxxxxx/ > > > > > > > I see that a notion close to "tracing namespaces" comes up in this thread. This is something > > I brought forward at the last Linux Plumbers Conference aiming to facilitate system call tracing > > for a given container (or from a process hierarchy). I suspect the tracing namespace could also > > be tied to a set of kernel buffers, and to a user events "domain". I think this concept could > > neatly solve many of our container-related isolation issues here. As you say, it should probably > > be a hierarchical namespace. > > > > Agreed. Great :-) > > >> Some comments about the implementation: > > [...] > > > > > >> > > >> return user_events_write_core(file, &i); > > >> } > > >> > > >> include/uapi/linux/user_events.h: > > >> > > >> struct user_reg { > > >> > > >> /* Input: Size of the user_reg structure being used */ > > >> __u32 size; > > >> > > >> /* Input: Pointer to string with event name, description and flags */ > > >> __u64 name_args; > > >> > > >> /* Output: Byte index of the event within the status page */ > > >> __u32 status_index; > > >> > > >> /* Output: Index of the event to use when writing data */ > > >> __u32 write_index; > > >> }; > > >> > > >> As this structure is expected to grow, and the user-space sample program uses > > >> "sizeof()" > > >> to figure out its size (which includes padding), I would be more comfortable if > > >> this was > > >> a packed structure rather than non-packed, because as fields are added, it's > > >> tricky to > > >> figure out from the kernel perspective whether the size received are fields that > > >> user-space > > >> is aware of, or if this is just padding. Indeed, any user-space exposed structure should be packed or correctly aligned, because those can be built on many architectures. Thank you, > > >> > > > > > > I think that would be a good idea, Steven? > > > > [leaving this for Steven to answer] > > > -- Masami Hiramatsu <mhiramat@xxxxxxxxxx>