+Cc Douglas On 07/10/22 17:01, Marcelo Tosatti wrote: > Hi Valentin, > > On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote: >> Background >> ========== >> >> As for the targeted CPUs, the existing tracepoint does export them, albeit in >> cpumask form, which is quite inconvenient from a tooling perspective. For >> instance, as far as I'm aware, it's not possible to do event filtering on a >> cpumask via trace-cmd. > > https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html > > -f filter > Specify a filter for the previous event. This must come after > a -e. This will filter what events get recorded based on the > content of the event. Filtering is passed to the kernel > directly so what filtering is allowed may depend on what > version of the kernel you have. Basically, it will let you > use C notation to check if an event should be processed or > not. > > ==, >=, <=, >, <, &, |, && and || > > The above are usually safe to use to compare fields. > > This looks overkill to me (consider large number of bits set in mask). > > +#define trace_ipi_send_cpumask(callsite, mask) do { \ > + if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \ > + int cpu; \ > + for_each_cpu(cpu, mask) \ > + trace_ipi_send_cpu(callsite, cpu); \ > + } \ > +} while (0) > Indeed, I expected pushback on this :-) I went for this due to how much simpler an int is to process/use compared to a cpumask. There is the trigger example I listed above, but the consumption of the trace event itself as well. Consider this event collected on an arm64 QEMU instance (output from trace-cmd) <...>-234 [001] 37.251567: ipi_raise: target_mask=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004 (Function call interrupts) That sort of formatting has been an issue downstream for things like LISA [1] where events are aggregated into Pandas tables, and we need to play silly games for performance reason because bitmasks aren't a native Python type. I had a look at libtraceevent to see how this data is exposed and if the answer would be better tooling: tep_get_field_val() just yields an unsigned long long of value 0x200018, which AFAICT is just the [length, offset] thing associated with dynamic arrays. Not really usable, and I don't see anything exported in the lib to extract and use those values. tep_get_field_raw() is better, it handles the dynamic array for us and yields a pointer to the cpumask array at the tail of the record. With that it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still, this isn't a native type for many programming languages. In contrast, this is immediately readable and consumable by userspace tools <...>-234 [001] 37.250882: ipi_send_cpu: callsite=__smp_call_single_queue+0x5c target_cpu=2 Thinking out loud, it makes way more sense to record a cpumask in the tracepoint, but perhaps we could have a postprocessing step to transform those into N events each targeting a single CPU? [1]: https://github.com/ARM-software/lisa/blob/37b51243a94b27ea031ff62bb4ce818a59a7f6ef/lisa/trace.py#L4756 > >> >> Because of the above points, this is introducing a new tracepoint. >> >> Patches >> ======= >> >> This results in having trace events for: >> >> o smp_call_function*() >> o smp_send_reschedule() >> o irq_work_queue*() >> >> This is incomplete, just looking at arm64 there's more IPI types that aren't covered: >> >> IPI_CPU_STOP, >> IPI_CPU_CRASH_STOP, >> IPI_TIMER, >> IPI_WAKEUP, >> >> ... But it feels like a good starting point. > > Can't you have a single tracepoint (or variant with cpumask) that would > cover such cases as well? > > Maybe (as parameters for tracepoint): > > * type (reschedule, smp_call_function, timer, wakeup, ...). > > * function address: valid for smp_call_function, irq_work_queue > types. > That's a good point, I wasn't sure about having a parameter serving as discriminant for another, but the function address would be either valid or NULL which is fine. So perhaps: o callsite (i.e. _RET_IP_), serves as type o address of callback tied to IPI, if any o target CPUs >> Another thing worth mentioning is that depending on the callsite, the _RET_IP_ >> fed to the tracepoint is not always useful - generic_exec_single() doesn't tell >> you much about the actual callback being sent via IPI, so there might be value >> in exploding the single tracepoint into at least one variant for smp_calls. > > Not sure i grasp what you mean by "exploding the single tracepoint...", > but yes knowing the function or irq work function is very useful. > Sorry; I meant having several "specialized" tracepoints instead of a single one.