Hello,

On Thu, May 23, 2024 at 12:34 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, May 15, 2024 at 9:56 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > On Wed, May 15, 2024 at 9:24 PM Howard Chu <howardchu95@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > Here is a little update on --off-cpu.
> > >
> > > > > It would be nice to start landing this work so I'm wondering what the
> > > > > minimal way to do that is. It seems putting behavior behind a flag is
> > > > > a first step.
> > >
> > > The flag to determine output threshold of off-cpu has been implemented.
> > > If the accumulated off-cpu time exceeds this threshold, output the sample
> > > directly; otherwise, save it later for off_cpu_write.
> > >
> > > But adding an extra pass to handle off-cpu samples introduces performance
> > > issues, here's the processing rate of --off-cpu sampling(with the
> > > extra pass to extract raw sample data) and without.
> > > The --off-cpu-threshold is in nanoseconds.
> > >
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | comm                                                | type                                  | process rate         |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           | 2843.247 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) | 3910.686 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     | 4661.229 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
>
> What does the process rate mean? Is the sample for the
> off-cpu event or other (cpu-cycles)? Is it from a single CPU
> or system-wide or per-task?

Process rate is just a silly name for the rate at which record__pushfn()
writes data from the ring buffer. record__pushfn() is where I added the
extra pass that strips the off-cpu samples out of the original raw samples
collected by eBPF's perf_output. With the -a option it runs on all CPUs,
system-wide.

Sorry that I only tested extreme cases. I ran perf record with
`-F 4999 -a`,
`-F 1 -a --off-cpu --off-cpu-threshold 100`,
`-F 4999 -a --off-cpu --off-cpu-threshold 100`, and
`-F 4999 -a --off-cpu --off-cpu-threshold 1000000000`.

`-F 4999 -a` produces only cpu-cycles samples. It is the fastest
(13128.675 samples/ms) at writing samples to perf.data, because there is no
extra pass for stripping extra data from BPF's raw samples.

`-F 1 -a --off-cpu --off-cpu-threshold 100` produces mostly off-cpu
samples, which take considerably more time to strip, so it is the slowest
(2843.247 samples/ms).

`-F 4999 -a --off-cpu --off-cpu-threshold 100` is half and half. It has
lots of cpu-cycles samples, so it is a little faster than the previous one
(3910.686 samples/ms), because cpu-cycles samples need no extra handling
(but there is still the cost of copying memory back and forth).

`-F 4999 -a --off-cpu --off-cpu-threshold 1000000000` is a blend of a large
number of cpu-cycles samples and only a couple of off-cpu samples. It is
the fastest of the three --off-cpu runs (4661.229 samples/ms) but still
nowhere near the original one, which doesn't have the extra pass of
off_cpu_strip().

What I'm trying to say is that stripping/handling off-cpu samples at
runtime is a bad idea; the extra pass of off_cpu_strip() should be
reconsidered.
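
To put the numbers above in perspective, here is a rough sketch that
computes how much slower each --off-cpu run is than the plain `-F 4999 -a`
run. The rates are copied from the table; the names (BASELINE, rates,
slowdown) are only for illustration and are not part of perf.

```python
# Process rates in samples/ms, copied from the table in this thread.
BASELINE = 13128.675  # -F 4999 -a, no extra pass

# Keys are the perf record options used for each --off-cpu run.
rates = {
    "-F 1 -a --off-cpu --off-cpu-threshold 100": 2843.247,
    "-F 4999 -a --off-cpu --off-cpu-threshold 100": 3910.686,
    "-F 4999 -a --off-cpu --off-cpu-threshold 1000000000": 4661.229,
}


def slowdown(rate, baseline=BASELINE):
    """Return how many times slower `rate` is than `baseline`."""
    return baseline / rate


# Slowdown factor of every --off-cpu run relative to the baseline.
slowdowns = {opts: slowdown(rate) for opts, rate in rates.items()}

for opts, factor in slowdowns.items():
    print(f"{opts}: {factor:.2f}x slower than the baseline")
```

Even the run with the 1 s threshold, which emits only a handful of direct
off-cpu samples, comes out at roughly 2.8x slower than the baseline.
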
Reading events one by one, putting samples together, and checking
sample_id and so on introduces a lot of overhead. It should be done at
save time.

By the way, the default off_cpu_write() is perfectly fine.

Sorry about the horrible data table and explanation; they will be more
readable next time.

>
> > >
> > > It's not ideal. I will find a way to reduce overhead. For example
> > > process them samples
> > > at save time as Ian mentioned.
> > >
> > > > > To turn the bpf-output samples into off-cpu events there is a pass
> > > > > added to the saving. I wonder if that can be more generic, like a save
> > > > > time perf inject.
> > >
> > > And I will find a default value for such a threshold based on performance
> > > and common use cases.
> > >
> > > > Sounds good. We might add an option to specify the threshold to
> > > > determine whether to dump the data or to save it for later. But ideally
> > > > it should be able to find a good default.
> > >
> > > These will be done before the GSoC kick-off on May 27.
> >
> > This all sounds good. 100ns seems like quite a low threshold and 1s
> > extremely high, shame such a high threshold is marginal for the
> > context switch performance change. I wonder 100 microseconds may be a
> > more sensible threshold. It's 100 times larger than the cost of 1
> > context switch but considerably less than a frame redraw at 60FPS (16
> > milliseconds).
>
> I don't know what's the sensible default. But 1 msec could be
> another candidate for the similar reason. :)

Sure, I'll test all of them and see how much overhead each one causes.

I understand that all I'm talking about is optimization, and that
premature optimization is the root of all evil. However, being almost
three times slower for only a few dozen direct off-CPU samples sounds
weird to me.

Thanks,
Howard

>
> Thanks,
> Namhyung