Re: [PATCH bpf-next v2] selftests/bpf: emit top frequent code lines in veristat

Eduard Zingerman <eddyz87@xxxxxxxxx> · Wed, 18 Sep 2024 19:15:05 -0700

On Wed, 2024-09-18 at 21:39 +0100, Mykyta Yatsenko wrote:
> From: Mykyta Yatsenko <yatsenko@xxxxxxxx>
> 
> Production BPF programs are increasing in number of instructions and states
> to the point, where optimising verification process for them is necessary
> to avoid running into instruction limit. Authors of those BPF programs
> need to analyze verifier output, for example, collecting the most
> frequent source code lines to understand which part of the program has
> the biggest verification cost.
> 
> This patch introduces `--top-src-lines` flag in veristat.
> `--top-src-lines=N` makes veristat output N the most popular sorce code
> lines, parsed from verification log.
> 
> An example:
> ```
> $ sudo ./veristat --log-size=1000000000 --top-src-lines=4  pyperf600.bpf.o
> Processing 'pyperf600.bpf.o'...
> Top source lines (on_event):
>  4697: (pyperf.h:0)	 
>  2334: (pyperf.h:326)	event->stack[i] = *symbol_id; 
>  2334: (pyperf.h:118)	pidData->offsets.String_data); 
>  1176: (pyperf.h:92)	bpf_probe_read_user(&frame->f_back, 
> ...
> ```
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>

I think this is a cool feature!
It's a bit of a shame that we don't collect information like this in
the verifier itself, where it would be simpler to do (e.g. associate a
counter with each instruction, or with each jump target).

[...]

> +static int print_top_src_lines(char * const buf, size_t buf_sz, const char *prog_name)
> +{
> +	int lines_cap = 1;
> +	int lines_size = 0;
> +	char **lines;
> +	char *line = NULL;
> +	char *state;
> +	struct line_cnt *freq = NULL;
> +	struct line_cnt *cur;
> +	int unique_lines;
> +	int err;

Note:
  when compiling with clang 20.0.0git the following warning is reported:

veristat.c:957:14: error: variable 'err' is used uninitialized whenever 'for' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
  957 |         for (i = 0; i < min(unique_lines, env.top_src_lines); ++i) {
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
veristat.c:972:9: note: uninitialized use occurs here
  972 |         return err;
      |
  ...
veristat.c:903:9: note: initialize the variable 'err' to silence this warning
  903 |         int err;
      |                ^
      |                 = 0

Also, a nitpick: declarations should be sorted in a "reverse Christmas
tree" order (at-least that's what Andrii enforces :).

> +	int i;
> +
> +	lines = calloc(lines_cap, sizeof(char *));

Nitpick: here and in a few places below use sizeof(*<array>), e.g.:
         calloc(lines_cap, sizeof(*lines))

> +	if (!lines)
> +		return -ENOMEM;
> +
> +	while ((line = strtok_r(line ? NULL : buf, "\n", &state))) {
> +		if (strncmp(line, "; ", 2))
> +			continue;
> +		line += 2;
> +
> +		if (lines_size == lines_cap) {
> +			char **tmp;
> +
> +			lines_cap *= 2;
> +			tmp = realloc(lines, lines_cap * sizeof(char *));
> +			if (!tmp) {
> +				err = -ENOMEM;
> +				goto cleanup;
> +			}
> +			lines = tmp;
> +		}
> +		lines[lines_size] = line;
> +		lines_size++;
> +	}
> +
> +	if (!lines_size)
> +		goto cleanup;
> +
> +	qsort(lines, lines_size, sizeof(char *), str_cmp);
> +
> +	freq = calloc(lines_size, sizeof(struct line_cnt));
> +	if (!freq) {
> +		err = -ENOMEM;
> +		goto cleanup;
> +	}
> +
> +	cur = freq;
> +	cur->line = lines[0];
> +	cur->cnt = 1;
> +	for (i = 1; i < lines_size; ++i) {
> +		if (strcmp(lines[i], cur->line)) {
> +			cur++;
> +			cur->line = lines[i];
> +			cur->cnt = 0;
> +		}
> +		cur->cnt++;
> +	}
> +	unique_lines = cur - freq + 1;
> +
> +	qsort(freq, unique_lines, sizeof(struct line_cnt), line_cnt_cmp);
> +
> +	printf("Top source lines (%s):\n", prog_name);
> +	for (i = 0; i < min(unique_lines, env.top_src_lines); ++i) {
> +		char *src_code;
> +		char *src_line;
> +
> +		src_code = strtok_r(freq[i].line, "@", &state);

Does verifier guarantee presence of '@' for each source comment line?

> +		src_line = strtok_r(NULL, "\0", &state);
> +		if (src_line)

The '.line' string is null-terminated, can 'src_line' ever be NULL?

> +			printf("%5d: (%s)\t%s\n", freq[i].cnt, src_line + 1, src_code);
> +		else
> +			printf("%5d: %s\n", freq[i].cnt, src_code);
> +	}
> +
> +cleanup:
> +	free(freq);
> +	free(lines);
> +	return err;
> +}
> +

[...]