On 9/23/21 10:02 PM, Andrii Nakryiko wrote:
> On Thu, Sep 23, 2021 at 6:27 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>>
>> On 9/23/21 4:51 PM, Alexei Starovoitov wrote:
>>> On Mon, Sep 20, 2021 at 08:11:10AM -0700, Dave Marchevsky wrote:
>>>> The verifier currently logs some useful statistics in
>>>> print_verification_stats. Although the text log is an effective feedback
>>>> tool for an engineer iterating on a single application, it would also be
>>>> useful to enable tracking these stats in a more structured form for
>>>> fleetwide or historical analysis, which this patchset attempts to do.
>>>>
>>>> A concrete motivating usecase which came up in recent weeks:
>>>>
>>>> A team owns a complex BPF program, with various folks extending its
>>>> functionality over the years. An engineer tries to make a relatively
>>>> simple addition but encounters "BPF program is too large. Processed
>>>> 1000001 insn".
>>>>
>>>> Their changes bumped the processed insns from 700k to over the limit and
>>>> there's no obvious way to simplify. They must now consider a large
>>>> refactor in order to incorporate the new feature. What if there was some
>>>> previous change which bumped processed insns from 200k->700k which
>>>> _could_ be modified to stress the verifier less? Tracking historical
>>>> verifier stats for each version of the program over the years would
>>>> reduce the manual work necessary to find such a change.
>>>>
>>>> Although parsing the text log could work for this scenario, a solution
>>>> that's resilient to log format and other verifier changes would be
>>>> preferable.
>>>>
>>>> This patchset adds a bpf_prog_verif_stats struct - containing the same
>>>> data logged by print_verification_stats - which can be retrieved as part
>>>> of bpf_prog_info. Looking for general feedback on the approach and a few
>>>> specific areas before fleshing it out further:
>>>>
>>>> * None of my usecases require storing verif_stats for the lifetime of a
>>>>   loaded prog, but adding to bpf_prog_aux felt more correct than trying
>>>>   to pass verif_stats back as part of BPF_PROG_LOAD
>>>> * The verif_stats are probably not generally useful enough to warrant
>>>>   inclusion in fdinfo, but hoping to get confirmation before removing
>>>>   that change in patch 1
>>>> * processed_insn, verification_time, and total_states are immediately
>>>>   useful for me; the rest were added for parity with
>>>>   print_verification_stats. Can remove.
>>>> * Perhaps a version field would be useful in verif_stats in case future
>>>>   verifier changes make some current stats meaningless
>>>> * Note: the stack_depth stat was intentionally skipped to keep patch 1
>>>>   simple. Will add it if the approach looks good.
>>>
>>> Sorry for the delay. LPC consumes a lot of mental energy :)
>>>
>>> I see the value of exposing some of the verification stats as prog_info.
>>> Let's look at the list:
>>> struct bpf_prog_verif_stats {
>>>         __u64 verification_time;
>>>         __u32 insn_processed;
>>>         __u32 max_states_per_insn;
>>>         __u32 total_states;
>>>         __u32 peak_states;
>>>         __u32 longest_mark_read_walk;
>>> };
>>> verification_time is non-deterministic. It varies with frequency
>>> and run-to-run. I don't see how alerting tools can use it.
>>
>> Makes sense to me, will get rid of it.
>>
>>> insn_processed is indeed the main verification metric.
>>> By now it's well known and understood.
>>>
>>> max_states_per_insn, total_states, etc. were the metrics I've studied
>>> carefully with pruning, backtracking and pretty much every significant
>>> change I did or reviewed in the verifier. They're useful to humans
>>> and developers, but I don't see how alerting tools will use them.
>>>
>>> So it feels to me that insn_processed alone will be enough to address the
>>> monitoring goal.
>>
>> For the concrete usecase in my original message, insn_processed would be
>> enough. For the others, I thought there might be value in gathering
>> those "fleetwide" to inform verifier development, e.g.:
>>
>> "Hmm, this team's libbpf program has been regressing total_states over
>> the past few {kernel, llvm} rollouts, but they haven't been modifying it.
>> Let's try to get a minimal repro, send it to bpf@vger, and contribute to
>> selftests if it is indeed hitting a weird verifier edge case"
>>
>> So for those I'm not expecting them to be useful to alert on, or to be a
>> number that the average BPF program writer needs to care about.
>>
>> Of course this is hypothetical, as I haven't tried to gather such data
>> and look for interesting patterns. But these metrics being useful to
>> you when looking at significant verifier changes is a good sign.
>
> One reason to not add all those fields is to not end up with
> meaningless stats (in the future) in UAPI. One way to work around that
> is to make it "unstable" by providing it through a raw_tracepoint as an
> internal kernel struct.
>
> Basically, the proposal would be: add a new tracepoint for when a BPF
> program is verified, either successfully or not. As one of the
> parameters, provide a stats struct which is internal to the BPF verifier
> and is not exposed through UAPI.
>
> Such a tracepoint would actually be useful more generally as well, e.g.,
> to monitor which programs are verified in the fleet, what's the rate
> of success/failure (to detect verifier regressions), what are the stats
> (verification time actually would be good to have there, again for
> stats and detecting regressions), etc, etc.
>
> WDYT?
>

Seems reasonable to me - and attaching a BPF program to the tracepoint to
grab the data is delightfully meta :)

I'll take a pass at an alternate implementation with _just_ the tracepoint,
no prog_info or fdinfo; minimal or full stats can be added to those later if
necessary. A rough sketch of the consumer side is at the end of this mail.

>>
>>> It can be exposed to fd_info and printed by bpftool.
>>> If/when it changes with some future verifier algorithm we should be able
>>> to approximate it.
>>>
>>
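
For reference, roughly what I'm imagining on the consumer side. Everything
below is made up for illustration: the "bpf_prog_verify" tracepoint name,
its argument layout (prog pointer first, stats pointer second), and the
local bpf_verifier_stats mirror are all hypothetical, nothing like this
exists upstream yet.

  /* verif_stats.bpf.c - rough sketch only, hypothetical interface. */
  #include <linux/bpf.h>
  #include <linux/types.h>
  #include <bpf/bpf_helpers.h>

  /* Local mirror of the (hypothetical) verifier-internal stats struct. */
  struct bpf_verifier_stats {
          __u32 insn_processed;
          __u32 max_states_per_insn;
          __u32 total_states;
          __u32 peak_states;
          __u32 longest_mark_read_walk;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 1 << 20);
  } events SEC(".maps");

  SEC("raw_tracepoint/bpf_prog_verify")
  int on_prog_verify(struct bpf_raw_tracepoint_args *ctx)
  {
          struct bpf_verifier_stats *out;

          out = bpf_ringbuf_reserve(&events, sizeof(*out), 0);
          if (!out)
                  return 0;

          /* args[1] is assumed to point at the verifier-internal stats */
          bpf_probe_read_kernel(out, sizeof(*out), (void *)ctx->args[1]);
          bpf_ringbuf_submit(out, 0);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";

Userspace would just poll the ring buffer and ship samples to wherever the
fleetwide stats live; whether that's a bpftool subcommand or a custom daemon
doesn't matter much for the tracepoint itself. A success/failure code as
another tracepoint arg would cover the regression-rate monitoring you
mentioned as well.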