From: Ori Glassman <ori.glassman@xxxxxxxxxxx>
Sent: Monday, January 30, 2023 8:45 PM
To: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>; Daniel Borkmann <daniel@xxxxxxxxxxxxx>; bpf <bpf@xxxxxxxxxxxxxxx>
Cc: KP Singh <kpsingh@xxxxxxxxxxxx>
Subject: Re: Interruptible eBPF programs

> From: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>
> Sent: Monday, January 30, 2023 6:42 PM
> To: Daniel Borkmann <daniel@xxxxxxxxxxxxx>; bpf <bpf@xxxxxxxxxxxxxxx>
> Cc: Ori Glassman <ori.glassman@xxxxxxxxxxx>; security@xxxxxxxxxx <security@xxxxxxxxxx>; KP Singh <kpsingh@xxxxxxxxxxxx>; Alexei Starovoitov <ast@xxxxxxxxxx>
> Subject: Re: Interruptible eBPF programs
>
> On Mon, Jan 30, 2023 at 8:01 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >
> > On 1/30/23 1:12 PM, Ori Glassman wrote:
> > > Hi everyone,
> > >
> > > Since the patch that made eBPF programs disable process migration instead of disabling preemption, they are potentially susceptible to a reschedule mid-execution.
>
> Ori,
> security@kernel alias is not the place to ask bpf related questions.

I apologise if it was confusing; I wasn't asking a question, rather raising a security concern of mine.

> Yaniv, from your team, already brought it up here:
> https://lore.kernel.org/bpf/CAADnVQ++LzKt9Q-GtGXknVBqyMqY=vLJ3tR3NNGG3P66gvVCFQ@xxxxxxxxxxxxxx/
> You cannot assume that different bpf progs attached to various
> events like tracepoints and lsm hooks won't overlap.
> It's a bug in your program. Nothing else.

How can one use bpf_task_storage_get() without the risk of the storage getting corrupted? Say there's a module that consists of one simple program: a single LSM hook on bprm_creds_for_exec that uses a pinned task-local storage map. At some point in the future an attacker loads his tracepoint program and maliciously corrupts the local storage *while* the LSM hook is executing - not before or after. What's the bug in the program that consists of that single LSM hook?

> > Thanks for the report, Ori. We'll take a look and get back to you (also Cc'ing KP and Alexei
> > for visibility in here).
> >
> > > The bpf_perf_event_output helper, called from an eBPF program, can trigger irq_work->arch_irq_work_raise, which sends an IPI to the current CPU, which may reschedule.
> > >
> > > By hooking a tracepoint such as sched_process_free, which runs in interrupt context, an execution flow like the following may occur on a given CPU:
> > > - start executing lsm/<some_func> and reach bpf_perf_event_output within the hook code
> > > - while executing bpf_perf_event_output, get interrupted and run tp/sched_process_free
> > > - tp/sched_process_free returns and the CPU continues execution of lsm/<some_func>
> > >
> > > Using per-CPU data in this kind of environment is a known issue [1]; the same applies to the per-task local storage introduced in v5.11 [2].
> > >
> > > This is risky in general but becomes particularly dangerous in LSM modules (such as apparmor), since the block/allow logic may rely on shared storage that can be manipulated mid-execution from interrupt context, corrupting the module's decision.
> > > A harmless example of such usage can be seen in the bpf selftests [3].
> > >
> > > This becomes a vulnerability if an LSM hook uses pinned per-task local storage, which it does not expect to get corrupted mid-execution: an attacker may load a program such as tp/sched_process_free and silently bypass the LSM hook by corrupting that storage. The flaw is in the per-task local storage, which is not reliable.
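
[ For clarity: the example below assumes "per_task_map" is a pinned task-local storage map, declared roughly as follows. This is a minimal sketch; the int value type and the pin-by-name setting are my assumptions, not part of the original report. ]

    struct {
            __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);   /* mandatory for task storage */
            __type(key, int);
            __type(value, int);                     /* holds the secureexec flag */
            __uint(pinning, LIBBPF_PIN_BY_NAME);    /* pinned under /sys/fs/bpf, so other loaders can reuse it */
    } per_task_map SEC(".maps");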
> > >
> > > An example:
> > > ---------------------------------------------------------------------------
> > > // User loads the following program:
> > > SEC("lsm/bprm_creds_for_exec")
> > > int BPF_PROG(secure_exec, struct linux_binprm *bprm)
> > > {
> > >     int *secureexec;
> > >
> > >     // assume per_task_map is pinned
> > >     secureexec = bpf_task_storage_get(&per_task_map, bpf_get_current_task_btf(), 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> > >
> > >     if (secureexec && STR_EQUALS(bprm->filename, "some_virus")) // assume this condition is fulfilled
> > >         *secureexec = 1;
> > >
> > >     // secureexec is now 1
> > >     ...
> > >     bpf_perf_event_output();
> > >     ...
> > >
> > >     // secureexec is expected to be 1, but was changed from the interrupted context and is now 0
> > >     bpf_bprm_opts_set(bprm, *secureexec);
> > >
> > >     // the binary "some_virus" will run without AT_SECURE
> > >     return 0;
> > > }
> > >
> > > // The attacker code:
> > > SEC("raw_tracepoint/sched_process_free")
> > > int tracepoint__sched_sched_process_free(struct bpf_raw_tracepoint_args *ctx)
> > > {
> > >     int *secureexec;
> > >
> > >     // uses the pinned per_task_map that was defined in a different eBPF program
> > >     secureexec = bpf_task_storage_get(&per_task_map, bpf_get_current_task_btf(), 0, BPF_LOCAL_STORAGE_GET_F_CREATE);
> > >     if (secureexec)
> > >         *secureexec = 0; // always turn off secureexec
> > >
> > >     return 0;
> > > }

> These two programs access some task local storage.

I'm talking specifically about the case where both programs are executed by the same task and thus access the same local storage.

> This code is racy regardless of preempt_disable vs migrate_disable.
> bpf_task_storage_get() of the same task can run on different cpus.

Not at the same time, though, right? I'm not concerned about the case where the map is deliberately shared between multiple programs - I'm concerned about the case where it's used locally in a single program but gets corrupted, in a timely manner, from the outside by an attacker.

> Whether trace_sched_process_free and security_bprm_creds_for_exec
> can happen on different cpus is a kernel implementation detail.
>
> There looks to be another bug in the above:
> doing bpf_get_current_task_btf from raw_tracepoint/sched_process_free
> will return task_struct of the worker thread.
> I don't think it's the one you want.

That's not what I observed. This is the output of bpf_trace_printk where the execution of the LSM hook got interrupted mid-execution (a sketch of the instrumentation is included below):

----
chrony-onofflin-12460 [000] d.s.1 2258.804195: bpf_trace_printk: EXECUTION HIJACK(b=2257261931167)   # this is from tp/sched_process_free
chrony-onofflin-12460 [000] d...1 2258.804234: bpf_trace_printk: a=2257261896666. c=2257261971220    # from the lsm hook
----

> Anyway, please start a clean thread at bpf@vger with bpf questions.
> Don't spam security@kernel.

> > > ---------------------------------------------------------------------------
> > >
> > > Similarly, in cases where the LSM hook decides its return value (deny/allow) based on data from the local storage, that data can also be altered by the attacker, allowing a malicious action that the LSM module would normally block.
> > >
> > > I'd like to propose a tentative public disclosure date of 02/06/2023, 12:00 UTC.

> You must be kidding.
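
For reference, the instrumentation that produced the trace output above looked roughly like this (a reconstructed sketch - the program and map names are placeholders, not the exact code I ran):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(int));
    } events SEC(".maps");

    /* Timestamps a and c bracket the helper call that opens the window. */
    SEC("lsm/bprm_creds_for_exec")
    int BPF_PROG(lsm_probe, struct linux_binprm *bprm)
    {
            u64 a, c;

            a = bpf_ktime_get_ns();
            /* may raise an IPI on the current CPU and get interrupted */
            bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &a, sizeof(a));
            c = bpf_ktime_get_ns();

            bpf_printk("a=%llu. c=%llu", a, c);
            return 0;
    }

    /* Timestamp b: seeing a < b < c for the same task shows the hook
     * was interrupted mid-execution. */
    SEC("raw_tracepoint/sched_process_free")
    int tp_probe(struct bpf_raw_tracepoint_args *ctx)
    {
            bpf_printk("EXECUTION HIJACK(b=%llu)", bpf_ktime_get_ns());
            return 0;
    }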
> > >
> > > [1] https://lwn.net/Articles/836503/
> > > [2] https://github.com/torvalds/linux/commit/4cf1bc1f10452065a29d576fc5693fc4fab5b919
> > > [3] https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/bpf/progs/bprm_opts.c#L24
> > >
> > > Thanks,
> > > Ori
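
P.S. For completeness: the specific example above can be restructured so that the final decision never depends on a re-read of the shared storage, e.g. by keeping the flag in a stack variable. A sketch (STR_EQUALS and the bare bpf_perf_event_output() call are the same pseudo-shorthand used in the example):

    SEC("lsm/bprm_creds_for_exec")
    int BPF_PROG(secure_exec_stack, struct linux_binprm *bprm)
    {
            int secureexec = 0;   /* on the BPF stack - unreachable from other programs */

            if (STR_EQUALS(bprm->filename, "some_virus"))
                    secureexec = 1;

            bpf_perf_event_output();   /* may still be interrupted here... */

            /* ...but the decision reads the stack copy, not the shared map */
            bpf_bprm_opts_set(bprm, secureexec ? BPF_F_BPRM_SECUREEXEC : 0);
            return 0;
    }

That only sidesteps the issue for state that doesn't need to outlive the hook, though; the general concern about pinned task-local storage being writable by any later-loaded program remains.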