On 11/5/24 12:26 PM, Alexei Starovoitov wrote:
On Mon, Nov 4, 2024 at 7:37 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
On 11/4/24 6:51 PM, Alexei Starovoitov wrote:
On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
In the previous patch, tracing progs are enabled for private stack since
recursion checking ensures that the same bpf prog cannot run nested on
the same cpu.
But it is still possible for a bpf subprog to run nested on the same cpu
if the same subprog is called from both the main prog and an async callback,
or from different async callbacks. For example,
main_prog
    bpf_timer_set_callback(timer, timer_cb);
    call sub1
sub1
    ...
timer_cb
    call sub1
In the above case, a nested run of sub1 is possible, with one invocation in
process context and the other in softirq context. If this is the case,
the verifier will disable the private stack for this bpf prog.
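For illustration only, here is a minimal (hypothetical) BPF C sketch of
that shape; the section, map layout and names below are made up for this
example and are not part of the patch:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct elem {
        struct bpf_timer t;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct elem);
} timer_map SEC(".maps");

int sum = 0;

/* sub1 is called from both the main prog (process context) and the
 * timer callback (softirq context), so nested runs on the same cpu
 * are possible.
 */
static __noinline int sub1(int i)
{
        char buf[256];

        /* the large stack usage is what would make sub1 a candidate
         * for the private stack
         */
        __builtin_memset(buf, 0, sizeof(buf));
        return buf[i & 0xff];
}

static int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
        sum += sub1(1);
        return 0;
}

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(main_prog)
{
        struct elem *e;
        int key = 0;

        e = bpf_map_lookup_elem(&timer_map, &key);
        if (!e)
                return 0;
        bpf_timer_init(&e->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
        bpf_timer_set_callback(&e->t, timer_cb);
        bpf_timer_start(&e->t, 0, 0);
        sum += sub1(0);
        return 0;
}

char _license[] SEC("license") = "GPL";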
Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
include/linux/bpf_verifier.h | 2 ++
kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
2 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0622c11a7e19..e921589abc72 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -669,6 +669,8 @@ struct bpf_subprog_info {
/* true if bpf_fastcall stack region is used by functions that can't be inlined */
bool keep_fastcall_stack: 1;
bool use_priv_stack: 1;
+ bool visited_with_priv_stack_accum: 1;
+ bool visited_with_priv_stack: 1;
u8 arg_cnt;
struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 406195c433ea..e01b3f0fd314 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
idx, subprog_depth);
return -EACCES;
}
- if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
+ if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
subprog[idx].use_priv_stack = true;
+ subprog[idx].visited_with_priv_stack = true;
+ }
+ } else {
+ subprog[idx].visited_with_priv_stack = true;
See suggestion for patch 3.
It's cleaner to rewrite with a single visited_with_priv_stack = true; statement.
Ack.
}
}
continue_func:
@@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
struct bpf_subprog_info *si = env->subprog_info;
+ enum priv_stack_mode orig_priv_stack_supported;
enum priv_stack_mode priv_stack_supported;
int ret, subtree_depth = 0, depth_frame;
priv_stack_supported = bpf_enable_priv_stack(env->prog);
+ orig_priv_stack_supported = priv_stack_supported;
if (priv_stack_supported != NO_PRIV_STACK) {
for (int i = 0; i < env->subprog_cnt; i++) {
@@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
priv_stack_supported);
if (ret < 0)
return ret;
+
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ for (int j = 0; j < env->subprog_cnt; j++) {
+ if (si[j].visited_with_priv_stack_accum &&
+ si[j].visited_with_priv_stack) {
+ /* si[j] is visited by both main/async subprog
+ * and another async subprog.
+ */
+ priv_stack_supported = NO_PRIV_STACK;
+ break;
+ }
+ if (!si[j].visited_with_priv_stack_accum)
+ si[j].visited_with_priv_stack_accum =
+ si[j].visited_with_priv_stack;
+ }
+ }
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ for (int j = 0; j < env->subprog_cnt; j++)
+ si[j].visited_with_priv_stack = false;
+ }
I cannot understand what this algorithm is doing.
What is the meaning of visited_with_priv_stack_accum?
The following is an example to show how the algorithm works.
Let us say we have a prog like

    main_prog0        si[0]
      sub1            si[1]
        sub2          si[2]
    async1            si[3]
      sub4            si[4]
        sub2          si[2]
    async2            si[5]
      sub4            si[4]
      sub5            si[6]

Seven distinct subprograms (si[0] .. si[6]), nine call sites in total.
after iteration 1 (main_prog0)
  visited_with_priv_stack_accum: si[i] = false for i = 0 ... 6
  visited_with_priv_stack:       si[0] = si[1] = si[2] = true, others false
For all i, visited_with_priv_stack_accum[i] && visited_with_priv_stack[i]
is false, so main_prog0 can use the priv stack. We then get
  visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true, others false
  visited_with_priv_stack:       cleared to all false
after iteration 2 (async1)
  visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true, others false
  visited_with_priv_stack:       si[2] = si[3] = si[4] = true, others false
Here si[2] appears in both visited_with_priv_stack_accum and
visited_with_priv_stack, so async1 cannot have a priv stack.
In my algorithm I flipped the whole prog to no_priv_stack, which is
too conservative. We should just skip async1 and continue.
So let us say async1 does not get a priv stack while main_prog0 does:
  /* the same as at the end of iteration 1 */
  visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true, others false
  visited_with_priv_stack:       cleared to all false
after iteration 3 (async2)
  visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true, others false
  visited_with_priv_stack:       si[4] = si[5] = si[6] = true, others false
There is no conflict, so async2 can use the private stack.
If we had only one bit in bpf_subprog_info, then after marking subprogs
as true while walking an async tree, and later finding a conflict that
makes the whole async tree ineligible for the priv stack, it would be
hard to undo the earlier markings.
So visited_with_priv_stack_accum accumulates the "true" results from the
main prog and the async callbacks processed so far.
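To make the walk-through concrete, here is a throwaway userspace C model
of the accumulate/compare step (this is not verifier code; the
reachability sets are hard-coded from the si[0] .. si[6] example, and it
follows the "skip only the conflicting async tree" refinement mentioned
above rather than what the patch currently does):

#include <stdbool.h>
#include <stdio.h>

#define NR_SUBPROGS 7              /* si[0] .. si[6] from the example */

static bool accum[NR_SUBPROGS];    /* visited_with_priv_stack_accum */
static bool visited[NR_SUBPROGS];  /* visited_with_priv_stack */

/* Walk one main prog or async tree: mark what it reaches, compare
 * against what earlier trees reached, accumulate if there is no
 * conflict, then reset the per-tree bits.  Returns true if this tree
 * may keep the private stack.
 */
static bool fold_tree(const int *reached, int cnt)
{
        bool conflict = false;
        int i;

        for (i = 0; i < cnt; i++)
                visited[reached[i]] = true;

        for (i = 0; i < NR_SUBPROGS; i++)
                if (accum[i] && visited[i])
                        conflict = true;    /* reachable from two trees */

        for (i = 0; i < NR_SUBPROGS; i++) {
                if (!conflict && visited[i])
                        accum[i] = true;
                visited[i] = false;
        }
        return !conflict;
}

int main(void)
{
        int main_prog0[] = {0, 1, 2};   /* main_prog0 -> sub1 -> sub2 */
        int async1[]     = {3, 4, 2};   /* async1 -> sub4 -> sub2 */
        int async2[]     = {5, 4, 6};   /* async2 -> sub4, sub5 */

        printf("main_prog0 ok: %d\n", fold_tree(main_prog0, 3));  /* 1 */
        printf("async1     ok: %d\n", fold_tree(async1, 3));      /* 0: si[2] clash */
        printf("async2     ok: %d\n", fold_tree(async2, 3));      /* 1 */
        return 0;
}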
I see. I think it works, but it feels complicated.
It feels like it should be possible to do without extra flags. Like
check_max_stack_depth_subprog() will know whether it was called
to verify an async_cb or not.
So it's just a matter of adding a single 'if' to it:

    if (subprog[idx].use_priv_stack && checking_async_cb)
        /* reset to false due to potential recursion */
        subprog[idx].use_priv_stack = false;

check_max_stack_depth() starts with i == 0,
so reachable and eligible subprogs will be marked with use_priv_stack.
Then check_max_stack_depth_subprog() will be called again
to verify the async cb. If it sees the mark, it's a bad case.
What am I missing?
First, I think we still want to mark some subprogs in the async tree
to use the private stack, right? If that is the case, then let us look
at the following example:

    main_prog:
        sub1: use_priv_stack = true
        sub2: use_priv_stack = true
    async: /* calling sub1 twice */
        sub1
            <=== here we do
                 if (subprog[idx].use_priv_stack && checking_async_cb)
                     subprog[idx].use_priv_stack = false;
        sub1
            <=== here subprog[idx].use_priv_stack is already false,
                 so we could mark use_priv_stack = true again,
                 since the logic does not keep track that sub1 has
                 already been visited.
To solve the above issue, we need one visited bit in bpf_subprog_info.
After finishing the async tree, if for any subprog
    visited_bit && subprog[idx].use_priv_stack
is true, we can mark subprog[idx].use_priv_stack = false.
So one visited bit is enough.
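Again only as a toy userspace C model of that single-visited-bit idea,
with the main_prog/sub1/sub2/async example above hard-coded (the indices
and markings here are for illustration only):

#include <stdbool.h>
#include <stdio.h>

#define NR_SUBPROGS 3              /* 0: main_prog, 1: sub1, 2: sub2 */

static bool use_priv_stack[NR_SUBPROGS];
static bool visited[NR_SUBPROGS];

int main(void)
{
        int i;

        /* main prog walk: sub1 and sub2 are deep enough and get marked */
        use_priv_stack[1] = true;
        use_priv_stack[2] = true;

        /* async tree walk: only record reachability, no matter how many
         * times sub1 shows up in that tree
         */
        visited[1] = true;

        /* after finishing the async tree: anything reached from both the
         * main walk and the async walk may run nested, so it loses the
         * private stack; sub2, reached only from the main prog, keeps it
         */
        for (i = 0; i < NR_SUBPROGS; i++)
                if (visited[i] && use_priv_stack[i])
                        use_priv_stack[i] = false;

        for (i = 0; i < NR_SUBPROGS; i++)
                printf("si[%d].use_priv_stack = %d\n", i, use_priv_stack[i]);
        return 0;
}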
The more complicated case is two async callbacks. For example:

    main_prog:
        sub1
        sub2
    async1:
        sub3
    async2:
        sub3

If async1/sub3 and async2/sub3 can be nested, then we will
need two visited bits as I have above.
If async1/sub3 and async2/sub3 cannot be nested, then one
visited bit should be enough, since we can traverse
async1/async2 with 'visited' marking and then compare against
the main prog.
So the questions would be:
1. Is it possible that two async callbacks may nest with
   each other? I actually do not know the answer.
2. Do we want to allow subprogs in an async tree to use private
   stacks?