On 12/21/23 12:27 PM, Daniel Borkmann wrote:
On 12/21/23 3:58 PM, Martin KaFai Lau wrote:
On 12/21/23 5:21 AM, Daniel Borkmann wrote:
On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
Good catch. It will unnecessarily skip sockets in the following batch/bucket if there
are changes in the current batch/bucket.
From looking at the loop again, I think it is better not to change
iter->offset during the for loop and only update it after the for
loop has concluded. The non-zero iter->offset is only useful for the first
bucket, so do a test on the first bucket (state->bucket == bucket) before
skipping sockets. Something like this:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..a993f364d6ae 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	struct net *net = seq_file_net(seq);
 	struct udp_table *udptable;
 	unsigned int batch_sks = 0;
+	int bucket, bucket_offset;
 	bool resized = false;
 	struct sock *sk;
@@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
 	batch_sks = 0;
+	bucket = state->bucket;
+	bucket_offset = 0;

 	for (; state->bucket <= udptable->mask; state->bucket++) {
 		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

-		if (hlist_empty(&hslot2->head)) {
-			iter->offset = 0;
+		if (hlist_empty(&hslot2->head))
 			continue;
-		}

 		spin_lock_bh(&hslot2->lock);
 		udp_portaddr_for_each_entry(sk, &hslot2->head) {
@@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				/* Resume from the last iterated socket at the
 				 * offset in the bucket before iterator was stopped.
 				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (state->bucket == bucket &&
+				    bucket_offset < iter->offset) {
+					++bucket_offset;
 					continue;
 				}
 				if (iter->end_sk < iter->max_sk) {
@@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		if (iter->end_sk)
 			break;
+	}

-		/* Reset the current bucket's offset before moving to the next bucket. */
+	if (state->bucket != bucket)
 		iter->offset = 0;
-	}

 	/* All done: no batch made. */
 	if (!iter->end_sk)
I think I found another bug in the current bpf_iter_udp_batch(). The
"state->bucket--;" at the end of the batch() function is also wrong. It does
not need to go back to the previous bucket. After reallocating a larger
batch array, it should retry on the "state->bucket" as is. I tried forcing
bind() to use bucket 0 and binding a larger so_reuseport set (24 sockets), and
the WARN_ON(state->bucket < 0) triggered.
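
[Roughly the kind of repro setup, as a hypothetical userspace sketch; the
port is arbitrary and forcing a particular bucket is elided:]

	#include <stdio.h>
	#include <unistd.h>
	#include <arpa/inet.h>
	#include <sys/socket.h>
	#include <netinet/in.h>

	int main(void)
	{
		struct sockaddr_in addr = {
			.sin_family = AF_INET,
			.sin_port = htons(50000),	/* arbitrary fixed port */
			.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
		};
		int i, fd, one = 1;

		/* A 24-socket so_reuseport group lands in a single hash2
		 * bucket (same addr+port), larger than the iterator's
		 * initial batch size, so the realloc+retry path runs.
		 */
		for (i = 0; i < 24; i++) {
			fd = socket(AF_INET, SOCK_DGRAM, 0);
			if (fd < 0)
				return 1;
			if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
				       sizeof(one)) ||
			    bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
				return 1;
		}
		pause();	/* keep the sockets alive while iterating */
		return 0;
	}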
Going back to this bug (backward progress on --iter->offset), I think it is
a bit cleaner to always reset iter->offset to 0 and advance it to
the resume offset only when needed. Something like this:
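
[The diff itself was trimmed in the quote; a rough, untested sketch of that
idea, using resume_bucket/resume_offset as illustrative local names:]

	int resume_bucket, resume_offset;
	...
	resume_bucket = state->bucket;
	resume_offset = iter->offset;

	for (; state->bucket <= udptable->mask; state->bucket++) {
		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

		if (hlist_empty(&hslot2->head))
			continue;

		/* every bucket starts from offset 0 ... */
		iter->offset = 0;
		spin_lock_bh(&hslot2->lock);
		udp_portaddr_for_each_entry(sk, &hslot2->head) {
			if (seq_sk_match(seq, sk)) {
				/* ... and sockets are only skipped while
				 * still on the bucket where the iterator
				 * was stopped
				 */
				if (state->bucket == resume_bucket &&
				    iter->offset < resume_offset) {
					iter->offset++;
					continue;
				}
				...
			}
		}
		...
	}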
Hm, my assumption was.. why not do something like the below, and fully start
over? I'm mostly puzzled about the side effects here, in particular whether
for the rerun the sockets in the bucket could already have changed.. maybe
I'm still missing something - what exactly do we need to deal with in the
worst case when we need to go and retry everything, and what guarantees do
we have?

(only compile tested)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..ca62a4bb7bec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	struct udp_iter_state *state = &iter->state;
 	struct net *net = seq_file_net(seq);
 	struct udp_table *udptable;
-	unsigned int batch_sks = 0;
+	int orig_bucket, orig_offset;
+	unsigned int i, batch_sks = 0;
 	bool resized = false;
 	struct sock *sk;
@@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	}

 	udptable = udp_get_table_seq(seq, net);
-
+	orig_bucket = state->bucket;
+	orig_offset = iter->offset;
 again:
 	/* New batch for the next bucket.
 	 * Iterate over the hash table to find a bucket with sockets matching
@@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
 		resized = true;
 		/* After allocating a larger batch, retry one more time to grab
-		 * the whole bucket.
+		 * the whole bucket. Drop the current refs since for the next
+		 * attempt the composition could have changed, thus start over.
 		 */
-		state->bucket--;
+		for (i = 0; i < iter->end_sk; i++) {
+			sock_put(iter->batch[i]);
+			iter->batch[i] = NULL;
+		}
+		state->bucket = orig_bucket;
+		iter->offset = orig_offset;
It does not need to start over from the orig_bucket. Once it has advanced to
the next bucket (state->bucket++), the orig_bucket is done; going back to it
would mean making backward progress on state->bucket. The too-small batch
size happens on the current state->bucket, so it should retry with the same
state->bucket after realloc_batch(). If the state->bucket happens to be the
orig_bucket (meaning it has not advanced), it will skip the same orig_offset.

If the orig_bucket had changed (e.g. gained more sockets since the last time
it was batched) after state->bucket++, that is arguably fine because those
sockets were added after the orig_bucket had been completely captured in a
batch. The same goes for (orig_bucket-1), which could have changed during the
whole udp_table iteration.
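
[An untested sketch of that minimal fix; the realloc has already dropped the
old refs, so the retry simply stays on the current bucket:]

	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
		resized = true;
		/* After allocating a larger batch, retry one more time to
		 * grab the whole bucket, i.e. the same state->bucket; no
		 * "state->bucket--" here, that would be backward progress.
		 */
		goto again;
	}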
Fair, in relation to the above I was thinking to avoid such an inconsistency,
hence the drop of the refs.
Ah, for the ref drop: bpf_iter_udp_realloc_batch() already does the sock_put
on iter->batch[]; it happens in bpf_iter_udp_put_batch(). When it retries
with a larger iter->batch[] array, it does retry from the very first udp_sk
of the state->bucket again. Thus, the intention is to make a best effort to
capture the whole bucket under the "spin_lock_bh(&hslot2->lock)".
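
[For reference, roughly what those helpers look like in net/ipv4/udp.c;
quoted from memory here, so treat it as a sketch:]

	static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
	{
		while (iter->cur_sk != iter->end_sk)
			sock_put(iter->batch[iter->cur_sk++]);
	}

	static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
					      unsigned int new_batch_sz)
	{
		struct sock **new_batch;

		new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
					   GFP_USER | __GFP_NOWARN);
		if (!new_batch)
			return -ENOMEM;

		/* drops the refs on the old batch before the retry */
		bpf_iter_udp_put_batch(iter);
		kvfree(iter->batch);
		iter->batch = new_batch;
		iter->max_sk = new_batch_sz;

		return 0;
	}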
I think it's probably fine to argue either way on how the semantics should
be, as long as the code doesn't get overly complex in covering this corner
case.
Thanks,
Daniel