Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator

Yonghong Song <yhs@xxxxxx> · Wed, 29 Apr 2020 13:15:02 -0700

On 4/29/20 12:19 PM, Andrii Nakryiko wrote:
On Wed, Apr 29, 2020 at 8:34 AM Alexei Starovoitov <ast@xxxxxx> wrote:

On 4/28/20 11:44 PM, Yonghong Song wrote:

On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@xxxxxx> wrote:

On 4/28/20 11:20 PM, Yonghong Song wrote:

On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@xxxxxx> wrote:

On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
On 4/28/20 6:15 PM, Yonghong Song wrote:

On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
+    prog = bpf_iter_get_prog(seq, sizeof(struct bpf_iter_seq_map_info),
+                 &meta.session_id, &meta.seq_num,
+                 v == (void *)0);
    From looking at seq_file.c, when will show() be called with
"v == NULL"?

that v == NULL here and the whole verifier change just to allow
NULL...
maybe use seq_num as an indicator of the last elem instead?
Like seq_num with upper bit set to indicate that it's last?

We could. But then the verifier won't have an easy way to verify that.
For example, the following is expected:

         int prog(struct bpf_map *map, u64 seq_num) {
            if (seq_num >> 63)
              return 0;
            ... map->id ...
            ... map->user_cnt ...
         }

But if user writes

         int prog(struct bpf_map *map, u64 seq_num) {
             ... map->id ...
             ... map->user_cnt ...
         }

the verifier cannot easily conclude that the map pointer is used
improperly here, and in the above the map->id and map->user_cnt
accesses will cause exceptions and will silently read as value 0.

I mean always pass valid object pointer into the prog.
In above case 'map' will always be valid.
Consider a prog that is iterating all map elements.
It's weird that the prog would always need to do
if (map == 0)
      goto out;
even if it doesn't care about finding last.
All progs would have to have such extra 'if'.
If we always pass a valid object then there is no need
for such extra checks inside the prog.
First and last element can be indicated via seq_num
or via another flag or via helper call like is_this_last_elem()
or something.
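
A minimal sketch of the seq_num flag idea, assuming bit 63 is
reserved as the "last element" marker. The macro and helper names
below are hypothetical and do not come from the patch; the sketch
only shows how a prog could test the flag while still receiving a
valid object pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: reserve bit 63 of seq_num as a "last element" marker. */
#define ITER_SEQ_LAST (1ULL << 63)

/* Combine a plain element index with the "last" marker. */
static inline uint64_t iter_seq_encode(uint64_t index, bool is_last)
{
	return is_last ? (index | ITER_SEQ_LAST) : index;
}

/* What a prog would check instead of comparing the object to NULL. */
static inline bool iter_seq_is_last(uint64_t seq_num)
{
	return (seq_num >> 63) != 0;
}

/* Recover the plain index by masking off the marker bit. */
static inline uint64_t iter_seq_index(uint64_t seq_num)
{
	return seq_num & ~ITER_SEQ_LAST;
}
```

With this encoding a prog that does not care about the last element
can ignore the flag entirely, which is the point of always passing a
valid object.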

Okay, I see what you mean now. Basically this means
seq_ops->next() should try to get/maintain next two elements,

What about the case when there are no elements to iterate to begin
with? In that case, we still need to call bpf_prog for (empty)
post-aggregation, but we have no valid element... For bpf_map
iteration we could have fake empty bpf_map that would be passed, but
I'm not sure it's applicable for any type of object (e.g., having a
fake task_struct is probably quite a bit more problematic?)...

Oh, yes, thanks for reminding me of this. I put a call to
bpf_prog in seq_ops->stop() especially to handle no object
case. In that case, seq_ops->start() will return NULL,
seq_ops->next() won't be called, and then seq_ops->stop()
is called. My earlier attempt tried to hook into next(),
but I found that it does not work in all cases.

wait a sec. seq_ops->stop() is not the end.
With lseek of seq_file it can be called multiple times.

Yes, I have taken care of this. When the object is NULL, the
bpf program will be called. When the object is NULL again,
it won't be called. The private data remembers that it has
been called with NULL.
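
The "remember it in private data" logic can be sketched roughly as
follows. This is an illustration only: the struct, its field names
and the prog_calls counter (a stand-in for actually running the BPF
program) are assumptions, not code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-iterator private data. */
struct iter_priv {
	bool done_stop;   /* set once the prog has run with a NULL object */
	int prog_calls;   /* stand-in for invoking the BPF program */
};

/* stop() may be called many times; run the prog for NULL only once. */
static void iter_stop(struct iter_priv *priv, void *v)
{
	if (v != NULL)          /* stopped mid-iteration, not the end */
		return;
	if (priv->done_stop)    /* end-of-iteration call already made */
		return;
	priv->prog_calls++;
	priv->done_stop = true;
}
```

Repeated stop(NULL) calls after the first one are then no-ops, and
stop() calls with a non-NULL object never trigger the prog.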

Even without lseek stop() will be called multiple times.
If I read seq_file.c correctly it will be called before
every copy_to_user(). Which means that for a lot of text
(or if read() is done with small buffer) there will be
plenty of start,show,show,stop sequences.

Right start/stop can be called multiple times, but seems like there
are clear indicators of beginning of iteration and end of iteration:
- start() with seq_num == 0 is start of iteration (can be called
multiple times, if first element overflows buffer);
- stop() with p == NULL is end of iteration (seems like can be called
multiple times as well, if user keeps read()'ing after iteration
completed).

There is another problem with stop(), though. If BPF program will
attempt to output anything during stop(), that output will be just
discarded. Not great. Especially if that output overflows and we need

The stop() output will not be discarded in the following cases:
   - regular show() objects overflow and stop() BPF program not called
   - regular show() objects not overflow, which means iteration is done,
     and stop() BPF program does not overflow.

The stop() seq_file output will be discarded if
   - regular show() objects not overflow and stop() BPF program output
     overflows.
   - no objects to iterate, BPF program got called, but its seq_file
     write/printf will be discarded.

Two options here:
  - implement Alexei's suggestion to look ahead two elements so
    that the prog always gets a valid object, and indicate the
    last element with a special flag.
  - per Andrii's suggestion below, implement a new way or
    tweak seq_file() a little bit to resolve the above cases
    where stop() seq_file output is discarded.

Will try to experiment with both above options...
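
For the first option, a rough user-space sketch of the look-ahead
idea, assuming the elements live in a plain array so that "is there
a next element" is a simple index check. All names here are
hypothetical; a real iterator would have to peek at the next kernel
object instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator over a plain array. */
struct lookahead_iter {
	const int *items;
	size_t count;
	size_t pos;       /* index of the element to hand out next */
};

/*
 * Hand out the current element together with an is_last flag computed
 * by looking one element ahead. Returns false when nothing is left.
 */
static bool lookahead_next(struct lookahead_iter *it, int *out, bool *is_last)
{
	if (it->pos >= it->count)
		return false;
	*out = it->items[it->pos];
	*is_last = (it->pos + 1 == it->count);
	it->pos++;
	return true;
}
```

Note that with zero elements lookahead_next() returns false on the
first call, so there is no valid object to pass to the prog at all;
that is the empty-iteration case raised earlier in the thread.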

to re-allocate buffer.

We are trying to use seq_file just to reuse 140 lines of code in
seq_read(), which is no magic, just a simple double buffer and retry
piece of logic. We don't need lseek and traverse, we don't need all
the escaping stuff. I think bpf_iter implementation would be much
simpler if bpf_iter had better control over iteration. Then this whole
"end of iteration" behavior would be crystal clear. Should we maybe
reconsider again?

I understand we want to re-use networking iteration code, but we can
still do that with custom implementation of seq_read, because we are
still using struct seq_file and follow its semantics. The change would
be to allow stop(NULL) (or any stop() call for that matter) to perform
output (and handle retry and buffer re-allocation). Or, alternatively,
coupled with seq_operations intercept proposal in patch #7 discussion,
we can add extra method (e.g., finish()) that would be called after
all elements are traversed and will allow to emit extra stuff. We can
do that (implement finish()) in seq_read, as well, if that's going to
fly ok with seq_file maintainers, of course.