Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 29 Apr 2020 12:19:18 -0700

On Wed, Apr 29, 2020 at 8:34 AM Alexei Starovoitov <ast@xxxxxx> wrote:
>
> On 4/28/20 11:44 PM, Yonghong Song wrote:
> >
> >
> > On 4/28/20 11:40 PM, Andrii Nakryiko wrote:
> >> On Tue, Apr 28, 2020 at 11:30 PM Alexei Starovoitov <ast@xxxxxx> wrote:
> >>>
> >>> On 4/28/20 11:20 PM, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> >>>>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> >>>>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>>>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>>>>>>>> bpf_iter_seq_map_info),
> >>>>>>>>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>>>>>>>> +                 v == (void *)0);
> >>>>>>>>>>    From looking at seq_file.c, when will show() be called with
> >>>>>>>>>> "v ==
> >>>>>>>>>> NULL"?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> that v == NULL here and the whole verifier change just to allow
> >>>>>>>>> NULL...
> >>>>>>>>> may be use seq_num as an indicator of the last elem instead?
> >>>>>>>>> Like seq_num with upper bit set to indicate that it's last?
> >>>>>>>>
> >>>>>>>> We could. But then verifier won't have an easy way to verify that.
> >>>>>>>> For example, the above is expected:
> >>>>>>>>
> >>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>>>            if (seq_num >> 63)
> >>>>>>>>              return 0;
> >>>>>>>>            ... map->id ...
> >>>>>>>>            ... map->user_cnt ...
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> But if user writes
> >>>>>>>>
> >>>>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>>>             ... map->id ...
> >>>>>>>>             ... map->user_cnt ...
> >>>>>>>>         }
> >>>>>>>>
> >>>>>>>> verifier won't be easy to conclude inproper map pointer tracing
> >>>>>>>> here and in the above map->id, map->user_cnt will cause
> >>>>>>>> exceptions and they will silently get value 0.
> >>>>>>>
> >>>>>>> I mean always pass valid object pointer into the prog.
> >>>>>>> In above case 'map' will always be valid.
> >>>>>>> Consider prog that iterating all map elements.
> >>>>>>> It's weird that the prog would always need to do
> >>>>>>> if (map == 0)
> >>>>>>>      goto out;
> >>>>>>> even if it doesn't care about finding last.
> >>>>>>> All progs would have to have such extra 'if'.
> >>>>>>> If we always pass valid object than there is no need
> >>>>>>> for such extra checks inside the prog.
> >>>>>>> First and last element can be indicated via seq_num
> >>>>>>> or via another flag or via helper call like is_this_last_elem()
> >>>>>>> or something.
> >>>>>>
> >>>>>> Okay, I see what you mean now. Basically this means
> >>>>>> seq_ops->next() should try to get/maintain next two elements,
> >>>>>
> >>>>> What about the case when there are no elements to iterate to begin
> >>>>> with? In that case, we still need to call bpf_prog for (empty)
> >>>>> post-aggregation, but we have no valid element... For bpf_map
> >>>>> iteration we could have fake empty bpf_map that would be passed, but
> >>>>> I'm not sure it's applicable for any time of object (e.g., having a
> >>>>> fake task_struct is probably quite a bit more problematic?)...
> >>>>
> >>>> Oh, yes, thanks for reminding me of this. I put a call to
> >>>> bpf_prog in seq_ops->stop() especially to handle no object
> >>>> case. In that case, seq_ops->start() will return NULL,
> >>>> seq_ops->next() won't be called, and then seq_ops->stop()
> >>>> is called. My earlier attempt tries to hook with next()
> >>>> and then find it not working in all cases.
> >>>
> >>> wait a sec. seq_ops->stop() is not the end.
> >>> With lseek of seq_file it can be called multiple times.
> >
> > Yes, I have taken care of this. when the object is NULL,
> > bpf program will be called. When the object is NULL again,
> > it won't be called. The private data remembers it has
> > been called with NULL.
>
> Even without lseek stop() will be called multiple times.
> If I read seq_file.c correctly it will be called before
> every copy_to_user(). Which means that for a lot of text
> (or if read() is done with small buffer) there will be
> plenty of start,show,show,stop sequences.

Right start/stop can be called multiple times, but seems like there
are clear indicators of beginning of iteration and end of iteration:
- start() with seq_num == 0 is start of iteration (can be called
multiple times, if first element overflows buffer);
- stop() with p == NULL is end of iteration (seems like can be called
multiple times as well, if user keeps read()'ing after iteration
completed).

There is another problem with stop(), though. If BPF program will
attempt to output anything during stop(), that output will be just
discarded. Not great. Especially if that output overflows and we need
to re-allocate buffer.

We are trying to use seq_file just to reuse 140 lines of code in
seq_read(), which is no magic, just a simple double buffer and retry
piece of logic. We don't need lseek and traverse, we don't need all
the escaping stuff. I think bpf_iter implementation would be much
simpler if bpf_iter had better control over iteration. Then this whole
"end of iteration" behavior would be crystal clear. Should we maybe
reconsider again?

I understand we want to re-use networking iteration code, but we can
still do that with custom implementation of seq_read, because we are
still using struct seq_file and follow its semantics. The change would
be to allow stop(NULL) (or any stop() call for that matter) to perform
output (and handle retry and buffer re-allocation). Or, alternatively,
coupled with seq_operations intercept proposal in patch #7 discussion,
we can add extra method (e.g., finish()) that would be called after
all elements are traversed and will allow to emit extra stuff. We can
do that (implement finish()) in seq_read, as well, if that's going to
fly ok with seq_file maintainers, of course.