Re: [PATCH bpf-next v1 03/19] bpf: add bpf_map iterator

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 29 Apr 2020 12:25:53 -0700

On Tue, Apr 28, 2020 at 11:51 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 4/28/20 11:34 PM, Martin KaFai Lau wrote:
> > On Tue, Apr 28, 2020 at 11:20:30PM -0700, Yonghong Song wrote:
> >>
> >>
> >> On 4/28/20 11:08 PM, Andrii Nakryiko wrote:
> >>> On Tue, Apr 28, 2020 at 10:10 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/28/20 7:44 PM, Alexei Starovoitov wrote:
> >>>>> On 4/28/20 6:15 PM, Yonghong Song wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 4/28/20 5:48 PM, Alexei Starovoitov wrote:
> >>>>>>> On 4/28/20 5:37 PM, Martin KaFai Lau wrote:
> >>>>>>>>> +    prog = bpf_iter_get_prog(seq, sizeof(struct
> >>>>>>>>> bpf_iter_seq_map_info),
> >>>>>>>>> +                 &meta.session_id, &meta.seq_num,
> >>>>>>>>> +                 v == (void *)0);
> >>>>>>>>    From looking at seq_file.c, when will show() be called with "v ==
> >>>>>>>> NULL"?
> >>>>>>>>
> >>>>>>>
> >>>>>>> that v == NULL here and the whole verifier change just to allow NULL...
> >>>>>>> may be use seq_num as an indicator of the last elem instead?
> >>>>>>> Like seq_num with upper bit set to indicate that it's last?
> >>>>>>
> >>>>>> We could. But then verifier won't have an easy way to verify that.
> >>>>>> For example, the above is expected:
> >>>>>>
> >>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>            if (seq_num >> 63)
> >>>>>>              return 0;
> >>>>>>            ... map->id ...
> >>>>>>            ... map->user_cnt ...
> >>>>>>         }
> >>>>>>
> >>>>>> But if user writes
> >>>>>>
> >>>>>>         int prog(struct bpf_map *map, u64 seq_num) {
> >>>>>>             ... map->id ...
> >>>>>>             ... map->user_cnt ...
> >>>>>>         }
> >>>>>>
> >>>>>> the verifier can't easily detect the improper map pointer use
> >>>>>> here, and the map->id and map->user_cnt accesses above will cause
> >>>>>> exceptions and silently read as 0.
> >>>>>
> >>>>> I mean always pass valid object pointer into the prog.
> >>>>> In above case 'map' will always be valid.
> >>>>> Consider prog that iterating all map elements.
> >>>>> It's weird that the prog would always need to do
> >>>>> if (map == 0)
> >>>>>      goto out;
> >>>>> even if it doesn't care about finding last.
> >>>>> All progs would have to have such extra 'if'.
> >>>>> If we always pass valid object then there is no need
> >>>>> for such extra checks inside the prog.
> >>>>> First and last element can be indicated via seq_num
> >>>>> or via another flag or via helper call like is_this_last_elem()
> >>>>> or something.
> >>>>
> >>>> Okay, I see what you mean now. Basically this means
> >>>> seq_ops->next() should try to get/maintain next two elements,
> >>>
> >>> What about the case when there are no elements to iterate to begin
> >>> with? In that case, we still need to call bpf_prog for (empty)
> >>> post-aggregation, but we have no valid element... For bpf_map
> >>> iteration we could have fake empty bpf_map that would be passed, but
> >>> I'm not sure it's applicable for any type of object (e.g., having a
> >>> fake task_struct is probably quite a bit more problematic?)...
> >>
> >> Oh, yes, thanks for reminding me of this. I put a call to
> >> bpf_prog in seq_ops->stop() especially to handle no object
> >> case. In that case, seq_ops->start() will return NULL,
> >> seq_ops->next() won't be called, and then seq_ops->stop()
> >> is called. My earlier attempt tries to hook with next()
> >> and then find it not working in all cases.
> >>
> >>>
> >>>> otherwise, we won't know whether the one in seq_ops->show()
> >>>> is the last or not.
> > I think "show()" is convoluted with "stop()/eof()".  Could "stop()/eof()"
> > be its own separate (and optional) bpf_prog which only does "stop()/eof()"?
>
> I thought this before. But user need to write a program instead of
> a simple "if" condition in the main program...
>

I agree with Yonghong: requiring the user to check for NULL is pretty
trivial, and the verifier can give a very clear error message if the
user didn't check.
PTR_TO_BTF_ID_OR_NULL seems useful in general as well: it's an
optional typed input argument and might be useful in other
situations. The verifier changes don't seem excessive either.
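
To make the cost of that check concrete, here is a rough user-space
model of the calling convention (not kernel code and not the patch
itself). struct fake_map, struct agg, prog() and iterate() are names
made up for illustration; a real program would receive a BTF-typed
struct bpf_map pointer and the verifier would enforce the NULL check:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_map, for illustration only. */
struct fake_map {
	unsigned int id;
	unsigned int user_cnt;
};

/* Hypothetical aggregation state kept across calls. */
struct agg {
	unsigned long long total_users;
	unsigned long long last_seq;
	unsigned int calls;
	int done;
};

/* Models the iterator program: map == NULL marks the final call. */
static int prog(const struct fake_map *map, unsigned long long seq_num,
		struct agg *agg)
{
	agg->calls++;
	agg->last_seq = seq_num;

	if (map == NULL) {
		/* End of iteration: the place to do post-aggregation. */
		agg->done = 1;
		return 0;
	}

	/* Safe to dereference only after the NULL check above. */
	agg->total_users += map->user_cnt;
	return 0;
}

/* Models the driver: one call per element, then one call with NULL. */
static void iterate(const struct fake_map *maps, size_t n, struct agg *agg)
{
	unsigned long long seq_num = 0;
	size_t i;

	assert(agg != NULL);

	for (i = 0; i < n; i++)
		prog(&maps[i], seq_num++, agg);

	/* Also reached when n == 0, so the empty case needs no fake object. */
	prog(NULL, seq_num, agg);
}
```

Note that the empty case falls out naturally: with zero elements the
program is still called exactly once, with NULL.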

Having two coupled BPF programs to do a single iteration becomes awkward
to manage, and will complicate the kernel interface (e.g., special
variants of LINK_CREATE and LINK_UPDATE) and the libbpf implementation.
It's also going to be harder to replace them atomically. I think overall
the cons outweigh the pros.

As one way to maybe simplify it for users a bit, we can make this
post-aggregation call optional with an extra flag on BPF_PROG_LOAD.
Unless the extra flag is specified, input arguments can stay
PTR_TO_BTF_ID and we'll just get non-NULL inputs and no "end of
iteration" call. With the extra flag, inputs become
PTR_TO_BTF_ID_OR_NULL and there is one extra call at the end.
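
Roughly, the opt-in behavior could be modeled in user space like this.
This is only a sketch under assumptions: the flag name
HYPOTHETICAL_F_END_CALL, struct demo_map, and run_iter() are all
invented here for illustration, and no such BPF_PROG_LOAD flag exists
in the patch set:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical load-time flag; the name and value are made up. */
#define HYPOTHETICAL_F_END_CALL 0x1u

/* Hypothetical stand-in for the iterated object. */
struct demo_map {
	unsigned int id;
};

typedef int (*iter_prog_t)(const struct demo_map *map, void *ctx);

/* Models the driver; returns how many times the program was invoked. */
static unsigned int run_iter(const struct demo_map *maps, size_t n,
			     unsigned int load_flags, iter_prog_t prog,
			     void *ctx)
{
	unsigned int calls = 0;
	size_t i;

	assert(prog != NULL);

	for (i = 0; i < n; i++) {
		prog(&maps[i], ctx);
		calls++;
	}

	/* The extra "end of iteration" call happens only on opt-in. */
	if (load_flags & HYPOTHETICAL_F_END_CALL) {
		prog(NULL, ctx);
		calls++;
	}
	return calls;
}

/* Loaded without the flag: never sees NULL, so no check is needed. */
static int sum_ids(const struct demo_map *map, void *ctx)
{
	*(unsigned int *)ctx += map->id;
	return 0;
}

struct end_ctx {
	unsigned int seen;
	int finished;
};

/* Loaded with the flag: must handle the NULL input. */
static int count_until_end(const struct demo_map *map, void *ctx)
{
	struct end_ctx *c = ctx;

	if (map == NULL) {
		c->finished = 1;
		return 0;
	}
	c->seen++;
	return 0;
}
```

Programs that don't care about post-aggregation stay as simple as
sum_ids(); only programs that opt in pay for the extra check.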

> >
> >>>> We could do it in newly implemented
> >>>> iterator bpf_map/task/task_file. Let me check how I could
> >>>> make existing seq_ops (ipv6_route/netlink) works with
> >>>> minimum changes.