Re: [PATCH sched_ext/for-6.11 2/2] sched_ext: Implement scx_bpf_consume_task()

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 28 Jun 2024 15:34:01 -0700

On Fri, Jun 28, 2024 at 3:08 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello, Alexei.
>
> On Thu, Jun 27, 2024 at 06:34:14PM -0700, Alexei Starovoitov wrote:
> ...
> > > +__bpf_kfunc bool __scx_bpf_consume_task(unsigned long it, struct task_struct *p)
> > > +{
> > > +       struct bpf_iter_scx_dsq_kern *kit = (void *)it;
> > > +       struct scx_dispatch_q *dsq, *kit_dsq;
> > > +       struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
> > > +       struct rq *task_rq;
> > > +       u64 kit_dsq_seq;
> > > +
> > > +       /* can't trust @kit, carefully fetch the values we need */
> > > +       if (get_kernel_nofault(kit_dsq, &kit->dsq) ||
> > > +           get_kernel_nofault(kit_dsq_seq, &kit->dsq_seq)) {
> > > +               scx_ops_error("invalid @it 0x%lx", it);
> > > +               return false;
> > > +       }
> >
> > With scx_bpf_consume_task() it's only a compile time protection from bugs.
> > Since kfunc doesn't dereference any field in kit_dsq it won't crash
> > immediately, but let's figure out how to make it work properly.
> >
> > Since kit_dsq and kit_dsq_seq are pretty much anything in this implementation
> > can they be passed as two scalars instead ?
> > I guess not, since tricking dsq != kit_dsq and
> > time_after64(..,kit_dsq_seq) can lead to real issues ?
>
> That actually should be okay. It can lead to real but not crashing issues.
> The system integrity is going to be fine no matter what the passed in seq
> value is. It can just lead to confusing behaviors from the BPF scheduler's
> POV, so it's fine to put the onus on the BPF scheduler.
>
> > Can some of it be mitigated by passing dsq into kfunc that
> > was used to init the iter ?
> > Then kfunc will read dsq->seq from it instead of kit->dsq_seq ?
>
> I don't quite follow this part. bpf_iter_scx_dsq_new() takes @dsq_id. The
> function looks up the matching DSQ and then the iterator remembers the
> current dsq->seq which serves as the threshold (tasks queued afterwards are
> ignored). ie. The value needs to be copied at that point to guarantee that
> iteration ignores tasks that are queued after the iteration started.
>
> > > +       /*
> > > +        * Did someone else get to it? @p could have already left $dsq, got
> > > +        * re-enqueud, or be in the process of being consumed by someone else.
> > > +        */
> > > +       if (unlikely(p->scx.dsq != dsq ||
> > > +                    time_after64(p->scx.dsq_seq, kit_dsq_seq) ||
> >
> > In the previous patch you do:
> > (s32)(p->scx.dsq_seq - kit->dsq_seq) > 0
> > and here
> > time_after64().
> > Close enough, but 32 vs 64 and equality difference?
>
> Sorry about the sloppiness. It was originally u64 and then I forgot to
> update here after changing them to u32. I'll add a helper for the comparison
> and update both sites.
>
> Going back to the sequence number barrier, it's a sort of scoping and one
> way to solve it is adding an explicit helper to fetch the target DSQ's
> sequence number and then pass it to the consume_task function. ie. sth like:
>
>         barrier_seq = scx_bpf_dsq_seq(dsq_id);
>         bpf_for_each(scx_dsq, p, dsq_id, 0) {
>                 ...
>                 scx_bpf_consume_task(p, dsq_id, barrier_seq);
>         }
>
> This should work but it's not as neat in that it now involves three dsq_id
> -> DSQ lookups. Also, there's extra subtlety arising from @barrier_seq being
> different from the barrier seq that the scx_dsq iterator would be using.

maybe a stupid question, but why scx_dsq iterator cannot accept
scx_dispatch_q pointer directly instead of dsq_id and then doing
lookup? I.e., what if you had a kfunc to do dsq_id -> scx_dispatch_q
lookup (returning PTR_TRUSTED instance), and then you can pass that to
iterator, you can pass that to scx_bpf_consume_task() kfunc.

Technically, you can also have another kfunc accepting scx_dispatch_q
and returning current "timestamp" as some special type (TRUSTED and
all), which will be passed into consume_task() as well.

Is this too explicit in terms of types?

>
> As a DSQ iteration needs to have its own barrier sequence, maybe the answer
> is to require passing it in as an explicit parameter. ie.:
>
>         barrier_seq = scx_bpf_dsq_seq(dsq_id);
>         bpf_for_each(scx_dsq, p, dsq_id, barrier_seq, 0) {
>                 ...
>                 scx_bpf_consume_task(p, dsq_id, barrier_seq);
>         }
>
> There still are three dsq_id lookups but at least there is just one sequence
> number in play. It is more cumbersome tho compared to the current interface:
>
>         bpf_for_each(scx_dsq, p, dsq_id, 0) {
>                 ...
>                 scx_bpf_consume_task(BPF_FOR_EACH_ITER, p);
>         }
>
> What do you think?
>
> Thanks.
>
> --
> tejun
>