Hello, On Thu, Jun 27, 2024 at 06:11:48PM -0700, Alexei Starovoitov wrote: > > +struct bpf_iter_scx_dsq_kern { > > + struct scx_dsq_node cursor; > > + struct scx_dispatch_q *dsq; > > + u32 dsq_seq; > > + u32 flags; > > +} __attribute__((aligned(8))); > > + > > +struct bpf_iter_scx_dsq { > > + u64 __opaque[12]; > > +} __attribute__((aligned(8))); > > I think this is a bit too much to put on the prog stack. > Folks are working on increasing this limit and moving > the stack into "divided stack", so it won't be an issue eventually, > but let's find a way to reduce it. Yeah, it is kinda big. Do you have some idea on where the boundary between okay and too big would fall on? > It seems to me scx_dsq_node has a bunch of fields, > but if I'm reading the code correctly this patch is > only using cursor.list part of it ? Great point. Cursors used to have to go on the rbtrees too but that's no longer the case, so I should be able to drop the rbnode which should help reducing the size substantially. I'll see what I can do. > Another alternative is to use bpf_mem_alloc() like we do > in bpf_iter_css_task and others? This might be okay but given that this can be used pretty frequently (e.g. every scheduling event) and it *seems* possible to reduce its size substantially, I'd like to keep it on stack if possible. > > +__bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, > > + u64 flags) > > +{ > > + struct bpf_iter_scx_dsq_kern *kit = (void *)it; > > + > > + BUILD_BUG_ON(sizeof(struct bpf_iter_scx_dsq_kern) > > > + sizeof(struct bpf_iter_scx_dsq)); > > + BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) != > > + __alignof__(struct bpf_iter_scx_dsq)); > > + > > + if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS) > > + return -EINVAL; > > + > > + kit->dsq = find_non_local_dsq(dsq_id); > > + if (!kit->dsq) > > + return -ENOENT; > > + > > + INIT_LIST_HEAD(&kit->cursor.list); > > + RB_CLEAR_NODE(&kit->cursor.priq); > > + kit->cursor.flags = SCX_TASK_DSQ_CURSOR; > > Are these two assignments really necessary? > Something inside nldsq_next_task() is using that? > > > + kit->dsq_seq = READ_ONCE(kit->dsq->seq); > > + kit->flags = flags; I'm a bit confused whether you're referring to the statements above or below, but AFAICS, they're all used except for kit->cursor.priq. - SCX_TASK_DSQ_CURSOR assignment is what tells nldsq_next_task() that the node is a cursor, not a real task, and thus should be skipped for internal iterations. - kit->dsq_seq is used by bpf_iter_scx_dsq_next() to ignore tasks that are queued after the iteration has started. This, among other things, guarantees that p->scx.dsq_vtime increases monotonically throughout iteration. - kit->flags carries SCX_DSQ_ITER_REV which tells bpf_iter_scx_dsq_next() the direction of the iteration. Thanks. -- tejun