On Wed, Aug 3, 2022 at 12:44 AM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/1/22 10:54 AM, Hao Luo wrote:
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> >
> >  - walking a cgroup's descendants in pre-order.
> >  - walking a cgroup's descendants in post-order.
> >  - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
> >
> > Currently only one session is supported, which means, depending on the
> > volume of data bpf program intends to send to user space, the number
> > of cgroups that can be walked is limited. For example, given the current
> > buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
> > be walked is 512. This is a limitation of cgroup_iter. If the output
> > data is larger than the buffer size, the second read() will signal
> > EOPNOTSUPP. In order to work around, the user may have to update their
>
> 'the second read() will signal EOPNOTSUPP' is not true. for bpf_iter,
> we have user buffer from read() syscall and kernel buffer. The above
> buffer size like 8 * PAGE_SIZE refers to the kernel buffer size.
>
> If read() syscall buffer size is less than kernel buffer size,
> the second read() will not signal EOPNOTSUPP. So to make it precise,
> we can say
>    If the output data is larger than the kernel buffer size, after
>    all data in the kernel buffer is consumed by user space, the
>    subsequent read() syscall will signal EOPNOTSUPP.
>

Thanks Yonghong. Will update.

> > program to reduce the volume of data sent to output. For example, skip
> > some uninteresting cgroups. In future, we may extend bpf_iter flags to
> > allow customizing buffer size.
> >
> > Acked-by: Yonghong Song <yhs@xxxxxx>
> > Acked-by: Tejun Heo <tj@xxxxxxxxxx>
> > Signed-off-by: Hao Luo <haoluo@xxxxxxxxxx>
> > ---
[...]
> > + *
> > + * Currently only one session is supported, which means, depending on the
> > + * volume of data bpf program intends to send to user space, the number
> > + * of cgroups that can be walked is limited. For example, given the current
> > + * buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > + * cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
> > + * be walked is 512. This is a limitation of cgroup_iter. If the output data
> > + * is larger than the buffer size, the second read() will signal EOPNOTSUPP.
> > + * In order to work around, the user may have to update their program to
>
> same here as above for better description.
>

SG. Will update.

> > + * reduce the volume of data sent to output. For example, skip some
> > + * uninteresting cgroups.
> > + */
> > +
> > +struct bpf_iter__cgroup {
> > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +	__bpf_md_ptr(struct cgroup *, cgroup);
> > +};
> > +
> > +struct cgroup_iter_priv {
> > +	struct cgroup_subsys_state *start_css;
> > +	bool visited_all;
> > +	bool terminate;
> > +	int order;
> > +};
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct cgroup_iter_priv *p = seq->private;
> > +
> > +	mutex_lock(&cgroup_mutex);
> > +
> > +	/* cgroup_iter doesn't support read across multiple sessions. */
> > +	if (*pos > 0) {
> > +		if (p->visited_all)
> > +			return NULL;
>
> This looks good. thanks!
>
> > +
> > +		/* Haven't visited all, but because cgroup_mutex has dropped,
> > +		 * return -EOPNOTSUPP to indicate incomplete iteration.
> > +		 */
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +	}
> > +
> > +	++*pos;
> > +	p->terminate = false;
> > +	p->visited_all = false;
> > +	if (p->order == BPF_ITER_CGROUP_PRE)
> > +		return css_next_descendant_pre(NULL, p->start_css);
> > +	else if (p->order == BPF_ITER_CGROUP_POST)
> > +		return css_next_descendant_post(NULL, p->start_css);
> > +	else /* BPF_ITER_CGROUP_PARENT_UP */
> > +		return p->start_css;
> > +}
> > +
[...]
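
For reference, here is a rough user-space sketch of the read() behaviour
discussed above. It is not part of the patch. The names
max_cgroups_per_session(), drain_iter_fd() and ITER_KERNEL_BUF_PAGES are
hypothetical, and the 8-page kernel buffer is taken from the commit message,
not from a header. The iterator fd is assumed to have been created elsewhere
(for example with libbpf's bpf_iter_create()). The helper treats EOPNOTSUPP
as "walk incomplete", which only shows up after everything already in the
kernel buffer has been consumed, however small the user buffer is.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Assumption: kernel buffer is 8 pages, as stated in the commit message. */
#define ITER_KERNEL_BUF_PAGES 8

/* Upper bound on cgroups that fit in one session's kernel buffer. */
static size_t max_cgroups_per_session(size_t page_size, size_t bytes_per_cgroup)
{
	assert(bytes_per_cgroup > 0);
	return ITER_KERNEL_BUF_PAGES * page_size / bytes_per_cgroup;
}

/*
 * Read fd until EOF. Returns the number of bytes read, or -errno on an
 * unexpected error. If the read stops with EOPNOTSUPP, *truncated is set
 * to 1 and the bytes consumed so far are returned.
 */
static long drain_iter_fd(int fd, char *buf, size_t len, int *truncated)
{
	long total = 0;

	*truncated = 0;
	for (;;) {
		ssize_t n = read(fd, buf, len);

		if (n > 0) {
			total += n;	/* a real caller would parse buf here */
			continue;
		}
		if (n == 0)
			return total;	/* clean EOF: the walk completed */
		if (errno == EINTR)
			continue;
		if (errno == EOPNOTSUPP) {
			*truncated = 1;	/* kernel buffer overflowed, walk is partial */
			return total;
		}
		return -errno;
	}
}
```

With PAGE_SIZE of 4kb and 64B per cgroup, max_cgroups_per_session(4096, 64)
gives the 512 figure from the commit message. A caller that sees *truncated
set would need to make the bpf program emit less data per walk, e.g. by
skipping uninteresting cgroups.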