On Thu, Jul 21, 2022 at 11:16 AM Yonghong Song <yhs@xxxxxx> wrote: > > > > On 7/21/22 10:21 AM, Hao Luo wrote: > > On Thu, Jul 21, 2022 at 9:15 AM Yonghong Song <yhs@xxxxxx> wrote: > >> > >> > >> > >> On 7/20/22 5:40 PM, Hao Luo wrote: > >>> On Mon, Jul 11, 2022 at 8:45 PM Yonghong Song <yhs@xxxxxx> wrote: > >>>> > >>>> On 7/11/22 5:42 PM, Hao Luo wrote: > >>> [...] > >>>>>>>> + > >>>>>>>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos) > >>>>>>>> +{ > >>>>>>>> + struct cgroup_iter_priv *p = seq->private; > >>>>>>>> + > >>>>>>>> + mutex_lock(&cgroup_mutex); > >>>>>>>> + > >>>>>>>> + /* support only one session */ > >>>>>>>> + if (*pos > 0) > >>>>>>>> + return NULL; > >>>>>>> > >>>>>>> This might be okay. But want to check what is > >>>>>>> the practical upper limit for cgroups in a system > >>>>>>> and whether we may miss some cgroups. If this > >>>>>>> happens, it will be a surprise to the user. > >>>>>>> > >>>>> > >>>>> Ok. What's the max number of items supported in a single session? > >>>> > >>>> The max number of items (cgroups) in a single session is determined > >>>> by kernel_buffer_size which equals to 8 * PAGE_SIZE. So it really > >>>> depends on how much data bpf program intends to send to user space. > >>>> If each bpf program run intends to send 64B to user space, e.g., for > >>>> cpu, memory, cpu pressure, mem pressure, io pressure, read rate, write > >>>> rate, read/write rate. Then each session can support 512 cgroups. > >>>> > >>> > >>> Hi Yonghong, > >>> > >>> Sorry about the late reply. It's possible that the number of cgroup > >>> can be large, 1000+, in our production environment. But that may not > >>> be common. Would it be good to leave handling large number of cgroups > >>> as follow up for this patch? If it turns out to be a problem, to > >>> alleviate it, we could: > >>> > >>> 1. tell users to write program to skip a certain uninteresting cgroups. > >>> 2. support requesting large kernel_buffer_size for bpf_iter, maybe as > >>> a new bpf_iter flag. > >> > >> Currently if we intend to support multiple read() for cgroup_iter, > >> the following is a very inefficient approach: > >> > >> in seq_file private data structure, remember the last cgroup visited > >> and for the second read() syscall, do the traversal again (but not > >> calling bpf program) until the last cgroup and proceed from there. > >> This is inefficient and probably works. But if the last cgroup is > >> gone from the hierarchy, that the above approach won't work. One > >> possibility is to remember the last two cgroups. If the last cgroup > >> is gone, check the 'next' cgroup based on the one before the last > >> cgroup. If both are gone, we return NULL. > >> > > > > I suspect in reality, just remembering the last cgroup (or two > > cgroups) may not be sufficient. First, I don't want to hold > > cgroup_mutex across multiple sessions. I assume it's also not safe to > > release cgroup_mutex in the middle of walking cgroup hierarchy. > > Supporting multiple read() can be nasty for cgroup_iter. > > Right, holding cgroup_mutex across sessions is not bad idea > and I didn't recommend it either. > > I am aware of remembering last (one or two) cgroups are not > 100% reliable. All other iters have similar issues w.r.t. > across multiple sessions. But the idea is to find a > *reasonable* start for the second and later session. > For example, for task related iter, the previous session > last tid can be remember and the next session starts > with next tid after last tid based on idr. Some old > processes might be gone, but we got a reasonable > approximation. But cgroup is different, holding > the cgroup pointer with an additional reference > across sessions is not good. but holding cgroup > id requires to traverse the cgroup hierarchy > again and this is not efficient. Maybe other people > has a better idea how to do this. > Makes sense. > > > >> But in any case, if there are additional cgroups not visited, > >> in the second read(), we should not return NULL which indicates > >> done with all cgroups. We may return EOPNOTSUPP to indicate there > >> are missing cgroups due to not supported. > >> > >> Once users see EOPNOTSUPP which indicates there are missing > >> cgroups, they can do more filtering in bpf program to avoid > >> large data volume to user space. > >> > > > > Makes sense. Yonghong, one question to confirm, if the first read() > > overflows, does the user still get partial data? > > Yes, the first read() and subsequent read()'s will be okay > until user space receives all 8KB data where 8KB is the > kernel buffer size. For example, if user provides 1KB buffer > size, the first 8 read() syscalls will return proper data > to user space. > > After that, read() should return > EOPNOTSUPP instead of return 0 where '0' indicates > no more data. > Sounds good. Will do that. > > > > > I'll change the return code to EOPNOTSUPP in v4 of this patchset. > > > >> To provide a way to truely visit *all* cgroups, > >> we can either use bpf_iter link_create->flags > >> to increase the buffer size as your suggested in the above so > >> user can try to allocate more kernel buffer size. Or implement > >> proper second read() traversal which I don't have a good idea > >> how to do it efficiently. > > > > I will try the buffer size increase first. Looks more doable. Do you > > mind putting this support as a follow-up? > > If we cannot finalize the solution to completely support > arbitrary user output for cgroup_iter, we need to support > EOPNOTSUPP so user knows it should adjust the bpf program > to have less data to user space through seq_file. For example, > they can put data into mmap-ed array map. Please explain > such a limitation and how to workaround this in commit > message clearly. > Acknowledged. I will put a comment in the code and also explain in the commit message. Thanks! > So yes, to support buffer size increase through link_create > flags or to support a better way to start iteration after 8KB > user data can be a followup. > > > > >>> > >>> Hao > >>> > >>>>> > >>> [...] > >>>>>>> [...]