On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> the updated tree in a way to pop out the leaf nodes first before
> their parents. This can cause traversal of multiple nodes before a
> leaf node can be found and popped out. IOW, a given node in the tree
> can be visited multiple times before the whole operation is done. So
> it is not very efficient and the code can be hard to read.
>
> With the introduction of cgroup_rstat_updated_list() to build a list
> of cgroups to be flushed first before any flushing operation is being
> done, we can optimize the way the updated tree nodes are being popped
> by pushing the parents first to the tail end of the list before their
> children. In this way, most updated tree nodes will be visited only
> once with the exception of the subtree root as we still need to go
> back to its parent and popped it out of its updated_children list.
> This also makes the code easier to read.
>
> A parallel kernel build on a 2-socket x86-64 server is used as the
> benchmarking tool for measuring the lock hold time. Below were the lock
> hold time frequency distribution before and after the patch:
>
>      Hold time       Before patch     After patch
>      ---------       ------------     -----------
>       0-01 us          13,738,708      14,594,545
>      01-05 us           1,177,194         439,926
>      05-10 us               4,984           5,960
>      10-15 us               3,562           3,543
>      15-20 us               1,314           1,397
>      20-25 us                  18              25
>      25-30 us                  12              12
>
> It can be seen that the patch pushes the lock hold time towards the
> lower end.
>
> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> ---

I don't know why git decided to show this diff in the most confusing
way possible.

>  kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
>  1 file changed, 70 insertions(+), 62 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 1f300bf4dc40..d2b709cfeb2a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  }
>
>  /**
> - * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
> - * @pos: current position
> - * @root: root of the tree to traversal
> + * cgroup_rstat_push_children - push children cgroups into the given list
> + * @head: current head of the list (= parent cgroup)
> + * @prstatc: cgroup_rstat_cpu of the parent cgroup
>   * @cpu: target cpu
> + * Return: A new singly linked list of cgroups to be flush
>   *
> - * Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts
> - * the traversal and %NULL return indicates the end. During traversal,
> - * each returned cgroup is unlinked from the tree. Must be called with the
> - * matching cgroup_rstat_cpu_lock held.
> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
> + * parent first before its children. The parent is pushed by the caller.

I think it might be useful here (and elsewhere in the patch where "push"
is used) to elaborate that we push to the beginning of the list in a
stack-like fashion.
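Just to spell out what I mean by "stack-like", here is a toy userspace
snippet (nothing to do with the actual cgroup structs, names made up
purely for illustration):

    #include <stdio.h>

    /* minimal singly linked node, purely for illustration */
    struct node {
            const char *name;
            struct node *next;
    };

    /* "push" in the sense used by the patch: the new element becomes the head */
    static struct node *push(struct node *head, struct node *n)
    {
            n->next = head;
            return n;
    }

    int main(void)
    {
            struct node parent = { "parent", NULL };
            struct node child1 = { "child1", NULL };
            struct node child2 = { "child2", NULL };
            struct node *head = NULL, *pos;

            head = push(head, &parent);     /* pushed by the caller */
            head = push(head, &child1);     /* pushed by the helper */
            head = push(head, &child2);

            /* prints child2, child1, parent: children end up before the parent */
            for (pos = head; pos; pos = pos->next)
                    printf("%s\n", pos->name);
            return 0;
    }

Spelling that out would also make it obvious where the "child is before
its parent in the list" guarantee further down comes from.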
> + * The recursion depth is the depth of the current updated tree.
> + */
> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
> +                                                struct cgroup_rstat_cpu *prstatc, int cpu)
> +{
> +       struct cgroup *child, *parent;
> +       struct cgroup_rstat_cpu *crstatc;
> +
> +       parent = head;
> +       child = prstatc->updated_children;
> +       prstatc->updated_children = parent;
> +
> +       /* updated_next is parent cgroup terminated */
> +       while (child != parent) {
> +               child->rstat_flush_next = head;
> +               head = child;
> +               crstatc = cgroup_rstat_cpu(child, cpu);
> +               if (crstatc->updated_children != parent)

I think cgroup->updated_children is set to the cgroup itself if it's
empty, right? Shouldn't this be crstatc->updated_children != child?

> +                       head = cgroup_rstat_push_children(head, crstatc, cpu);
> +               child = crstatc->updated_next;
> +               crstatc->updated_next = NULL;
> +       }
> +       return head;
> +}
> +
> +/**
> + * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
> + * @root: root of the cgroup subtree to traverse
> + * @cpu: target cpu
> + * Return: A singly linked list of cgroups to be flushed
> + *
> + * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
> + * each returned cgroup is unlinked from the updated tree. Must be called
> + * with the matching cgroup_rstat_cpu_lock held.

This function actually takes care of holding the lock itself. I think
that sentence should be applied to cgroup_rstat_push_children() above?

>   *
>   * The only ordering guarantee is that, for a parent and a child pair
> - * covered by a given traversal, if a child is visited, its parent is
> - * guaranteed to be visited afterwards.
> + * covered by a given traversal, the child is before its parent in
> + * the list.
> + *
> + * Note that updated_children is self terminated while updated_next is
> + * parent cgroup terminated except the cgroup root which can be self
> + * terminated.

IIUC updated_children and updated_next form the same list:
updated_children is the head, and updated_next is how the list items
are linked. This comment makes it seem like they are two different
lists.

I am actually wondering if it's worth using the singly linked list
here. We are saving 8 bytes percpu, but the semantics are fairly
confusing.

Wouldn't this be easier to reason about if you just use list_head?
updated_children would be replaced with LIST_HEAD (or similar), and the
list would be NULL terminated instead of terminated by the self/parent
cgroup. IIUC the reason it's not NULL-terminated now is that we use
cgroup->updated_next to check quickly whether a cgroup is on the list
or not. If we use list_heads, we can just use list_empty() IIUC.

We can also simplify the semantics of unlinking @root from the updated
tree below; it would just be list_del() IIUC, which is actually more
performant as well.

It seems like overall we would simplify a lot of things. When forming
the updated_list, we can just walk the tree and splice the lists in the
correct order. It seems to me that saving 8 bytes percpu is not worth
the complexity of the custom list semantics here.

Am I missing something here?
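To make that more concrete, below is a rough, completely untested sketch
of the kind of thing I mean. The field name updated_node and the two
helpers are made up for illustration only, and I haven't thought about
locking or the splicing step at all:

    #include <linux/list.h>

    /* sketch only -- not the real cgroup_rstat_cpu definition */
    struct cgroup_rstat_cpu {
            /* ... existing fields ... */
            struct list_head updated_children; /* updated children of this cgroup */
            struct list_head updated_node;     /* entry in the parent's updated_children */
    };

    /* "is this cgroup already on its parent's updated list?" */
    static inline bool rstatc_on_updated_list(struct cgroup_rstat_cpu *rstatc)
    {
            return !list_empty(&rstatc->updated_node);
    }

    /* unlinking @root from the updated tree becomes a plain list_del */
    static inline void rstatc_unlink(struct cgroup_rstat_cpu *rstatc)
    {
            /* _init variant so list_empty() keeps working as the "on list?" test */
            list_del_init(&rstatc->updated_node);
    }

Forming the per-cpu list of cgroups to flush would then presumably be a
walk of the tree plus list_splice_tail_init() or similar, but I haven't
sketched that part.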