Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Mon, 6 Nov 2023 13:17:14 -0800

On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 11/6/23 15:07, Yosry Ahmed wrote:
> > On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@xxxxxxxxxx> wrote:
> >> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> >> the updated tree in a way to pop out the leaf nodes first before
> >> their parents. This can cause traversal of multiple nodes before a
> >> leaf node can be found and popped out. IOW, a given node in the tree
> >> can be visited multiple times before the whole operation is done. So
> >> it is not very efficient and the code can be hard to read.
> >>
> >> With the introduction of cgroup_rstat_updated_list() to build a list
> >> of cgroups to be flushed first before any flushing operation is being
> >> done, we can optimize the way the updated tree nodes are being popped
> >> by pushing the parents first to the tail end of the list before their
> >> children. In this way, most updated tree nodes will be visited only
> >> once with the exception of the subtree root as we still need to go
> >> back to its parent and popped it out of its updated_children list.
> >> This also makes the code easier to read.
> >>
> >> A parallel kernel build on a 2-socket x86-64 server is used as the
> >> benchmarking tool for measuring the lock hold time. Below were the lock
> >> hold time frequency distribution before and after the patch:
> >>
> >>       Hold time        Before patch       After patch
> >>       ---------        ------------       -----------
> >>         0-01 us        13,738,708         14,594,545
> >>        01-05 us         1,177,194            439,926
> >>        05-10 us             4,984              5,960
> >>        10-15 us             3,562              3,543
> >>        15-20 us             1,314              1,397
> >>        20-25 us                18                 25
> >>        25-30 us                12                 12
> >>
> >> It can be seen that the patch pushes the lock hold time towards the
> >> lower end.
> >>
> >> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> >> ---
> > I don't know why git decided to show this diff in the most confusing
> > way possible.
> I agree. The diff is really hard to look at. It will be easier to apply
> the patch & looks at the actual rstat.c file.

Would the diff be simpler if patches 1 & 2 were squashed?

[..]
> >
> >>    *
> >>    * The only ordering guarantee is that, for a parent and a child pair
> >> - * covered by a given traversal, if a child is visited, its parent is
> >> - * guaranteed to be visited afterwards.
> >> + * covered by a given traversal, the child is before its parent in
> >> + * the list.
> >> + *
> >> + * Note that updated_children is self terminated while updated_next is
> >> + * parent cgroup terminated except the cgroup root which can be self
> >> + * terminated.
> > IIUC updated_children and updated_next is the same list.
> > updated_children is the head, and updated_next is how the list items
> > are linked. This comment makes it seem like they are two different
> > lists.
> Thanks for the comment. I will rework the comment to clarify that a bit
> more.
> >
> > I am actually wondering if it's worth using the singly linked list
> > here. We are saving 8 bytes percpu, but the semantics are fairly
> > confusing. Wouldn't this be easier to reason about if you just use
> > list_head?
> >
> > updated_children would be replaced with LIST_HEAD (or similar), and
> > the list would be NULL terminated instead of terminated by self/parent
> > cgroup. IIUC the reason it's not NULL-terminated now is because we use
> > cgroup->updated_next to check quickly if a cgroup is on the list or
> > not. If we use list_heads, we can just use list_emtpy() IIUC.
> >
> > We can also simplify the semantics of unlinking @root from the updated
> > tree below, it would just be list_del() IIUC, which is actually more
> > performant as well. It seems like overall we would simplify a lot of
> > things. When forming the updated_list, we can just walk the tree and
> > splice the lists in the correct order.
> >
> > It seems to me that saving 8 bytes percpu is not worth the complexity
> > of the custom list semantics here. Am I missing something here?
>
> It will cost an additional 16 bytes of percpu memory if converted to
> list_heads. Like other lists, there will be sibling and children
> list_heads. There are also 2 pointers to update instead of one. Anyway,
> I don't have an objection to convert them to list_heads if agreed by Tejun.

Yes you are right. It's definitely not free, but it's also not super
costly. It's just that every time I look at the rstat code I need to
remind myself of how updated_next and updated_children work. I will
let Tejun decide.