On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@xxxxxxxxxx> wrote: > > On 11/6/23 15:07, Yosry Ahmed wrote: > > On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@xxxxxxxxxx> wrote: > >> The current design of cgroup_rstat_cpu_pop_updated() is to traverse > >> the updated tree in a way to pop out the leaf nodes first before > >> their parents. This can cause traversal of multiple nodes before a > >> leaf node can be found and popped out. IOW, a given node in the tree > >> can be visited multiple times before the whole operation is done. So > >> it is not very efficient and the code can be hard to read. > >> > >> With the introduction of cgroup_rstat_updated_list() to build a list > >> of cgroups to be flushed first before any flushing operation is being > >> done, we can optimize the way the updated tree nodes are being popped > >> by pushing the parents first to the tail end of the list before their > >> children. In this way, most updated tree nodes will be visited only > >> once with the exception of the subtree root as we still need to go > >> back to its parent and popped it out of its updated_children list. > >> This also makes the code easier to read. > >> > >> A parallel kernel build on a 2-socket x86-64 server is used as the > >> benchmarking tool for measuring the lock hold time. Below were the lock > >> hold time frequency distribution before and after the patch: > >> > >> Hold time Before patch After patch > >> --------- ------------ ----------- > >> 0-01 us 13,738,708 14,594,545 > >> 01-05 us 1,177,194 439,926 > >> 05-10 us 4,984 5,960 > >> 10-15 us 3,562 3,543 > >> 15-20 us 1,314 1,397 > >> 20-25 us 18 25 > >> 25-30 us 12 12 > >> > >> It can be seen that the patch pushes the lock hold time towards the > >> lower end. > >> > >> Signed-off-by: Waiman Long <longman@xxxxxxxxxx> > >> --- > > I don't know why git decided to show this diff in the most confusing > > way possible. > I agree. The diff is really hard to look at. It will be easier to apply > the patch & looks at the actual rstat.c file. Would the diff be simpler if patches 1 & 2 were squashed? [..] > > > >> * > >> * The only ordering guarantee is that, for a parent and a child pair > >> - * covered by a given traversal, if a child is visited, its parent is > >> - * guaranteed to be visited afterwards. > >> + * covered by a given traversal, the child is before its parent in > >> + * the list. > >> + * > >> + * Note that updated_children is self terminated while updated_next is > >> + * parent cgroup terminated except the cgroup root which can be self > >> + * terminated. > > IIUC updated_children and updated_next is the same list. > > updated_children is the head, and updated_next is how the list items > > are linked. This comment makes it seem like they are two different > > lists. > Thanks for the comment. I will rework the comment to clarify that a bit > more. > > > > I am actually wondering if it's worth using the singly linked list > > here. We are saving 8 bytes percpu, but the semantics are fairly > > confusing. Wouldn't this be easier to reason about if you just use > > list_head? > > > > updated_children would be replaced with LIST_HEAD (or similar), and > > the list would be NULL terminated instead of terminated by self/parent > > cgroup. IIUC the reason it's not NULL-terminated now is because we use > > cgroup->updated_next to check quickly if a cgroup is on the list or > > not. If we use list_heads, we can just use list_emtpy() IIUC. > > > > We can also simplify the semantics of unlinking @root from the updated > > tree below, it would just be list_del() IIUC, which is actually more > > performant as well. It seems like overall we would simplify a lot of > > things. When forming the updated_list, we can just walk the tree and > > splice the lists in the correct order. > > > > It seems to me that saving 8 bytes percpu is not worth the complexity > > of the custom list semantics here. Am I missing something here? > > It will cost an additional 16 bytes of percpu memory if converted to > list_heads. Like other lists, there will be sibling and children > list_heads. There are also 2 pointers to update instead of one. Anyway, > I don't have an objection to convert them to list_heads if agreed by Tejun. Yes you are right. It's definitely not free, but it's also not super costly. It's just that every time I look at the rstat code I need to remind myself of how updated_next and updated_children work. I will let Tejun decide.