Hello, Joel!

> Hello, Vlad,
>
> Looks like a nice patch, I am still looking at this more and just
> started, but a few small comments:
>
Thank you for looking at it :) And thank you for the comments.

> On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony)
> <urezki@xxxxxxxxx> wrote:
> >
> > A call to synchronize_rcu() can be optimized from a latency point
> > of view. Different workloads can be affected by this, especially
> > the ones which use this API in their time-critical sections.
> >
> > For example, if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> > callback can be delayed, and such delay depends on where in a nocb
> > list it is located.
> >
> > 1. On our Android devices I can easily trigger the scenario where
> > it is the last in the list out of ~3600 callbacks:
> >
> > <snip>
> > <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> > ...
> > <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> > <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> > <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> > <snip>
> >
> > 2. We use cpuset/cgroup to classify tasks and assign them into
> > different cgroups. For example, the "background" group binds tasks
> > only to little CPUs, whereas "foreground" makes use of all CPUs.
> > Tasks can be migrated between groups on request if an acceleration
> > is needed.
> >
> > See below an example of how the "surfaceflinger" task gets
> > migrated. Initially it is located in the "system-background"
> > cgroup, which allows it to run only on little cores. In order to
> > speed it up, it can be temporarily moved into the "foreground"
> > cgroup, which allows it to use big/all CPUs:
> >
> [...]
> > 3. To address this drawback, maintain a separate track that
> > consists of synchronize_rcu() callers only. The GP kthread, which
> > drives a GP, either wakes up a worker to drain the whole list or
> > directly wakes up the end user if it is the only one in the drain
> > list.
>
> I wonder if there is a cut-off N, where waking up N ("a few") inline
> instead of just 1, yields similar results. For small values of N, that
> keeps the total number of wakeups lower than pushing off to a kworker.
> For instance, with 2 users you get 3 wakeups instead of just 2 (1
> for the kworker and another 2 for the synchronize). But if you had a
> cut-off like N=5, then 2 users result in just 2 wakeups.
> I don't feel too strongly about it, but for small values of N, I am
> interested in a slightly better optimization if we can squeeze it.
>
This makes sense to me. We can add a threshold and wake up several
users directly from the GP kthread, like you mentioned, so that with
two users we do 2 wakeups instead of 3. A rough, untested sketch of
this idea is at the end of this mail.

> [...]
> > + * There are three lists for handling synchronize_rcu() users.
> > + * A first list corresponds to new coming users, second for users
> > + * which wait for a grace period and third is for which a grace
> > + * period is passed.
> > + */
> > +static struct sr_normal_state {
> > +	struct llist_head curr; /* request a GP users. */
> > +	struct llist_head wait; /* wait for GP users. */
> > +	struct llist_head done; /* ready for GP users. */
> > +	struct llist_node *curr_tail;
> > +	struct llist_node *wait_tail;
>
> Just for clarification, the only reason you need 'tail' is because you
> can use llist_add_batch() to quickly splice the list, instead of
> finding the tail each time (expensive for a large list), correct? It
> would be good to mention that in a comment here.
>
Right. I do not want to scan a potentially long list just to find its
tail. I will add a comment about it. There is a small llist_add_batch()
illustration at the end of this mail.

> > +	atomic_t active;
> > +} sr;
> > +
> > +/* Disable it by default. */
> > +static int rcu_normal_wake_from_gp = 0;
> > +module_param(rcu_normal_wake_from_gp, int, 0644);
> > +
> > +static void rcu_sr_normal_complete(struct llist_node *node)
> > +{
> > +	struct rcu_synchronize *rs = container_of(
> > +		(struct rcu_head *) node, struct rcu_synchronize, head);
> > +	unsigned long oldstate = (unsigned long) rs->head.func;
> > +
> > +	if (!poll_state_synchronize_rcu(oldstate))
> > +		WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> > +			rcu_seq_diff(get_state_synchronize_rcu(), oldstate));
>
> nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate), ...) and get
> rid of if() ?
>
Initially I had it written the way you propose, but I reworked it
because I wanted to see the delta. I do not have a strong opinion,
so I can fix it as you proposed.

> > +
> > +	/* Finally. */
> > +	complete(&rs->completion);
> > +}
> > +
> > +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> > +{
> > +	struct llist_node *done, *rcu, *next;
> > +
> > +	done = llist_del_all(&sr.done);
> > +	if (!done)
> > +		return;
> > +
> > +	llist_for_each_safe(rcu, next, done)
> > +		rcu_sr_normal_complete(rcu);
> > +}
> [...]
> > +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > +{
> > +	atomic_inc(&sr.active);
> > +	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> > +		/* Set the tail. Only first and one user can do that. */
> > +		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> > +	atomic_dec(&sr.active);
>
> Here there is no memory ordering provided by the atomic ops. Is that
> really Ok?
>
This needs to be reworked since no ordering is guaranteed here. I think
there is a fully ordered variant, for example atomic_inc_return(), that
guarantees it? A possible variant with explicit barriers is sketched at
the end of this mail.

> And same question for other usages of .active.
>
> Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
> "RMW operations that have no return value are unordered;"
> --
> Note to self to ping my Android friends as well about this improvement
> :-). Especially the -mm Android folks are interested in app startup
> time.
>
That is good. Thank you :)

--
Uladzislau Rezki
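
Below is a rough, completely untested sketch of the "wake a few users
inline" idea mentioned above. The SR_NORMAL_INLINE_WAKE_MAX threshold
and the sr_wakeup_work item are made-up names for illustration only; it
reuses sr.done and rcu_sr_normal_complete() from the patch:

<snip>
#define SR_NORMAL_INLINE_WAKE_MAX 5

/* Hypothetical work item backed by rcu_sr_normal_gp_cleanup_work(). */
static DECLARE_WORK(sr_wakeup_work, rcu_sr_normal_gp_cleanup_work);

/* Called from the GP kthread once a grace period has elapsed. */
static void rcu_sr_normal_gp_cleanup(void)
{
	struct llist_node *done, *rcu, *next;
	int count = 0;

	/* Count how many users are ready to be awoken. */
	llist_for_each(rcu, READ_ONCE(sr.done.first))
		if (++count > SR_NORMAL_INLINE_WAKE_MAX)
			break;

	if (count > SR_NORMAL_INLINE_WAKE_MAX) {
		/* Too many of them, offload the wake-ups to a kworker. */
		queue_work(system_highpri_wq, &sr_wakeup_work);
		return;
	}

	/* Only a few users, complete them right from the GP kthread. */
	done = llist_del_all(&sr.done);
	llist_for_each_safe(rcu, next, done)
		rcu_sr_normal_complete(rcu);
}
<snip>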
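
And here is a small illustration (not the actual patch code) of why the
*_tail pointers are kept. With both the head and the tail of "curr" at
hand, the GP kthread can move the whole batch of requests onto the
"wait" list in O(1) with llist_add_batch(), instead of walking "curr"
to find its last node. The function name is made up, and the sketch
ignores the synchronization that sr.active provides against a not yet
visible curr_tail:

<snip>
static void rcu_sr_normal_gp_init_sketch(void)
{
	struct llist_node *head, *tail;

	/* The tail is set once by the first user and then stays put. */
	tail = READ_ONCE(sr.curr_tail);
	head = llist_del_all(&sr.curr);
	if (!head)
		return;

	/* Splice the whole head..tail batch onto the wait list in one go. */
	llist_add_batch(head, tail, &sr.wait);
	WRITE_ONCE(sr.wait_tail, tail);
}
<snip>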
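
Finally, regarding the ordering around sr.active, one possible
direction is shown below. Per Documentation/atomic_t.txt, RMW atomics
that return a value are fully ordered, and the void variants can be
upgraded with smp_mb__{before,after}_atomic(). Whether these barriers
and their pairing on the GP-kthread side are the right ones depends on
how .active is actually consumed, so please treat this purely as a
sketch:

<snip>
static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
{
	/* Order the increment of "active" before the list update. */
	atomic_inc(&sr.active);
	smp_mb__after_atomic();

	if (llist_add((struct llist_node *) &rs->head, &sr.curr))
		/* Set the tail. Only the first user can do that. */
		WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);

	/* Order the list/tail updates before dropping "active". */
	smp_mb__before_atomic();
	atomic_dec(&sr.active);
}
<snip>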