Hello, Ziwei Dai! > Hi Uladzislau and all, > > Sorry for the disclaimer in the original mail. > Please help comment in this new thread. > > We found this issue at K5.15. We try to fix this issue on K5.15. > It seems mainline also has this issue. > > Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far. > ============================================================ > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 66951e130c2fc..44759641f7234 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3342,15 +3342,21 @@ static void kfree_rcu_monitor(struct work_struct *work) > // Attempt to start a new batch. > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > + bool rcu_work_pending; > > // Try to detach bkvhead or head and attach it over any > // available corresponding free channel. It can be that > // a previous RCU batch is in progress, it means that > // immediately to queue another one is not possible so > // in that case the monitor work is rearmed. > - if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) || > - (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) || > - (krcp->head && !krwp->head_free)) { > + rcu_work_pending = test_bit( > + WORK_STRUCT_PENDING_BIT, > + work_data_bits(&krwp->rcu_work.work)); > + // If there is on-going rcu work, continue. > + if (rcu_work_pending || krwp->bkvhead_free[0] || > + krwp->bkvhead_free[1] || krwp->head_free) > + continue; > + if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) { > > As " rcu_work_pending" judgement seems redundant, I made the second patch below on k5.15. We will make stress test. > ============================================================ > Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far. > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 66951e130c2fc..f219c60a8ec30 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3343,14 +3343,13 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bkvhead or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) || > - (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) || > - (krcp->head && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only when > + // all channels are free. Any channel is not free means at krwp > + // there is on-going rcu work to handle krwp's free business. > + if (krwp->bkvhead_free[0] || krwp->bkvhead_free[1] || > + krwp->head_free) > + continue; > + if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) { > > > > -----邮件原件----- > > 发件人: 代子为 (Ziwei Dai) <Ziwei.Dai@xxxxxxxxxx> > > 发送时间: 2023年3月30日 17:27 > > 收件人: paulmck@xxxxxxxxxx; frederic@xxxxxxxxxx; > > quic_neeraju@xxxxxxxxxxx; josh@xxxxxxxxxxxxxxxx; rostedt@xxxxxxxxxxx; > > mathieu.desnoyers@xxxxxxxxxxxx; jiangshanlai@xxxxxxxxx; > > joel@xxxxxxxxxxxxxxxxx; rcu@xxxxxxxxxxxxxxx > > 抄送: linux-kernel@xxxxxxxxxxxxxxx; 王双 (Shuang Wang) > > <shuang.wang@xxxxxxxxxx>; 辛依凡 (Yifan Xin) <Yifan.Xin@xxxxxxxxxx>; > > 王科 (Ke Wang) <Ke.Wang@xxxxxxxxxx>; 闫学文 (Xuewen Yan) > > <Xuewen.Yan@xxxxxxxxxx>; 牛志国 (Zhiguo Niu) <Zhiguo.Niu@xxxxxxxxxx>; > > 代子为 (Ziwei Dai) <Ziwei.Dai@xxxxxxxxxx>; 黄朝阳 (Zhaoyang Huang) > > <zhaoyang.huang@xxxxxxxxxx> > > 主题: [PATCH] rcu: Make sure new krcp free business is handled after the > > wanted rcu grace period. > > > > From: 代子为 (Ziwei Dai) <ziwei.dai@xxxxxxxxxxxxxxxxxxxxxxxxxxx> > > > > In kfree_rcu_monitor(), new free business at krcp is attached to any free > > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > > business is handled after the rcu grace period. But if there is any none-free > > channel at krwp already, that means there is an on-going rcu work, which will > > cause the kvfree_call_rcu()-triggered free business is done before the wanted > > rcu grace period ends. > > > > This commit ignores krwp which has non-free channel at kfree_rcu_monitor(), > > to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > Below is the css_set obj "from_cset" use-after-free issue caused by > > kvfree_call_rcu() losing effectiveness. > > Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes. > > Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new > > gp. > > Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated. > > Core 0 references "from_cset"'s member, which causes crash. > > > > Core 0 Core 1 Core 2 > > count_memcg_event_mm() > > |rcu_read_lock() <--- > > |mem_cgroup_from_task() > > |// <css_set ptr> is the "from_cset" mentioned on core 1 |<css_set ptr> = > > rcu_dereference((task)->cgroups) |// Hard irq comes, current task is > > scheduled out. > > > > Core 1: > > cgroup_attach_task() > > |cgroup_migrate() > > |cgroup_migrate_execute() > > |css_set_move_task(task, from_cset, to_cset, true) > > |cgroup_move_task(task, to_cset) > > |rcu_assign_pointer(.., to_cset) > > |... > > |cgroup_migrate_finish() > > |put_css_set_locked(from_cset) > > |from_cset->refcount return 0 > > |kfree_rcu(cset, rcu_head) <--- means to free from_cset > > after new gp > > |add_ptr_to_bulk_krc_lock() > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > kfree_rcu_monitor() > > |krcp->bulk_head[0]'s work attached to > > krwp->bulk_head_free[] > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT > > state, > > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > > > // There is a perious call_rcu(.., > > rcu_work_rcufn) > > // gp end, rcu_work_rcufn() is called. > > rcu_work_rcufn() > > |__queue_work(.., rwork->wq, > > &rwork->work); > > Core 2: > > // or there is a pending > > kfree_rcu_work() work called. > > |kfree_rcu_work() > > |krwp->bulk_head_free[0] bulk is > > freed before new gp end!!! > > |The "from_cset" mentioned on core > > 1 is freed before new gp end. > > Core 0: > > // the task is schedule in after many ms. > > |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because > > <css_set ptr>="from_cset" is freed. > > > > Signed-off-by: Ziwei Dai <ziwei.dai@xxxxxxxxxx> > > > > :# modified: tree.c > > --- > > kernel/rcu/tree.c | 19 ++++++++++--------- > > 1 file changed, 10 insertions(+), 9 deletions(-) > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8e880c0..f6451a8 > > 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct > > work_struct *work) > > for (i = 0; i < KFREE_N_BATCHES; i++) { > > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > > > - // Try to detach bulk_head or head and attach it over any > > - // available corresponding free channel. It can be that > > - // a previous RCU batch is in progress, it means that > > - // immediately to queue another one is not possible so > > - // in that case the monitor work is rearmed. > > - if ((!list_empty(&krcp->bulk_head[0]) && > > list_empty(&krwp->bulk_head_free[0])) || > > - (!list_empty(&krcp->bulk_head[1]) && > > list_empty(&krwp->bulk_head_free[1])) || > > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > > - > > + // Try to detach bulk_head or head and attach it, only when > > + // all channels are free. Any channel is not free means at krwp > > + // there is on-going rcu work to handle krwp's free business. > > + if (!list_empty(&krwp->bulk_head_free[0]) || > > + !list_empty(&krwp->bulk_head_free[1]) || > > + krwp->head_free) > > + continue; Can we replace it with a new helper, for example: +static bool +can_krcw_offload(struct kfree_rcu_cpu_work *krwp) +{ + int i; + + for (i = 0; i < FREE_N_CHANNELS; i++) + if (!list_empty(&krwp->bulk_head_free[i])) + return false; + + return (READ_ONCE(krwp->head_free) == NULL); +} and then + if (!can_krcw_offload(krwp)) continue; > > > + if (!list_empty(&krcp->bulk_head[0]) || > > + !list_empty(&krcp->bulk_head[1]) || > > + READ_ONCE(krcp->head)) { > Can we replace it with the: + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. + if (need_offload_krc(krcp)) { ? because we have a helper that is need_offload_krcp() that does the same. > > // Channel 1 corresponds to the SLAB-pointer bulk path. > > // Channel 2 corresponds to vmalloc-pointer bulk path. > > for (j = 0; j < FREE_N_CHANNELS; j++) { > > -- > > 1.9.1 > Also we need to add Fixes tag, you say you are on 5.15 kernel. So something was added there. I will try to double check. And thank you for fixing it!!! -- Uladzislau Rezki