* Wen Congyang (wency@xxxxxxxxxxxxxx) wrote:
> On 05/29/2015 04:42 PM, Dr. David Alan Gilbert wrote:
> > * zhanghailiang (zhang.zhanghailiang@xxxxxxxxxx) wrote:
> >> On 2015/5/29 9:29, Wen Congyang wrote:
> >>> On 05/29/2015 12:24 AM, Dr. David Alan Gilbert wrote:
> >>>> * zhanghailiang (zhang.zhanghailiang@xxxxxxxxxx) wrote:

<snip>

> >>>> The colo-proxy rcu problem I hit shows as rcu-stalls in both primary and secondary
> >>>> after the qemu quits; the backtrace of the qemu stack is:
> >>>
> >>> How to reproduce it? Use monitor command quit to quit qemu? Or kill the qemu?
> >>>
> >>>>
> >>>> [<ffffffff810d8c0c>] wait_rcu_gp+0x5c/0x80
> >>>> [<ffffffff810ddb05>] synchronize_rcu+0x45/0xd0
> >>>> [<ffffffffa0a251e5>] colo_node_release+0x35/0x50 [nfnetlink_colo]
> >>>> [<ffffffffa0a25795>] colonl_close_event+0xe5/0x160 [nfnetlink_colo]
> >>>> [<ffffffff81090c96>] notifier_call_chain+0x66/0x90
> >>>> [<ffffffff8109154c>] atomic_notifier_call_chain+0x6c/0x110
> >>>> [<ffffffff815eee07>] netlink_release+0x5b7/0x7f0
> >>>> [<ffffffff815878bf>] sock_release+0x1f/0x90
> >>>> [<ffffffff81587942>] sock_close+0x12/0x20
> >>>> [<ffffffff812193c3>] __fput+0xd3/0x210
> >>>> [<ffffffff8121954e>] ____fput+0xe/0x10
> >>>> [<ffffffff8108d9f7>] task_work_run+0xb7/0xf0
> >>>> [<ffffffff81002d4d>] do_notify_resume+0x8d/0xa0
> >>>> [<ffffffff81722b66>] int_signal+0x12/0x17
> >>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>
> >>> Thanks for your test. The backtrace is very useful, and we will fix it soon.
> >>>
> >>
> >> Yes, it is a bug, the callback function colonl_close_event() is called when holding
> >> rcu lock:
> >> netlink_release
> >>     ->atomic_notifier_call_chain
> >>         ->rcu_read_lock();
> >>         ->notifier_call_chain
> >>             ->ret = nb->notifier_call(nb, val, v);
> >> And here it is wrong to call synchronize_rcu which will lead to sleep.
> >> Besides, there is another function might lead to sleep, kthread_stop which is called
> >> in destroy_notify_cb.
> >>
> >>>>
> >>>> that's with both the 423a8e268acbe3e644a16c15bc79603cfe9eb084 from yesterday and
> >>>> older e58e5152b74945871b00a88164901c0d46e6365e tags on colo-proxy.
> >>>> I'm not sure of the right fix; perhaps it might be possible to replace the
> >>>> synchronize_rcu in colo_node_release by a call_rcu that does the kfree later?
> >>>
> >>> I agree with it.
> >>
> >> That is a good solution, i will fix both of the above problems.
> >
> > Thanks,
>
> We have fix this problem, and test it. The patch is pushed to github, please try it.

Yes, that works.  Thank you very much for the quick fix.

Dave

>
> Thanks
> Wen Congyang
>
> >
> > Dave
> >
> >>
> >> Thanks,
> >> zhanghailiang
> >>
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dave
> >>>>
> > --
> > Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
> > --
> > To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK
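
For anyone reading this thread later: the patch that was pushed is not
quoted here, so the following is only a rough user-space sketch of the idea
suggested above (defer the free through call_rcu instead of blocking in
synchronize_rcu). It is not the actual nfnetlink_colo change. The
mock_call_rcu() and mock_grace_period_elapsed() helpers, the callback name
colo_node_free_rcu(), and the fields of struct colo_node are all assumptions
made up for illustration; in the kernel the real call_rcu() and kfree() would
be used and the grace period is handled by the RCU core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's struct rcu_head (illustration only). */
struct rcu_head {
    struct rcu_head *next;
    void (*func)(struct rcu_head *head);
};

static struct rcu_head *pending;   /* callbacks waiting for a grace period */
static int freed_nodes;            /* counts nodes freed by the callback */

/* Mock of call_rcu(): queue the callback and return without blocking. */
static void mock_call_rcu(struct rcu_head *head,
                          void (*func)(struct rcu_head *head))
{
    head->func = func;
    head->next = pending;
    pending = head;
}

/*
 * Mock of "a grace period has elapsed": run every queued callback.
 * Returns how many callbacks were run.
 */
static int mock_grace_period_elapsed(void)
{
    int ran = 0;

    while (pending) {
        struct rcu_head *head = pending;

        pending = head->next;   /* unlink first, the callback frees head */
        head->func(head);
        ran++;
    }
    return ran;
}

/* Hypothetical layout; the real struct colo_node is not shown in the thread. */
struct colo_node {
    int index;
    struct rcu_head rcu;
};

/* Deferred-free callback: recover the node from its embedded rcu_head. */
static void colo_node_free_rcu(struct rcu_head *head)
{
    struct colo_node *node =
        (struct colo_node *)((char *)head - offsetof(struct colo_node, rcu));

    free(node);                 /* kfree(node) in the kernel */
    freed_nodes++;
}

/*
 * Release path that is safe to call from atomic context: it only queues
 * the free and never waits for readers, unlike synchronize_rcu().
 */
static void colo_node_release(struct colo_node *node)
{
    mock_call_rcu(&node->rcu, colo_node_free_rcu);
}
```

The point of the pattern is that colo_node_release() returns immediately, so
it no longer sleeps while the notifier chain holds rcu_read_lock(). It does
not address the kthread_stop call in destroy_notify_cb mentioned above, which
would need its own fix.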