Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`

Boris Zhmurov <bb@xxxxxxxxxxxxxx> · Mon, 28 Nov 2016 22:16:33 +0300

Paul E. McKenney 28/11/16 18:05:
> On Mon, Nov 28, 2016 at 05:40:48PM +0300, Boris Zhmurov wrote:
>> Paul E. McKenney 28/11/16 17:34:
>>
>>
>>>> So Paul, I've dropped the "mm: Prevent shrink_node_memcg() RCU CPU stall
>>>> warnings" patch, and the stalls came back (attached).
>>>>
>>>> With the patch "commit 7cebc6b63bf75db48cb19a94564c39294fd40959" from
>>>> your tree applied, the stalls are gone, or at least it looks that way.
>>>
>>> So with only this commit and no other commit or configuration adjustment,
>>> everything works?  Or is the solution this commit and some other stuff?
>>>
>>> The reason I ask is that if just this commit does the trick, I should
>>> drop the others.
>>
>> I'd like to ask for some more time to make sure this is it.
>> Approximately 2 or 3 days.
> 
> Works for me!
> 
> 							Thanx, Paul

FYI: some more stalls (log follows), seen with
mm-prevent-shrink_node-RCU-CPU-stall-warning.patch applied
and without mm-prevent-shrink_node_memcg-RCU-CPU-stall-warnings.patch.

-- 
Boris Zhmurov
System/Network Administrator
mailto: bb@xxxxxxxxxxxxxx
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"
[26327.859412] INFO: rcu_sched detected stalls on CPUs/tasks:
[26327.859466] 	18-...: (39 ticks this GP) idle=1ed/140000000000000/0 softirq=3790251/3790251 fqs=24 
[26327.859529] 	(detected by 2, t=6429 jiffies, g=1258488, c=1258487, q=6044)
[26327.859583] Task dump for CPU 18:
[26327.859584] kswapd1         R  running task        0   148      2 0x00000008
[26327.859588]  ffff9e779f411400 ffff9e779096fe68 ffff9e8fffffc000 0000000000000000
[26327.859591]  ffffffffa592404d 0000000000000000 0000000000000000 0000000000000000
[26327.859593]  0000000000000000 ffff9e779096fe58 000000000170bf2c ffff9e8fffffc000
[26327.859596] Call Trace:
[26327.859604]  [<ffffffffa592404d>] ? shrink_node+0xcd/0x2f0
[26327.859606]  [<ffffffffa5924cca>] ? kswapd+0x2ba/0x5e0
[26327.859609]  [<ffffffffa5924a10>] ? mem_cgroup_shrink_node+0x90/0x90
[26327.859612]  [<ffffffffa587bce8>] ? kthread+0xb8/0xd0
[26327.859616]  [<ffffffffa5d1311f>] ? ret_from_fork+0x1f/0x40
[26327.859618]  [<ffffffffa587bc30>] ? kthread_create_on_node+0x170/0x170
[26351.132731] INFO: rcu_sched detected stalls on CPUs/tasks:
[26351.132778] 	(detected by 2, t=6432 jiffies, g=1258490, c=1258489, q=7476)
[26351.132835] All QSes seen, last rcu_sched kthread activity 1405 (4302782902-4302781497), jiffies_till_next_fqs=2, root ->qsmask 0x0
[26351.132917] mc:writer_9     R  running task        0 28495   2101 0x00000008
[26351.132921]  ffffffffa623e600 ffffffffa58b5337 0000000000000000 0000000000000000
[26351.132923]  0000000000001d34 ffffffffa623e600 ffffffffa58be772 ffff9e8c0e54f300
[26351.132925]  0000000000000000 ffff9e78027ffb08 000017f78e9681f9 0000000000000001
[26351.132928] Call Trace:
[26351.132929]  <IRQ>  [<ffffffffa58b5337>] ? rcu_check_callbacks+0x727/0x730
[26351.132939]  [<ffffffffa58be772>] ? update_wall_time+0x382/0x710
[26351.132942]  [<ffffffffa58b8093>] ? update_process_times+0x23/0x50
[26351.132947]  [<ffffffffa58c5bad>] ? tick_sched_handle.isra.15+0x2d/0x40
[26351.132949]  [<ffffffffa58c5bf3>] ? tick_sched_timer+0x33/0x60
[26351.132950]  [<ffffffffa58b879d>] ? __hrtimer_run_queues+0x9d/0x110
[26351.132952]  [<ffffffffa58b8cb4>] ? hrtimer_interrupt+0x94/0x190
[26351.132957]  [<ffffffffa5842b74>] ? smp_apic_timer_interrupt+0x34/0x50
[26351.132961]  [<ffffffffa5d13a82>] ? apic_timer_interrupt+0x82/0x90
[26351.132961]  <EOI>  [<ffffffffa5d12b2c>] ? _raw_spin_unlock_irqrestore+0xc/0x20
[26351.132968]  [<ffffffffa591d8fb>] ? pagevec_lru_move_fn+0xab/0xe0
[26351.132969]  [<ffffffffa591cee0>] ? SyS_readahead+0x90/0x90
[26351.132971]  [<ffffffffa591d9bc>] ? __lru_cache_add+0x4c/0x60
[26351.132974]  [<ffffffffa590efa9>] ? add_to_page_cache_lru+0x59/0xc0
[26351.132976]  [<ffffffffa590f89b>] ? pagecache_get_page+0xcb/0x240
[26351.132979]  [<ffffffffa591096d>] ? grab_cache_page_write_begin+0x1d/0x40
[26351.132998]  [<ffffffffc028c3db>] ? ext4_da_write_begin+0x9b/0x330 [ext4]
[26351.133000]  [<ffffffffa5910afe>] ? generic_perform_write+0xbe/0x1a0
[26351.133003]  [<ffffffffa5998126>] ? file_update_time+0x36/0xe0
[26351.133005]  [<ffffffffa59116b0>] ? __generic_file_write_iter+0x170/0x1d0
[26351.133012]  [<ffffffffc0281d4b>] ? ext4_file_write_iter+0x11b/0x320 [ext4]
[26351.133015]  [<ffffffffa588e4ae>] ? set_next_entity+0x6e/0x770
[26351.133017]  [<ffffffffa588d9ab>] ? put_prev_entity+0x5b/0x6f0
[26351.133019]  [<ffffffffa597ea21>] ? __vfs_write+0xc1/0x120
[26351.133021]  [<ffffffffa597f5c8>] ? vfs_write+0xa8/0x1a0
[26351.133023]  [<ffffffffa598084d>] ? SyS_write+0x3d/0xa0
[26351.133025]  [<ffffffffa5d12ef6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
[26351.133027] rcu_sched kthread starved for 1405 jiffies! g1258490 c1258489 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
[26351.133097] rcu_sched       R  running task        0     8      2 0x00000000
[26351.133099]  ffff9e7792d45080 0000000000000246 ffff9e7792da0000 ffff9e7792d9fe60
[26351.133102]  0000000100773c3b ffff9e7792d9fe00 ffff9e779fc0fa00 0000000100773c39
[26351.133104]  ffffffffa5d0fc8c ffff9e779fc0fa00 ffffffffa5d122b7 0000000ea58817f6
[26351.133107] Call Trace:
[26351.133112]  [<ffffffffa5d0fc8c>] ? schedule+0x2c/0x80
[26351.133114]  [<ffffffffa5d122b7>] ? schedule_timeout+0x127/0x240
[26351.133116]  [<ffffffffa58b7500>] ? del_timer_sync+0x50/0x50
[26351.133119]  [<ffffffffa58b448a>] ? rcu_gp_kthread+0x37a/0x860
[26351.133121]  [<ffffffffa58b4110>] ? force_qs_rnp+0x180/0x180
[26351.133124]  [<ffffffffa587bce8>] ? kthread+0xb8/0xd0
[26351.133126]  [<ffffffffa5d1311f>] ? ret_from_fork+0x1f/0x40
[26351.133128]  [<ffffffffa587bc30>] ? kthread_create_on_node+0x170/0x170