here is a memory.stat output of the cgroup:

# cat /sys/fs/cgroup/system.slice/varnish.service/memory.stat
anon 8113229824
file 39735296
kernel_stack 26345472
slab 24985600
sock 339968
shmem 0
file_mapped 38793216
file_dirty 946176
file_writeback 0
inactive_anon 0
active_anon 8113119232
inactive_file 40198144
active_file 102400
unevictable 0
slab_reclaimable 2859008
slab_unreclaimable 22126592
pgfault 178231449
pgmajfault 22011
pgrefill 393038
pgscan 4218254
pgsteal 430005
pgactivate 295416
pgdeactivate 351487
pglazyfree 0
pglazyfreed 0
workingset_refault 401874
workingset_activate 62535
workingset_nodereclaim 0

Greets,
Stefan

On 26.07.19 at 20:30, Stefan Priebe - Profihost AG wrote:
> On 26.07.19 at 09:45, Michal Hocko wrote:
>> On Thu 25-07-19 23:37:14, Stefan Priebe - Profihost AG wrote:
>>> Hi Michal,
>>>
>>> On 25.07.19 at 16:01, Michal Hocko wrote:
>>>> On Thu 25-07-19 15:17:17, Stefan Priebe - Profihost AG wrote:
>>>>> Hello all,
>>>>>
>>>>> I hope I added the right list and people - if I missed someone I
>>>>> would be happy to know.
>>>>>
>>>>> While using kernel 4.19.55 and cgroup v2 I set a MemoryHigh value
>>>>> for a varnish service.
>>>>>
>>>>> It happens that the varnish.service cgroup reaches its MemoryHigh
>>>>> value and stops working due to throttling.
>>>>
>>>> What do you mean by "stops working"? Does it mean that the process is
>>>> stuck in the kernel doing the reclaim? /proc/<pid>/stack would tell you
>>>> what the kernel is executing for the process.
>>>
>>> The service no longer responds to HTTP requests.
>>>
>>> The stack switches in this case between:
>>> [<0>] io_schedule+0x12/0x40
>>> [<0>] __lock_page_or_retry+0x1e7/0x4e0
>>> [<0>] filemap_fault+0x42f/0x830
>>> [<0>] __xfs_filemap_fault.constprop.11+0x49/0x120
>>> [<0>] __do_fault+0x57/0x108
>>> [<0>] __handle_mm_fault+0x949/0xef0
>>> [<0>] handle_mm_fault+0xfc/0x1f0
>>> [<0>] __do_page_fault+0x24a/0x450
>>> [<0>] do_page_fault+0x32/0x110
>>> [<0>] async_page_fault+0x1e/0x30
>>> [<0>] 0xffffffffffffffff
>>>
>>> and
>>>
>>> [<0>] poll_schedule_timeout.constprop.13+0x42/0x70
>>> [<0>] do_sys_poll+0x51e/0x5f0
>>> [<0>] __x64_sys_poll+0xe7/0x130
>>> [<0>] do_syscall_64+0x5b/0x170
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [<0>] 0xffffffffffffffff
>>
>> Neither of the two seems to be memcg related.
>
> Yes, but at least the xfs one is a page fault - isn't that related?
>
>> Have you tried to get
>> several snapshots and see if the backtrace is stable?
>
> No, it's not stable - it switches most of the time between these two.
> But as long as the xfs one with the page fault is seen it does not
> serve requests, and that one is seen for at least 1-5s, then the poll
> one is visible, and then the xfs one again for 1-5s.
>
> This happens if I do:
> systemctl set-property --runtime varnish.service MemoryHigh=6.5G
>
> If I set:
> systemctl set-property --runtime varnish.service MemoryHigh=14G
>
> I never get the xfs handle_mm_fault one. This is reproducible.
>
>> tell you whether your application is stuck in a single syscall or they
>> are just progressing very slowly (-ttt parameter should give you timing)
>
> Yes, it's still going forward but really, really slowly due to memory
> pressure. memory.pressure of the varnish cgroup shows high values above
> 100 or 200.
>
> I can reproduce the same with rsync or other tasks using memory for
> inodes and dentries. What I don't understand is why the kernel does not
> reclaim memory for the userspace process and drop the cache.
> I can't believe those entries are hot - they must be at least some days
> old, as a fresh process running for a day only consumes about 200MB of
> inode / dentry / page cache.
>
> Greets,
> Stefan
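
A minimal sketch for collecting the stack snapshots with timestamps, as
suggested above, so the two backtraces can be matched against the slow
requests. The varnishd process name and the pgrep selection are
assumptions - any PID of the affected service works, and reading
/proc/<pid>/stack needs root:

  PID=$(pgrep -n varnishd)   # assumption: newest varnishd is the worker
  for i in $(seq 1 30); do   # 30 samples, one every 0.5s
      date +%s.%N            # timestamp to correlate with slow requests
      cat /proc/$PID/stack   # current kernel stack of the process
      sleep 0.5
  done > stacks.log

  # syscall timing as suggested; -ttt prints absolute timestamps with
  # microsecond resolution
  strace -ttt -f -p $PID -o varnish.strace

If the filemap_fault trace dominates the samples taken while requests
hang, and the strace timestamps show ordinary syscalls taking seconds,
that matches the picture of the process slowly refaulting its file pages
rather than being stuck in a single call.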
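
And a sketch for watching the cgroup itself while reproducing, assuming
the same cgroup path as in the memory.stat output above; memory.current,
memory.high, memory.events and memory.pressure are standard cgroup v2
interface files:

  cd /sys/fs/cgroup/system.slice/varnish.service
  while sleep 1; do
      date
      for f in memory.current memory.high memory.events memory.pressure; do
          echo "== $f"; cat "$f"
      done
  done

The "high" counter in memory.events together with the memory.pressure
averages shows directly how much of the time the tasks spend throttled
at the high boundary.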