On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> I let this run for 3 days now, so it is quite a lot, there you go:
> https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz

The stats show that compaction has very bad results. Between the first
and last snapshot, compact_fail grew by 80k and compact_success by 1300.
High-order allocations will thus cycle between (failing) compaction and
reclaim that removes the buffers/caches from memory. Since dropping slab
caches helps, I suspect it's either the slab pages (which cannot be
migrated for compaction) being spread over all memory, making it
impossible to assemble high-order pages, or some slab objects pinning
file pages, which makes those impossible to migrate as well.

> There is one thing I forgot to mention: the hosts perform find and du
> (I mean the commands, finding files and disk usage) on the HDDs every
> night, starting from 00:20 AM up until 07:45 AM in the morning, for
> maintenance and stats.
>
> During this period the buffers/caches rise again, as you may see from
> the logs, so find/du do fill them. Nevertheless, as the day passes,
> both decrease again until low values are reached. I disabled find/du
> for the night of July 19->20 to compare.
>
> I have to say that this really low usage (300MB/xGB) occurred just
> once after I upgraded from 4.16 to 4.17, not sure why; one can still
> see from the logs that the buffers/cache are not using up the entire
> available RAM.
>
> This low usage occurred the last time on that one host when I
> mentioned that I had to 2>drop_caches again in my previous message, so
> this is still an issue even on the latest kernel.
>
> The other host (the one that was not measured with the vmstat logs)
> currently has 600MB/14GB, with 34GB of free RAM. Both were reset with
> drop_caches at the same time. From the looks of this, the really low
> usage will occur again somewhat shortly; it just did not come up
> during measurement. However, the RAM should be full anyway, true?

Can you provide (a single snapshot of) /proc/pagetypeinfo and
/proc/slabinfo from a system that's currently experiencing the issue,
along with /proc/vmstat and /proc/zoneinfo to verify?

Thanks.
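
P.S. In case it helps collecting the files above, here is a minimal
sketch for grabbing all four in one go; the output file names are just
an example, and reading /proc/slabinfo typically requires root:

  #!/usr/bin/env python3
  # Minimal sketch: copy the requested /proc files into timestamped
  # snapshot files so they can be attached in one go. The naming scheme
  # is only an example, not a required format.
  import shutil
  import time

  stamp = time.strftime("%Y%m%d-%H%M%S")
  for name in ("pagetypeinfo", "slabinfo", "vmstat", "zoneinfo"):
      # /proc/slabinfo is usually readable only by root
      shutil.copyfile(f"/proc/{name}", f"{name}-{stamp}.txt")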
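
P.P.S. For reference, the compaction deltas mentioned above can be
computed from the vmstat snapshots roughly like this; a minimal sketch,
assuming each snapshot is a plain copy of /proc/vmstat and that sorting
the file names gives chronological order:

  #!/usr/bin/env python3
  # Minimal sketch: diff compaction counters between the first and last
  # /proc/vmstat snapshot passed on the command line.
  import sys
  from pathlib import Path

  def parse(path):
      # /proc/vmstat is one "name value" pair per line
      return {name: int(value) for name, value in
              (line.split() for line in Path(path).read_text().splitlines())}

  if len(sys.argv) < 3:
      sys.exit("usage: vmstat_delta.py <first-snapshot> ... <last-snapshot>")

  files = sorted(sys.argv[1:])
  first, last = parse(files[0]), parse(files[-1])

  for key in ("compact_stall", "compact_fail", "compact_success",
              "compact_migrate_scanned", "compact_free_scanned"):
      if key in first and key in last:
          print(f"{key}: {last[key] - first[key]}")

Run it e.g. as "python3 vmstat_delta.py vmstat-*.txt" (script name and
file pattern are placeholders for whatever the snapshots are called).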