On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> I let this run for 3 days now, so it is quite a lot, there you go:
> https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz

The stats show that compaction has very bad results. Between the first
and last snapshot, compact_fail grew by 80k and compact_success by 1300.
High-order allocations will thus cycle between (failing) compaction and
reclaim that removes the buffers/caches from memory. Since dropping slab
caches helps, I suspect it's either the slab pages (which cannot be
migrated for compaction) being spread over all memory, making it
impossible to assemble high-order pages, or some slab objects pinning
file pages, which makes those impossible to migrate as well.

> There is one thing I forgot to mention: the hosts perform find and du
> (I mean the commands, finding files and disk usage) on the HDDs every
> night, starting from 00:20 AM up until 07:45 AM in the morning, for
> maintenance and stats.
>
> During this period the buffers/caches rise again, as you may see from
> the logs, so find/du do fill them. Nevertheless, as the day passes,
> both decrease again until low values are reached. I disabled find/du
> for the night of July 19->20 to compare.
>
> I have to say that this really low usage (300MB/xGB) occurred just
> once after I upgraded from 4.16 to 4.17, not sure why; one can still
> see from the logs that the buffers/cache are not using up the entire
> available RAM.
>
> This low usage occurred the last time on that one host when I
> mentioned that I had to 2>drop_caches again in my previous message, so
> this is still an issue even on the latest kernel.
>
> The other host (the one that was not measured with the vmstat logs)
> currently has 600MB/14GB, with 34GB of free RAM. Both were reset with
> drop_caches at the same time. From the looks of this, the really low
> usage will occur again somewhat shortly; it just did not come up
> during measurement. However, the RAM should be full anyway, true?

Can you provide (a single snapshot of) /proc/pagetypeinfo and
/proc/slabinfo from a system that's currently experiencing the issue,
along with /proc/vmstat and /proc/zoneinfo to verify?

Thanks.
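
P.S. In case it helps collecting the files above, here is a minimal
sketch for grabbing all four in one go; the output file names are just
an example, and reading /proc/slabinfo typically requires root:

  #!/usr/bin/env python3
  # Minimal sketch: copy the requested /proc files into timestamped
  # snapshot files so they can be attached in one go. The naming scheme
  # is only an example, not a required format.
  import shutil
  import time

  stamp = time.strftime("%Y%m%d-%H%M%S")
  for name in ("pagetypeinfo", "slabinfo", "vmstat", "zoneinfo"):
      # /proc/slabinfo is usually readable only by root
      shutil.copyfile(f"/proc/{name}", f"{name}-{stamp}.txt")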
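
P.P.S. For reference, the compaction deltas mentioned above can be
computed from the vmstat snapshots roughly like this; a minimal sketch,
assuming each snapshot is a plain copy of /proc/vmstat and that sorting
the file names gives chronological order:

  #!/usr/bin/env python3
  # Minimal sketch: diff compaction counters between the first and last
  # /proc/vmstat snapshot passed on the command line.
  import sys
  from pathlib import Path

  def parse(path):
      # /proc/vmstat is one "name value" pair per line
      return {name: int(value) for name, value in
              (line.split() for line in Path(path).read_text().splitlines())}

  if len(sys.argv) < 3:
      sys.exit("usage: vmstat_delta.py <first-snapshot> ... <last-snapshot>")

  files = sorted(sys.argv[1:])
  first, last = parse(files[0]), parse(files[-1])

  for key in ("compact_stall", "compact_fail", "compact_success",
              "compact_migrate_scanned", "compact_free_scanned"):
      if key in first and key in last:
          print(f"{key}: {last[key] - first[key]}")

Run it e.g. as "python3 vmstat_delta.py vmstat-*.txt" (script name and
file pattern are placeholders for whatever the snapshots are called).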