Resending for lists which dropped my mail due to attachments. Sorry.

plots: https://nofile.io/f/ogwbrwhwBU7/plots.tar.bz2

R script:

files <- Sys.glob("vmstat.1*")
# the first file provides the row names (counter names); the remaining
# files contribute only their value column (V2)
results <- read.table(files[1], row.names=1)
for (file in files[-1]) {
    tmp2 <- read.table(file)$V2
    results <- cbind(results, tmp2)
}
# one plot per counter, values over snapshots
dir.create("plots", showWarnings=FALSE)
for (row in row.names(results)) {
    png(paste("plots/", row, ".png", sep=""), width=1900, height=1150)
    plot(t(as.vector(results[row,])), main=row)
    dev.off()
}

On 10/22/18 3:19 AM, Marinko Catovic wrote:
> On Wed, 29 Aug 2018 at 18:44, Marinko Catovic
> <marinko.catovic@xxxxxxxxx> wrote:
>>
>>
>>>> one host is at a healthy state right now, I'd run that over there immediately.
>>>
>>> Let's see what we can get from here.
>>
>>
>> oh well, that went fast. Actually, despite low values for buffers (around 100MB)
>> with caches around 20G or so, the performance was nevertheless super-low; I really
>> had to drop the caches just now. This is the first time I have seen it happen with
>> caches >10G, but hopefully this also provides a clue for you.
>>
>> Just after starting the stats I reset defrag from the previous "defer" back to
>> "madvise" - I suspect that this somehow caused the rapid reaction, since a few
>> minutes later I saw the free RAM jump from 5GB to 10GB. After that I went afk,
>> returning to the PC when my monitoring systems went crazy telling me about downtime.
>>
>> If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its
>> default, after it had been on defer for days, was a mistake, then please tell me.
>>
>> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
>> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
>>
>
> There we go again.
>
> First of all, I had set up this monitoring on 1 host; as a matter of fact it did
> not occur on that single one for days and weeks, so I set it up again on all the
> hosts, and it just happened again on another one.
>
> This issue is far from over, even when upgrading to the latest 4.18.12
>
> https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

I have plotted the vmstat data using the attached script and got the
attached plots. The X axis is the vmstat snapshots, almost 14k of them,
one every 5 seconds, so almost 19 hours.

I can see the following phases:

0 - 2000:
- free memory (nr_free_pages) drops from 48GB to the minimum allowed by
  the watermarks
- page cache (nr_file_pages) grows correspondingly

2000 - 6000:
- reclaimable slab (nr_slab_reclaimable) grows up to 40GB; unreclaimable
  slab shows the same trend, but at a much smaller scale
- page cache shrinks correspondingly
- free memory remains at the minimum

6000 - 12000:
- slab usage slowly declines
- page cache slowly grows, but with hiccups
- free pages stay at the minimum, growing after 9000, oscillating
  between 10000 and 12000

12000 - end:
- free pages grow sharply
- page cache declines sharply
- slab still slowly declining

I guess the original problem manifests in the last phase. There might be
a secondary issue with the slab usage between 2000 and 6000, but it
doesn't seem immediately connected (?).

I can see that compaction activity (but not success) increased a lot in
the last phase, while direct reclaim is steady from 2000 onwards. This
would again suggest high-order allocations. THP doesn't seem to be the
cause.

Vlastimil
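For completeness: the R script above globs one snapshot file per sample,
named vmstat.<epoch-seconds>. The thread does not show how those files
were captured, so the following is only a plausible reconstruction (the
5-second interval matches the analysis above; the function name and file
naming are assumptions):

```python
# Periodically snapshot /proc/vmstat into epoch-stamped files named
# vmstat.<seconds>, which is what Sys.glob("vmstat.1*") in the R script
# suggests. Assumed reconstruction -- not how the data was actually taken.
import shutil
import time

def capture_vmstat(count, interval=5, src="/proc/vmstat"):
    """Take `count` snapshots of `src`, `interval` seconds apart."""
    names = []
    for i in range(count):
        name = "vmstat.%d" % int(time.time())
        shutil.copyfile(src, name)  # a plain copy; /proc reads are cheap
        names.append(name)
        if i + 1 < count:  # no trailing sleep after the last snapshot
            time.sleep(interval)
    return names
```

Feeding the resulting files to the R script then yields one plot per
counter, as in the attached tarball.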
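The "compaction activity but not success" reading above comes from
comparing how counters such as compact_stall and compact_success move
between snapshots. A minimal sketch of that kind of check; the two
snapshots below are made-up sample values, not from the actual capture:

```python
# Diff two /proc/vmstat-style snapshots; made-up sample data, chosen only
# to illustrate the pattern (many compaction attempts, few successes).
SNAP_EARLY = """\
nr_free_pages 12000000
nr_file_pages 1000000
compact_stall 10
compact_success 10
"""

SNAP_LATE = """\
nr_free_pages 500000
nr_file_pages 11000000
compact_stall 500
compact_success 12
"""

def parse_vmstat(text):
    """Parse 'name value' lines into a dict of counter values."""
    return {name: int(value)
            for name, value in (line.split() for line in text.splitlines())}

def vmstat_delta(before, after):
    """Per-counter change between two snapshots (shared counters only)."""
    a, b = parse_vmstat(before), parse_vmstat(after)
    return {k: b[k] - a[k] for k in a if k in b}

delta = vmstat_delta(SNAP_EARLY, SNAP_LATE)
# Many compaction stalls with almost no successes would point at failing
# high-order allocations, as described above.
print(delta["compact_stall"], delta["compact_success"])  # prints: 490 2
```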