Hi,

I'm working on fixing long IO stalls with postgres. After some architectural changes that fixed the worst issues, I noticed that individual processes/backends/connections still spend more time waiting than I'd expect.

In a workload where the hot data set fits into memory (2GB of mmap(HUGE|ANON) shared memory for postgres' buffer cache, ~6GB of data set, 16GB total memory) I found that more reads hit disk than I'd expect. (That's after I led Vlastimil on IRC down a wrong rabbit hole, sorry for that.)

Some tinkering and questions later, the issue appears to be postgres' journal/WAL. In the test setup it is write-only, and only touched again when individual segments of the WAL are reused, which, in the configuration I'm using, only happens after ~20 minutes and ~30GB of writes. Drastically reducing the volume of WAL through some (unsafe) configuration options, or forcing the WAL to be written using O_DIRECT, changes the workload to be fully cached.

Rik asked me about active/inactive sizing in /proc/meminfo:

Active:          7860556 kB
Inactive:        5395644 kB
Active(anon):    2874936 kB
Inactive(anon):   432308 kB
Active(file):    4985620 kB
Inactive(file):  4963336 kB

and then said:

riel   | the workingset stuff does not appear to be taken into account for active/inactive list sizing, in vmscan.c
riel   | I suspect we will want to expand the vmscan.c code, to take the workingset stats into account
riel   | when we re-fault a page that was on the active list before, we want to grow the size of the active list (and shrink from inactive)
riel   | when we re-fault a page that was never active, we need to grow the size of the inactive list (and shrink active)
riel   | but I don't think we have any bits free in page flags for that, we may need to improvise something :)
andres | Ok, at this point I'm kinda out of my depth here ;)
riel   | andres: basically active & inactive file LRUs are kept at the same size currently
riel   | andres: which means anything that overflows half of memory will get flushed out of the cache by large write volumes (to the write-only log)
riel   | andres: what we should do is dynamically size the active & inactive file lists, depending on which of the two needs more caching
riel   | andres: if we never re-use the inactive pages that get flushed out, there's no sense in caching more of them (and we could dedicate more memory to the active list, instead)
andres | Sounds sensible. I guess things get really tricky if there's a portion of the inactive list that does get reused (say if the hot data set is larger than memory), and another portion doesn't get reused at all.

I promised to send an email about the issue... I can provide you with a branch of postgres plus instructions to reproduce the issue, or I can test patches, whatever you prefer.

This test was run using 4.5.0-rc2, but I doubt this is a recent regression or such.

Any other information I can provide you with?

Regards,

Andres
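
P.S. In case it helps to see what the O_DIRECT workaround amounts to without digging through the postgres branch: below is a minimal standalone sketch of an aligned direct write, not the actual postgres WAL code path. The file name and the 8KB write size are just placeholders.

/* Minimal sketch of an O_DIRECT write; not postgres code.
 * O_DIRECT requires buffer, offset and length to be suitably aligned,
 * hence posix_memalign() and the page-sized alignment below. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 8192;          /* placeholder block size */
    void *buf;

    if (posix_memalign(&buf, 4096, blksz) != 0)
        return 1;
    memset(buf, 0, blksz);

    int fd = open("walfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The write bypasses the page cache, so the (write-only) log data
     * never competes with the hot data set for memory. */
    if (pwrite(fd, buf, blksz, 0) != (ssize_t) blksz)
        perror("pwrite");

    close(fd);
    free(buf);
    return 0;
}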
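P.P.S. Mostly to check my own understanding of Rik's suggestion, the heuristic he describes seems to boil down to something like the sketch below. The struct and helper names (was_active, grow_active_list(), grow_inactive_list()) are made up purely for illustration and don't exist in vmscan.c; as Rik notes, the real thing would also need somewhere to record "was active" since there are no free page flags.

/*
 * Pseudo-code sketch of the refault-driven balancing described above;
 * all identifiers here are hypothetical, they only name the idea.
 */
struct refault_info {
    int was_active;     /* page had reached the active list before eviction */
};

void on_refault(struct refault_info *info)
{
    if (info->was_active) {
        /*
         * A page we already judged hot was evicted and came back:
         * the active list is too small, so give it more memory at
         * the expense of the inactive list.
         */
        grow_active_list();
    } else {
        /*
         * A page that never made it to the active list is being
         * re-read: the inactive list is too small to let pages
         * prove themselves, so grow it instead.
         */
        grow_inactive_list();
    }

    /*
     * Pages evicted from inactive and never touched again (like the
     * write-only WAL segments) trigger neither branch, so they stop
     * pushing the hot data set out of the cache.
     */
}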