Hi,

On Tue, Feb 09, 2016 at 05:52:40PM +0100, Andres Freund wrote:
> Hi,
>
> I'm working on fixing long IO stalls with postgres. After some
> architectural changes fixing the worst issues, I noticed that individual
> processes/backends/connections still spend more time waiting than I'd
> expect.
>
> In a workload with the hot data set fitting into memory (2GB of
> mmap(HUGE|ANON) shared memory for postgres buffer cache, ~6GB of
> dataset, 16GB total memory) I found that there's more reads hitting disk
> than I'd expect. That's after I've led Vlastimil on IRC down a wrong
> rabbithole, sorry for that.
>
> Some tinkering and questions later, the issue appears to be postgres'
> journal/WAL. Which in the test-setup is write-only, and only touched
> again when individual segments of the WAL are reused. Which, in the
> configuration I'm using, only happens after ~20min and 30GB later or so.
> Drastically reducing the volume of WAL through some (unsafe)
> configuration options, or forcing the WAL to be written using O_DIRECT,
> changes the workload to be fully cached.
>
> Rik asked me about active/inactive sizing in /proc/meminfo:
> Active:          7860556 kB
> Inactive:        5395644 kB
> Active(anon):    2874936 kB
> Inactive(anon):   432308 kB
> Active(file):    4985620 kB
> Inactive(file):  4963336 kB
>
> and then said:
>
> riel   | the workingset stuff does not appear to be taken into account for active/inactive list sizing, in vmscan.c
> riel   | I suspect we will want to expand the vmscan.c code, to take the workingset stats into account
> riel   | when we re-fault a page that was on the active list before, we want to grow the size of the active list (and
>        | shrink from inactive)
> riel   | when we re-fault a page that was never active, we need to grow the size of the inactive list (and shrink
>        | active)
> riel   | but I don't think we have any bits free in page flags for that, we may need to improvise something :)
>
> andres | Ok, at this point I'm kinda out of my depth here ;)
>
> riel   | andres: basically active & inactive file LRUs are kept at the same size currently
> riel   | andres: which means anything that overflows half of memory will get flushed out of the cache by large write
>        | volumes (to the write-only log)
> riel   | andres: what we should do is dynamically size the active & inactive file lists, depending on which of the two
>        | needs more caching
> riel   | andres: if we never re-use the inactive pages that get flushed out, there's no sense in caching more of them
>        | (and we could dedicate more memory to the active list, instead)

Yes, a generous minimum size of the inactive list made sense when it
was the exclusive staging area to tell use-once pages from use-many
pages. Now that we have refault information to detect use-many with
arbitrary inactive list size, this minimum is no longer reasonable.

The new minimum should be smaller, but big enough for applications to
actually use the data in their pages between fault and eviction (i.e.
it needs to take the aggregate readahead window into account), and big
enough for active pages that are speculatively challenged during
workingset changes to get re-activated without incurring IO.

However, I don't think it makes sense to dynamically adjust the
balance between the active and the inactive cache during refaults.

I assume your thinking here is that when never-active pages are
refaulting during workingset transitions, it's an indication that
the inactive cache needs more slots, to detect use-many without
incurring IO. And hence we should give it some slots from the active
cache.

However, deactivation doesn't give the inactive cache more slots to
use, it just reassigns already occupied cache slots. The only way to
actually increase the number of available inactive cache slots upon
refault would be to reclaim active cache slots.

And that is something we can't do, because we don't know how hot the
incumbent active pages actually are. They could be hotter than the
challenging refault page, they could be colder. So what we are doing
now is putting them next to each other - currently by activating the
refault page, but we could also deactivate the incumbent - and let the
aging machinery pick a winner.

[ We *could* do active list reclaim, but it would cause IO in the case
  where the incumbent workingset is challenged but not defeated. It's
  a trade-off. We just decide how strongly we want to protect the
  incumbent under challenge. ]
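
To make the sizing argument concrete, here is a toy userspace model
of a two-list file cache fed by a hot set plus a write-only stream.
It is only a sketch, not kernel code: ToyPageCache, hot_misses() and
the min_inactive_ratio knob are names made up for illustration, and
the refault check is a simplified stand-in for the workingset logic.
A ratio of 0.5 approximates the "same size" rule Rik describes, a
smaller ratio stands in for the reduced minimum suggested above.

```python
from collections import OrderedDict


class ToyPageCache:
    """Toy active/inactive file cache with refault detection (illustrative only)."""

    def __init__(self, capacity, min_inactive_ratio=0.5):
        self.capacity = capacity
        # Hypothetical knob: minimum share of the cache kept on the inactive list.
        self.min_inactive_ratio = min_inactive_ratio
        self.inactive = OrderedDict()  # oldest page first
        self.active = OrderedDict()    # oldest page first
        self.shadow = {}               # evicted page -> eviction counter at eviction
        self.evictions = 0

    def _inactive_is_low(self):
        total = len(self.inactive) + len(self.active)
        return len(self.inactive) < self.min_inactive_ratio * total

    def _reclaim(self):
        while len(self.inactive) + len(self.active) > self.capacity:
            if self.active and (self._inactive_is_low() or not self.inactive):
                # Deactivation: the page only changes lists, no slot is freed.
                page, _ = self.active.popitem(last=False)
                self.inactive[page] = True
                continue
            # Eviction: drop the oldest inactive page, remember when it left.
            page, _ = self.inactive.popitem(last=False)
            self.shadow[page] = self.evictions
            self.evictions += 1

    def access(self, page):
        """Return "hit" if cached, otherwise "miss" or "refault-activated" (both need IO)."""
        if page in self.active:
            self.active.move_to_end(page)
            return "hit"
        if page in self.inactive:
            # Second touch while cached: treat the page as use-many.
            del self.inactive[page]
            self.active[page] = True
            return "hit"
        stamp = self.shadow.pop(page, None)
        if stamp is not None and self.evictions - stamp <= len(self.active):
            # Short refault distance: place the challenger next to the
            # incumbents and let aging pick a winner.
            self.active[page] = True
            outcome = "refault-activated"
        else:
            self.inactive[page] = True
            outcome = "miss"
        self._reclaim()
        return outcome


def hot_misses(min_inactive_ratio, rounds=20):
    """Count hot-set accesses that needed IO after warmup."""
    cache = ToyPageCache(capacity=8, min_inactive_ratio=min_inactive_ratio)
    hot = [f"hot{i}" for i in range(6)]  # hot set larger than half the cache
    for _ in range(2):                   # warmup: touch twice to activate
        for page in hot:
            cache.access(page)
    misses = 0
    stream = 0
    for _ in range(rounds):
        for page in hot:
            if cache.access(page) != "hit":
                misses += 1
        for _ in range(4):               # write-only log pages, never reused
            cache.access(f"log{stream}")
            stream += 1
    return misses


if __name__ == "__main__":
    print("hot misses, inactive >= 50%:", hot_misses(0.5))
    print("hot misses, inactive >= 25%:", hot_misses(0.25))
```

In this toy, the 50% rule keeps deactivating hot pages into the path
of the log stream, so part of the hot set has to be read back in,
while the 25% minimum leaves the whole hot set on the active list.
Note that _reclaim() only ever evicts from the inactive list, which
is the "protect the incumbent" side of the trade-off in the bracketed
note above.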