On Tue, Jul 9, 2019 at 4:44 AM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote: > [...many partial GroupAggregate/Sort repeated here for ~200 child tables...] > > This isn't the first time I've had to reduce work_mem on a parallel query to > avoid OOM, but it seems unreasonable to be concerned with 50MB sorts. This is a fundamental problem with our model of memory budgets. To oversimplify massively, we can use about work_mem * ntables * the number of parallel workers. Core counts are going up, and now we have a convenient way to get large values of ntables. One of many -hackers threads to discuss the problem: https://www.postgresql.org/message-id/flat/CAH2-WzmNwV%3DLfDRXPsmCqgmm91mp%3D2b4FvXNF%3DcCvMrb8YFLfQ%40mail.gmail.com > It looks like the results of each Sort node stay in RAM, during processing of > additional sort nodes (is that required?) That's a very interesting question linked to whole-query memory budgets. If you don't have a whole-query memory budget, then you have nothing to lose by keeping hash and sort results in memory, and you gain the ability to do cheap rescans (if the node happens to be in a nest loop). I discussed this in the context of hash joins over here: https://www.postgresql.org/message-id/CAEepm%3D0N6DODN7nx6Zb93YOW-y%3DRftNNFZJRaLyG6jbJHJVjsA%40mail.gmail.com The TL;DR is that some other databases throw out eg hash tables as soon as possible, and consider changing the shape of their join nesting in order to minimise the total number of hash tables in memory at once, in order to come in under a certain budget for memory used at any one point in time. That is, they don't just divide query_work_mem up over all the nodes, they understand things about when nodes will allocate and free memory. There be some less clever things we can do, though, before we tackle the big problems involved. We could probably still opportunistically give back memory sooner, when we know there is no chance of rescan, and other things along those lines. > Thanks in advance for any advice. I think it's impossible to choose a single value for work_mem if you have a mixture of types of queries that hit wildly different numbers of partitions and workers. I think this is an ongoing topic for -hackers. -- Thomas Munro https://enterprisedb.com