On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@xxxxxxx> wrote:
> I can't help but see this as a bug. I see many issues:
>
> * postgres is not reading from partitions in parallel, but one after the
> other. It shouldn't need all this memory simultaneously.

I'm not sure about Hash Aggregate, but for nodes like Sort, we still hold
onto the tuplestore after returning the last tuple, as a rescan might want
to read those tuples again. There's also mark/restore, which might need to
rewind a little to match up with the next outer tuple of a Merge Join (the
first sketch at the end of this message shows a plan shape where that
matters).

It might be possible to release the memory sooner in plans where returning
the final tuple means we'll never need it again, but that would require
identifying all the cases where that can happen and ensuring we never
release memory that is still required.

> * The memory is unnecessarily allocated early on, before any partitions
> are actually aggregated. I know this because I/O is slow on this device
> and the table sizes are huge; it's simply not possible that postgres
> went through all partitions and blew up the memory. That would take
> hours, but the OOM happens seconds after I start the query.

That's interesting. Certainly, some memory is allocated during executor
startup, but that amount should be fairly small. Are you able to provide a
self-contained test case that shows the memory blowing up before execution
begins? Something along the lines of the second sketch at the end of this
message would be ideal.

> Having wasted a long time on that, the minimum I can do is submit a
> documentation patch. At enable_partitionwise_aggregate, something like
> "WARNING: it can increase the memory usage by at least
> n_partitions * work_mem". How do I move forward with such a patch?
> Pointers would be appreciated. :-)

I think mentioning something about this under enable_partitionwise_join
and enable_partitionwise_aggregate is probably wise. I'll propose a patch
on pgsql-hackers.

David
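
P.S. For illustration, here is a minimal sketch of the Merge Join case
mentioned above. All table names are invented, and the exact plan shape
depends on costs and settings; disabling the other join methods just
nudges the planner towards a Merge Join over two Sorts:

    -- Two small tables joined on a sortable key. The outer side has
    -- duplicate join keys, so the Merge Join will actually rewind.
    CREATE TABLE mj_outer (id int, val text);
    CREATE TABLE mj_inner (id int, val text);
    INSERT INTO mj_outer SELECT g % 100, 'o' || g FROM generate_series(1, 10000) g;
    INSERT INTO mj_inner SELECT g,       'i' || g FROM generate_series(1, 10000) g;
    ANALYZE mj_outer, mj_inner;

    SET enable_hashjoin = off;
    SET enable_nestloop = off;

    -- A Merge Join needs its inner side to support mark/restore so it
    -- can rewind when consecutive outer tuples share a join key; the
    -- inner Sort node provides that, which is one reason a Sort holds
    -- onto its tuples after producing them.
    EXPLAIN SELECT * FROM mj_outer o JOIN mj_inner i ON o.id = i.id;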
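
And a sketch of the sort of self-contained test case that would help with
the early-allocation question. Again, names and sizes are placeholders;
the idea is many partitions, partitionwise aggregation enabled, and the
backend's memory watched from the OS (e.g. with top) before the first row
comes back:

    -- A hash-partitioned table with enough partitions to make any
    -- per-partition memory use visible.
    CREATE TABLE pa_t (id int, payload text) PARTITION BY HASH (id);
    DO $$
    BEGIN
      FOR i IN 0..31 LOOP
        EXECUTE format('CREATE TABLE pa_t_%s PARTITION OF pa_t
                        FOR VALUES WITH (MODULUS 32, REMAINDER %s)', i, i);
      END LOOP;
    END $$;

    INSERT INTO pa_t SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) g;
    ANALYZE pa_t;

    SET enable_partitionwise_aggregate = on;
    SET work_mem = '64MB';

    -- Each partition can get its own aggregate node, so the query may
    -- use up to roughly n_partitions * work_mem rather than a single
    -- work_mem.
    EXPLAIN (ANALYZE, COSTS OFF) SELECT id, count(*) FROM pa_t GROUP BY id;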