Hi,
We're running into problems with the machine running out of memory: a single query process consumes over 100GB of resident memory before the machine exhausts swap and the Linux OOM killer eventually kills it. The query plan from EXPLAIN comes to 186 rows. Assuming each of those rows requires the full work_mem (which should be a significant overestimate of both the number of operations and their size), that comes to under 50GB, yet we're observing substantially more than that. Is it reasonable to expect that a query will take at most roughly work_mem * number of operations, or are there other factors in play?
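To spell out the arithmetic behind that 50GB figure (the 256MB work_mem below is just an illustrative value in the ballpark of our setting; I'm treating each line of the EXPLAIN output as one node that can use up to work_mem):

    SHOW work_mem;
    -- e.g. 256MB; back-of-the-envelope upper bound:
    --   186 plan nodes * 256MB per node ≈ 46.5GB
    -- yet the backend grows past 100GB resident before the OOM killer steps in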
The plan looks reasonable (though there are some odd right join uses, see below) and the row estimates look pretty accurate, with one exception: one of the CTEs has its row count under-estimated by a little over two orders of magnitude (260k estimated vs. 86 million actual). That CTE does a GROUP BY (planned as a Sort followed by a GroupAggregate; there are no hash aggregates in the plan, which is something that might increase memory), and it's the aggregate step that misestimates the final number of rows for that CTE. Unlike the other CTEs, when it's merge joined into the main query there's no Materialize node in the plan (no idea if that's relevant).
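For what it's worth, that CTE is shaped roughly like this (table and column names are placeholders, not the real ones):

    WITH agg AS (
        SELECT some_key, count(*) AS n
        FROM big_table
        GROUP BY some_key   -- planner estimates ~260k groups, we actually get ~86 million
    )
    SELECT ...
    FROM main_table
    JOIN agg ON agg.some_key = main_table.some_key;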
As to the right joins (used for a few of the joins; most are left joins or merge joins):
->  Hash Right Join  (cost=225541299.19..237399743.38 rows=86681834 width=1108)
      Hash Cond: (xxx.xxx = yyy.yyy)
      ->  Seq Scan on xxx  (cost=0.00..6188.18 rows=9941 width=20)
            Filter: (mode = 'live'::text)
      ->  Hash  (cost=212606744.27..212606744.27 rows=86681834 width=1096)
            ....
I'm not sure I'm reading it right, but it looks like it's building the hash table on the 86 million row set and scanning (probing it with) the 10k row set, which seems to me like the opposite of what you'd want. I haven't seen many hash right joins in plans, though, so I'm not sure if that's just how they work.
Tim