Hi,
We're running into problems with the machine running out of memory: a single query process consumes over 100GB of resident memory before the machine exhausts swap and the Linux OOM killer eventually kills it. The query plan from EXPLAIN comes to 186 rows. Assuming each of those rows requires the full work_mem (which should be a significant overestimate of both the number of operations and their size), that comes to under 50GB, yet we're observing substantially more than that. Is it reasonable to expect that a query will take at most roughly work_mem * number of operations, or are there other factors in play?
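To spell out the arithmetic behind that 50GB figure (the 256MB work_mem below is just an illustrative value in the ballpark of our setting; I'm treating each line of the EXPLAIN output as one node that can use up to work_mem):

    SHOW work_mem;
    -- e.g. 256MB; back-of-the-envelope upper bound:
    --   186 plan nodes * 256MB per node ≈ 46.5GB
    -- yet the backend grows past 100GB resident before the OOM killer steps in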
The plan looks reasonable (though there are some odd right join uses, see below) and the row estimates look pretty accurate, with one exception: one of the CTEs has its row count under-estimated by a little over two orders of magnitude (260k estimated vs. 86 million actual). That CTE does a GROUP BY (planned as a Sort followed by a GroupAggregate; there are no hash aggregates in the plan, which is something that might increase memory), and it's the aggregate step that misestimates the final number of rows for that CTE. Unlike the other CTEs, when it's merge joined into the main query there's no Materialize node in the plan (no idea if that's relevant).
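For what it's worth, that CTE is shaped roughly like this (table and column names are placeholders, not the real ones):

    WITH agg AS (
        SELECT some_key, count(*) AS n
        FROM big_table
        GROUP BY some_key   -- planner estimates ~260k groups, we actually get ~86 million
    )
    SELECT ...
    FROM main_table
    JOIN agg ON agg.some_key = main_table.some_key;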
As to the right joins (used for a few of the joins; most are left joins or merge joins):
->  Hash Right Join  (cost=225541299.19..237399743.38 rows=86681834 width=1108)
      Hash Cond: (xxx.xxx = yyy.yyy)
      ->  Seq Scan on xxx  (cost=0.00..6188.18 rows=9941 width=20)
            Filter: (mode = 'live'::text)
      ->  Hash  (cost=212606744.27..212606744.27 rows=86681834 width=1096)
            ....
I'm not sure I'm reading it right, but it looks like it's building the hash table on the 86 million row set and scanning (probing it with) the 10k row set, which seems to me like the opposite of what you'd want. I haven't seen many hash right joins in plans, though, so I'm not sure if that's just how they work.
Tim