Thank you Dave. I've opened an SR with GP and see if they have any good suggestion on changing the plan.
Thanks, Suya From: David Rowley [dgrowleyml@xxxxxxxxx]
Sent: Tuesday, October 28, 2014 6:06 PM To: Huang, Suya Cc: pgsql-performance@xxxxxxxxxxxxxx Subject: Re: unnecessary sort in the execution plan when doing group by On Tue, Oct 28, 2014 at 7:26 PM, Huang, Suya
<Suya.Huang@xxxxxxxxxxxxxxx> wrote:
Likely this is the wrong place to ask for help. The plan output that you've pasted below looks very different to PostgreSQL's EXPLAIN output.
QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Gather Motion 24:1 (slice2; segments: 24) (cost=31286842.08..31287447.81 rows=1683 width=536) Rows out: 15380160 rows at destination with 14860 ms to first row, 23856 ms to end, start offset by 104 ms. -> HashAggregate (cost=31286842.08..31287447.81 rows=1683 width=536) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Gather Motion 24:1 (slice2; segments: 24) (cost=152269717.33..157009763.41 rows=1196982 width=568) Rows out: 15380160 rows at destination with 35320 ms to first row, 70091 ms to end, start offset by 102 ms. -> GroupAggregate (cost=152269717.33..157009763.41 rows=1196982 width=568) Most likely the reason you're getting the difference in plan is because the planner is probably decided that there will be too many hash entries for a hash table based on the 3 grouping columns... Look at the estimates, 1683 with
2 columns and 1196982 with the 3 columns. If those estimates turned out to be true, then the hash table for 3 columns will be massively bigger than it would be with 2 columns. With PostgreSQL you might see the plan changing if you increased the work_mem
setting. For greenplum, I've no idea if that's the same.
Databases are often not very good at knowing with the number of distinct values would be over more than 1 column. Certain databases have solved this with multi column statistics, but PostgreSQL does not have these. Although I just noticed last night that
someone is working on them.
Regards
David Rowley
|