On 20.8.2013 18:24, Tom Lane wrote:
> Tomas Vondra <tv@xxxxxxxx> writes:
>> I've managed to get the data to a different machine, and I've spent
>> some time on debugging it.
>
> Great, thanks for looking into it!
>
>> It seems that the difference is in evaluating hashentrysize
>> [ choose_hashed_distinct omits hash_agg_entry_size() ] but the
>> hashentrysize size is 24 (choose_hashed_distinct) or 56
>> (choose_hashed_grouping). This causes that while _distinct
>> evaluates the condition as false, and _grouping as true (and thus
>> returns false).
>
> Hah.
>
>> Now, the difference between 24 and 56 is caused by
>> hash_agg_entry_size. It's called with numAggs=0 but returns 32. I'm
>> wondering if it should return 0 in such cases, i.e. something like
>> this:
>
> No, I don't think so. I'm pretty sure the reason
> choose_hashed_distinct is like that is that I subconsciously assumed
> hash_agg_entry_size would produce zero for numAggs = 0; but in fact
> it does not and should not, because there's still some overhead for
> the per-group hash entry whether or not there's any aggregates. So
> the right fix is that choose_hashed_distinct should add
> hash_agg_entry_size(0) onto its hashentrysize estimate.
>
> A separate issue is that the use of numAggs-1 in
> hash_agg_entry_size's calculations seems a bit risky if numAggs can
> be zero - I'm not sure we can rely on compilers to get that right.
> I'm inclined to replace that with use of offsetof. Likewise in
> build_hash_table.
>
>> I've tested that after this both queries use HashAggregate (which
>> is the right choice), but I haven't done any extensive checking so
>> maybe I'm missing something.
>
> It might be the preferable choice in this example, but you're looking
> at an edge case. If you want the thing to be using a hash aggregate
> for this size of problem, you should increase work_mem.

Hmmm. I think the main 'issue' here is that the two queries behave quite
differently, although it seems like they should do the same thing (well,
I understand they're not actually the same).

We're already using work_mem='800MB', so there's not much room to
increase it. That is probably also the main reason we haven't seen this
issue more often - the other datasets are smaller (though that won't
last long, given the steady growth).

A complete EXPLAIN ANALYZE for the HashAggregate plan is available here:

    http://explain.depesz.com/s/jCO

The estimates seem to be pretty exact, except for the very last step:

    HashAggregate  (cost=1399795.00..1604305.06 rows=20451006 width=8)
                   (actual time=13985.580..14106.708 rows=355600 loops=1)

So the estimate is ~60x higher than the actual value. That happens to
work out for choose_hashed_distinct (because it uses a much lower
hashentrysize), but for choose_hashed_grouping it pushes the estimated
hash table above the work_mem threshold.

But then again, since the actual number of rows is much lower than the
estimate, the hash table fits well within work_mem and causes no
trouble with OOM. So I don't think increasing work_mem is a good
long-term solution here - the main problem is the estimate. Another
sign that I should finally start working on the multi-column indexes
I've been planning for a long time ...

Anyway, I still don't understand why the same logic around
hash_agg_entry_size shouldn't apply to choose_hashed_grouping as well?
Sure, it would make this particular corner case slower, but wouldn't it
be more correct?
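Just to make the numbers concrete, here's a quick standalone sketch of the
arithmetic (this is not the planner code, just the same fits-in-work_mem
comparison done with the per-entry sizes and row counts quoted above; the
planner of course weighs more than this):

#include <stdio.h>

/*
 * Standalone sketch, NOT the PostgreSQL source: just the "does the
 * estimated hash table fit in work_mem?" arithmetic, using the
 * per-entry sizes and row counts from this thread.
 */
int
main(void)
{
    const double work_mem_bytes = 800.0 * 1024 * 1024;  /* work_mem = '800MB' */

    const double est_groups    = 20451006.0;  /* planner's row estimate */
    const double actual_groups = 355600.0;    /* rows actually produced */

    /* per-entry sizes observed while debugging */
    const double entry_distinct = 24.0;  /* choose_hashed_distinct: tuple overhead only */
    const double entry_grouping = 56.0;  /* choose_hashed_grouping: + hash_agg_entry_size() */

    printf("distinct estimate : %.0f MB (fits: %s)\n",
           est_groups * entry_distinct / (1024 * 1024),
           est_groups * entry_distinct <= work_mem_bytes ? "yes" : "no");
    printf("grouping estimate : %.0f MB (fits: %s)\n",
           est_groups * entry_grouping / (1024 * 1024),
           est_groups * entry_grouping <= work_mem_bytes ? "yes" : "no");
    printf("actual usage      : %.0f MB\n",
           actual_groups * entry_grouping / (1024 * 1024));

    return 0;
}

Which is exactly the edge case: with the ~60x overestimate, only the larger
per-entry size crosses the 800MB line (~1090MB vs ~470MB), even though the
hash table that actually gets built is only around 20MB.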
Tomas