Re: Hash Anti Join performance degradation

Cédric Villemain <cedric.villemain.debian@xxxxxxxxx> · Thu, 26 May 2011 20:13:22 +0200

2011/5/26 panam <panam@xxxxxxx>:
> Hi all,
>
>
> Cédric Villemain-3 wrote:
>>
>> without explaining further why the antijoin has bad performance
>> without cluster, I wonder why you don't use this query :
>>
>> SELECT  b.id,
>>                   max(m.id)
>> FROM box b, message m
>> WHERE m.box_id = b.id
>> GROUP BY b.id;
>>
>> looks similar and fastest.
>>
> I actually did use a similar strategy in the meantime (during my problem
> with the "left join" query we are talking about here all the time) for
> mitigation.
> It was
> SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
> and it performed worse in comparison to the "left join" query in the general
> case (i.e. before my problems began).
> At the end of this post is an explanation why I think I cannot use the
> solution you suggested above.
>
>
> Kevin Grittner wrote:
>>
>>  Each connection can allocate work_mem, potentially several times.
>> On a machines without hundreds of GB of RAM, that pair of settings
>> could cause severe swapping.
>>
> Indeed, thanks for the warning. These settings are not for production but to
> exclude a performance degradation because of small cache sizes.
>
>
> Kevin Grittner wrote:
>>
>> I think you would need a left join to actually get identical
>> results:
>>
>> SELECT  b.id, max(m.id)
>>   FROM box b
>>   LEFT JOIN message m ON m.box_id = b.id
>>   GROUP BY b.id;
>>
>> But yeah, I would expect this approach to be much faster.  Rather
>> easier to understand and harder to get wrong, too.
>>
>>
> Correct, it is much faster, even with unclustered ids.
> However, I think I cannot use it because of the way that query is generated
> (by hibernate).
> The (simplyfied) base query is just
>
> SELECT b.id from box
>
> the subquery
>
> (SELECT  m1.id FROM message m1
>   LEFT JOIN message m2
>      ON (m1.box_id = m2.box_id  AND m1.id < m2.id )
>   WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId
>
> is due to a hibernate formula (containing more or less plain SQL) to
> determine the last message id for that box. It ought to return just one row,
> not multiple. So I am constrained to the subquery in all optimization
> attemps (I cannot combine them as you did), at least I do not see how. If
> you have an idea for a more performant subquery though, let me know, as this
> can easily be replaced.

In production, if you have a decent IO system, you can lower
random_page_cost and it may be faster using index (by default, with
the use case you provided it choose a seqscan). It can be a bit tricky
if you have to lower random_page_cost so much that it destroy others
query plan but increase the perf of the current one. if it happens,
post again :) (sometime need to change other cost parameters but it
needs to be handle with care)

I am not an hibernate expert, but I'll surprised if you can not drive
hibernate to do what you want.

>
> Thanks for your help and suggestions
> panam
>
> --
> View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429125.html
> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/ ;    PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance