Re: Hash Anti Join performance degradation

panam <panam@xxxxxxx> · Thu, 26 May 2011 09:08:07 -0700 (PDT)

Hi all,

CÃdric Villemain-3 wrote:
> 
> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
> 
> SELECT  b.id,
>                   max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
> 
> looks similar and fastest.
> 
I actually did use a similar strategy in the meantime (during my problem
with the "left join" query we are talking about here all the time) for
mitigation.
It was
SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
and it performed worse in comparison to the "left join" query in the general
case (i.e. before my problems began).
At the end of this post is an explanation why I think I cannot use the
solution you suggested above.

Kevin Grittner wrote:
> 
>  Each connection can allocate work_mem, potentially several times. 
> On a machines without hundreds of GB of RAM, that pair of settings
> could cause severe swapping.
> 
Indeed, thanks for the warning. These settings are not for production but to
exclude a performance degradation because of small cache sizes.

Kevin Grittner wrote:
> 
> I think you would need a left join to actually get identical
> results:
>  
> SELECT  b.id, max(m.id)
>   FROM box b
>   LEFT JOIN message m ON m.box_id = b.id
>   GROUP BY b.id;
> 
> But yeah, I would expect this approach to be much faster.  Rather
> easier to understand and harder to get wrong, too.
>  
> 
Correct, it is much faster, even with unclustered ids.
However, I think I cannot use it because of the way that query is generated
(by hibernate).
The (simplyfied) base query is just

SELECT b.id from box

the subquery

(SELECT  m1.id FROM message m1 
   LEFT JOIN message m2 
      ON (m1.box_id = m2.box_id  AND m1.id < m2.id ) 
   WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId

is due to a hibernate formula (containing more or less plain SQL) to
determine the last message id for that box. It ought to return just one row,
not multiple. So I am constrained to the subquery in all optimization
attemps (I cannot combine them as you did), at least I do not see how. If
you have an idea for a more performant subquery though, let me know, as this
can easily be replaced.

Thanks for your help and suggestions
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429125.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance