Re: Performance regression from 8.3.7 to 9.0.3

Merlin Moncure <mmoncure@xxxxxxxxx> · Mon, 14 Mar 2011 12:34:31 -0500



On Mon, Mar 14, 2011 at 10:54 AM, Claudio Freire <klaussfreire@xxxxxxxxx> wrote:
> Nothing?
>
> No ideas?
>
> Did I forget to include some useful bit?
>
> On Fri, Mar 4, 2011 at 8:22 PM, Claudio Freire <klaussfreire@xxxxxxxxx> wrote:
>> Hello, first post to this list.
>>
>> I have this query that ran in milliseconds in postgres 8.3.7 (usually 50,
>> 100ms), and now it takes a full 10 minutes to complete.
>>
>> I tracked the problem to the usage of hash aggregates to resolve EXISTS
>> clauses in conjunction with large IN clauses, which seem to reverse the
>> execution plan - in 8.3.7, it would use indices to fetch the rows from the
>> IN, then compute the exists with a nested loop, never doing big sequential
>> scans. In 9.0.3, it computes the set of applicable entries with a hash
>> aggregate, but in order to do that it performs a huge index scan - no
>> sequential scans either, but the big index scan is worse.
>>
>> 9.0.3 always misses the estimate of how many rows will come out the hash
>> aggregate, always estimating 200, while in fact the real count is more like
>> 300.000. I've tried increasing statistics in all the columns involved, up to
>> 4000 for each, to the point where it accurately estimates the input to the
>> hash agg, but the output is always estimated to be 200 rows.
>>
>> Rewriting the query to use 0 < (select count(*)..) instead of EXISTS (select
>> * ..) does revert to the old postgres 8.3 plan, although intuitively I would
>> think it to be sub-optimal.
>>
>> The tables in question receive many updates, but never in such a volume as
>> to create enough bloat - plus, the tests I've been running are on a
>> pre-production server without much traffic (so not many updates - probably
>> none in weeks).
>>
>> The server is a Core 2 E7400 dual core with 4GB of ram running linux and a
>> pg 9.0.3 / 8.3.7 (both there, doing migration testing) built from source.
>> Quite smaller than our production server, but I've tested the issue on
>> higher-end hardware and it produces the same results.
>>
>> Any ideas as to how to work around this issue?
>>
>> I can't plug the select count() version everywhere, since I won't be using
>> this form of the query every time (it's generated programatically with an
>> ORM), and some forms of it perform incredibly worse with the select count().
>>
>> Also, any help I can provide to fix it upstream I'll be glad to - I believe
>> (I would have to check) I can even create a dump of the tables (stripping
>> sensitive info of course) - only, well, you'll see the size below - a tad
>> big to be mailing it ;-)
>>
>> pg 9.0 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 512M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>> pg 8.3.7 is configured with:
>>
>> work_mem = 64M
>> shared_buffers = 100M
>> temp_buffers = 64M
>> effective_cache_size = 128M
>>
>>
>> The query in question:
>>
>>> SELECT member_statistics.member_id
>>>         FROM member_statistics
>>>         WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
>>> (SELECT mat1.tag_id
>>>         FROM member_all_tags_v AS mat1
>>>         WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
>>> IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))

hm the regression in and of itself is interesting, but I wonder if you
can get past your issue like this:

SELECT member_statistics.member_id
         FROM member_statistics
         WHERE member_statistics.member_id IN ( <<400 ids>> ) AND (EXISTS
 (SELECT mat1.tag_id
         FROM member_all_tags_v AS mat1
         WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
 IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity >= 90))

changes to:

SELECT member_statistics.member_id
         FROM member_statistics
         WHERE EXISTS
         (
           SELECT mat1.tag_id
           FROM member_all_tags_v AS mat1
           WHERE mat1.member_id = member_statistics.member_id
             AND mat1.tag_id
             IN (640, 641, 3637, 3638, 637, 638, 639) AND
mat1.polarity >= 90
             AND mat1.member_id  IN ( <<400 ids>> )
         )

also, always try to compare vs straight join version:


SELECT member_statistics.member_id
         FROM member_statistics
         JOIN VALUES ( <<400 ids>> ) q(member_id) using (member_id)
         JOIN
         (
           SELECT mat1.member_id
           FROM member_all_tags_v AS mat1
           WHERE mat1.tag_id  IN (640, 641, 3637, 3638, 637, 638, 639)
             AND mat1.polarity >= 90) p
            USING(member_id)
          ) p using(member_id);

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance