Search Postgresql Archives

Re: Problem search on text arrays, using the overlaps (&&) operator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




----- "nha" <lyondif02@xxxxxxx> wrote:

> From: "nha" <lyondif02@xxxxxxx>
> To: "John Cheng" <jlcheng@xxxxxxxxx>
> Cc: pgsql-general@xxxxxxxxxxxxxx
> Sent: Monday, July 6, 2009 9:12:22 AM GMT -08:00 US/Canada Pacific
> Subject: Re:  Problem search on text arrays, using the overlaps (&&) operator
>
> Hello,
> 
> With your exhaustive example statements based on table foo and cars, I
> 
> performed some measures on my side (PostgreSQL 8.3.1 server). Here are
> 
> some statistical results:
> 
[ ... snipped ... ]
> 
> In my opinion, analysis and optimization may be deepen over table 
> indexes used for join planning. As your reported query plans show, the
> 
> Where clauses are performed independantly from the table ml_lead; the
> 
> reason is that all the attributes of the clauses belong to the table 
> lead_reporting_data. Time may be reduced on join condition
> achievements.
> 
> Hoping this observation will contribute a little to your opinion.
> 
> Without any claim, I attached a document to this email for details on
> 
> the measures I took with the overlap operator -- OpenDocument 
> Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various" 
> presents the detailed measures related to the data reported in this
> email.
> 
> Regards.
> 
> --
> nha / Lyon / France.

Hi nha,

I had not expected anyone to go to such lengths to evaluate my
situation, thank you so much!

After looking at your analysis, I realized that the test case I
created isn't close enough to the queries running in our prod
environment. For one, table 'foo' does not join to another table; The
other thing is that the amount of data isn't the same; Finally, these
tables have been ANALYZED.

So I took some time to update the test case. On our server, running
8.3.6, I was able to reproduce the difference between the two styles:
"arr&&{f,b}" and "arr&&{f} or arr&&{b}".

First, the setup:

-- Begin test case
-- Sets up 'bar'
SELECT id INTO TEMP TABLE bar FROM (SELECT generate_series(1,300000) as id) AS bar;
CREATE INDEX bar_idx ON bar (id);
ANALYZE bar;
-- Sets up 'foo'
CREATE TEMP SEQUENCE foo_bar_id_seq;
CREATE TEMP TABLE foo (
    bar_id numeric DEFAULT NEXTVAL('foo_bar_id_seq'),
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);
INSERT INTO foo (keywords) VALUES ('{ford}'::text[]);
INSERT INTO foo (keywords) VALUES ('{toyota}'::text[]);
INSERT INTO foo (keywords) VALUES ('{volkswagen}'::text[]);
INSERT INTO foo (keywords) VALUES ('{saturn}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda, hummer, ferrari}'::text[]);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
ANALYZE foo;
-- End test case

Query for the form "arr&&{f,b}"
SELECT 
    count(*) 
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id
WHERE
    foo.keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

Query for the form "arr&&{f} or arr&&{b}":
SELECT 
    count(*)
FROM foo, bar
WHERE 
    foo.bar_id = bar.id
    AND
    (
        keywords && '{ford}'::text[]
        OR keywords && '{toyota}'::text[] 
        OR keywords && '{volkswagen}'::text[] 
        OR keywords && '{saturn}'::text[] 
        OR keywords && '{honda}'::text[] 
        OR keywords && '{porsche}'::text[] 
        OR keywords && '{hummer}'::text[] 
        OR keywords && '{ferrari}'::text[] 
    );

For the first form, "arr&&{f,b}", the query takes about 15
seconds. For the second form "arr&&{f} or arr&&{b}", we get about 8
seconds. The difference is around 90-100%, which is what I am seeing on
our real life queries.

The query plans also become similar to the real life query plans. But
I am having a hard time learning from them. The only interesting I see
is that the estimated cost seems to be different than the actual run
time. The second form has a higher estimated cost than the first form,
but has a lower run time.

In this test case, the query filters by any of 8 keywords. Note that
with just 2 keywords, the difference is only about 5 seconds. The
specific report that our users complained about involved 16
"keywords", where the difference is about 100%.

When you said "analysis and optimization may be deepen over table 
indexes used for join planning", I'm not sure what you mean. Can you
clarify? 



      


-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux