Re: Problem search on text arrays, using the overlaps (&&) operator

nha <lyondif02@xxxxxxx> · Wed, 08 Jul 2009 13:13:50 +0200

Hello,

Le 8/07/09 0:52, John Cheng a écrit :
> I don't mean to be pesky. I was just wondering if there is anything
> else I should try? 
> 
> Should I simply rewrite all queries, change the form
> 
> WHERE textarr && '{foo, bar}'::text[]
> 
> To
> 
> WHERE (textarr && '{foo}'::text[]
> 	   OR textarr && '{bar}'::text[])
> 
> ?
> 

While still puzzled by the big runtime difference you report between the
2 condition forms, I went on assessing these runtimes on my side from
the new case statements that are assumed to reflect more the real world.

Here are some measure results I got: (sorry for this long table)

seq	style	runtime
---	-----	-------	
(db=slf)
N01	OR-EA	6 237
N02	CC-EA	5 250
N03	OR+EA	12 628	
N04	CC+EA	12 700	
N05	OR+EA	15 679	
N06	CC+EA	11 510	
N07	CC-EA	7 712
N08	OR-EA	8 741
N09	CC-EA	4 963
N10	OR-EA	6 499
(db=stg)
N11	CC+EA	15 978	
N12	OR+EA	15 350	
N13	CC-EA	8 102
N14	OR-EA	9 428
N15	OR-EA	5 267
N16	CC-EA	5 017
N17	OR-EA	6 119
N18	CC-EA	4 955
N19	OR+EA	11 722	
N20	CC+EA	11 532	
N21	OR-EA	7 303
N22	CC-EA	5 438
N23	CC-EA	5 519
N24	OR-EA	5 373
N25	OR-EA	5 422
N26	CC-EA	5 064
(db=stg)
N27	CC-EA	8 314
(db=slf)
N28	OR-EA	6 656
(db=stg)
N29	OR-EA	6 760
(db=slf)
N30	CC-EA	6 753
(db=stg)
N31	CC-EA	5 500
(db=slf)
N32	OR-EA	5 907
(db=stg)
N33	OR-EA	5 391
(db=slf)
N34	CC-EA	5 517
---	-----	----------	

Legend
------
seq: sequence order.
style: condition style of query.
	CC: style "arr&&{f,b}" (one clause with multi-value text table).
	OR: style "arr&&{f} or arr&&{b}" (many clauses with 1-value text table).
	OR2: same style as style OR, with explicit JOIN in query expression.
	+EA: performed with EXPLAIN ANALYZE on query. Slower.
	-EA: performed without EXPLAIN ANALYZE on query. Faster.
runtime: run time in milliseconds.
(db=?): indicates that the following sequences have been performed after
a drop-and-create process for all the tables and indexes.
------

Results from 2 selected EXPLAIN ANALYZE sequences:

-- seq 03 (OR+EA)
Aggregate  (cost=37630.52..37630.53 rows=1 width=0) (actual
time=12628.182..12628.184 rows=1 loops=1)
  ->  Hash Join  (cost=25989.12..37601.04 rows=11792 width=0) (actual
time=8796.002..12231.422 rows=300000 loops=1)
        Hash Cond: ((bar.id)::numeric = foo.bar_id)
        ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.025..402.458 rows=300000 loops=1)
        ->  Hash  (cost=24636.81..24636.81 rows=82425 width=8) (actual
time=8795.027..8795.027 rows=2097152 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=1565.44..24636.81
rows=82425 width=8) (actual time=670.248..5098.109 rows=2097152 loops=1)
                    Recheck Cond: ((keywords && '{ford}'::text[]) OR
(keywords && '{toyota}'::text[]) OR (keywords && '{volkswagen}'::text[])
OR (keywords && '{saturn}'::text[]) OR (keywords && '{honda}'::text[])
OR (keywords && '{porsche}'::text[]) OR (keywords && '{hummer}'::text[])
OR (keywords && '{ferrari}'::text[]))
                    ->  BitmapOr  (cost=1565.44..1565.44 rows=83879
width=0) (actual time=665.516..665.516 rows=0 loops=1)
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.013..114.013
rows=262144 loops=1)
                                Index Cond: (keywords && '{ford}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=72.398..72.398
rows=262144 loops=1)
                                Index Cond: (keywords && '{toyota}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=74.118..74.118
rows=262144 loops=1)
                                Index Cond: (keywords &&
'{volkswagen}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.486..58.486
rows=262144 loops=1)
                                Index Cond: (keywords && '{saturn}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=114.671..114.671
rows=524288 loops=1)
                                Index Cond: (keywords && '{honda}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=115.290..115.290
rows=524288 loops=1)
                                Index Cond: (keywords &&
'{porsche}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.184..58.184
rows=262144 loops=1)
                                Index Cond: (keywords && '{hummer}'::text[])
                          ->  Bitmap Index Scan on foo_idx
(cost=0.00..175.07 rows=10485 width=0) (actual time=58.336..58.336
rows=262144 loops=1)
                                Index Cond: (keywords &&
'{ferrari}'::text[])
Total runtime: 12628.401 ms
-- seq 03 (OR+EA)

-- seq 04 (CC+EA)
Aggregate  (cost=26726.37..26726.38 rows=1 width=0) (actual
time=12700.620..12700.621 rows=1 loops=1)
  ->  Hash Join  (cost=17879.62..26722.62 rows=1500 width=0) (actual
time=8482.572..12272.449 rows=300000 loops=1)
        Hash Cond: ((bar.id)::numeric = foo.bar_id)
        ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4)
(actual time=0.029..412.984 rows=300000 loops=1)
        ->  Hash  (cost=17748.56..17748.56 rows=10485 width=8) (actual
time=8481.524..8481.524 rows=2097152 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=177.69..17748.56
rows=10485 width=8) (actual time=978.464..4679.954 rows=2097152 loops=1)
                    Recheck Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
                    ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07
rows=10485 width=0) (actual time=973.569..973.569 rows=2097152 loops=1)
                          Index Cond: (keywords &&
'{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
Total runtime: 12700.838 ms
-- seq 04 (CC+EA)

As far as I tried, the runtime difference between the 2 forms varies up
to 10% and always under 1 second--ie. far away from the 100% and (15-8=)
7 seconds reported. Moreover the form "arr&&{f,b}" (let us call it: form
1) often shows better than the other (let us call it: form 2)--already
noticed this fact by previous measures without joins.

Exception occurs in a special case: for the first query submitted
immediately after populating tables. It seems as if indexes were not
optimized at this time. But all the following queries, whatever form is
considered, run in lesser and similar time.

When observing the runtime of intermediate steps (bitmap heap scan on
table foo [step 1], sequential scan on table bar [step 2], hash join
[step 3]), it can be noted that all 3 steps run in a comparable time
whatever form used:
- no relevant difference for steps 2 and 3, but an attended long time
for retrieving all the resulting rows because of their high number;
- for step 1: form 1 (arr&&{f,b}) needs more time to extract the 1st row
of the resultset but is faster to collect the last rows. So no
significant difference may be also revealed here.

As I mentioned, I noticed that the runtime is longer for the 1st query
performed after populating tables and that, since the 2nd query
performed, runtimes clearly decrease towards a mean (about 6 seconds on
my side). Maybe the runtimes reported (15 and 8 seconds respectively)
occur in such a case (just after populating tables) with a performing
order starting with form 1. As much as I saw, the sequence order acts
upon the runtime reduction. Moreover successive queries tend to mean and
equal runtimes with the 2 forms.

About the user complaint, I would be not so astonished because of the
duration of one query (more than 5 seconds in the 2 cases). There are so
many rows to select that it seems not really possible to get a lesser
runtime.

What I may conclude: query plans show difference in performing the
queries: each Where clause is assessed before a commun heap scan of
table foo. The Where clause of the form 1 (multi-value text table) is
longer to run because of multi-value to test for each key; the Where
clause of the form 2 (1-value text table) is faster but the addition of
the corresponding 1-value text table Where clauses runs in a comparable
time. This observation resulted from measures on my sides and may not
correctly reflect some proper configurations of the real production
environment.

Opting to form 2 (arr&&{f} or arr&&{b}) will surely not make runtime
worst. So it would not be a bad choice to implement it if convinced.

Another way could concern the hash join. It has been shown that this
step costs a lot with respect to the overall runtime. Depending on
available storage space and DBMS load, a kind of materialized view may
be handled in order to cut off the overloading join. Here are some
suggested statements to create this helper table:

-- Creates a table jfoobar wrt. tables foo and bar
-- (max 10 seconds elapsed on my platform)
CREATE OR REPLACE TABLE jfoobar AS
SELECT
    foo.bar_id, foo.keywords
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id;
-- Creates an index for jfoobar
-- (max 3 seconds elapsed on my platform)
--  this way...
CREATE INDEX jfoobar_idx ON jfoobar(bar_id, keywords);
--  or this one...
DROP INDEX jfoobar_idx;
CREATE INDEX jfoobar_idx ON jfoobar USING gin(keywords);
--

This table may be updated or replaced on a regular time base or by
triggered event on updating foo or bar tables.

Finally, any query of the form 1 or 2 will run in less than 300 ms:
-- form 1: unique text table
SELECT	count(*) FROM jfoobar
WHERE
    keywords && '{ford, toyota, volkswagen, saturn, honda, porsche,
hummer, ferrari}'::text[];

-- form 2: developped text table
SELECT	count(*) FROM jfoobar
WHERE
    (
        keywords && '{ford}'::text[]
        OR keywords && '{toyota}'::text[]
        OR keywords && '{volkswagen}'::text[]
        OR keywords && '{saturn}'::text[]
        OR keywords && '{honda}'::text[]
        OR keywords && '{porsche}'::text[]
        OR keywords && '{hummer}'::text[]
        OR keywords && '{ferrari}'::text[]
    );
--

The runtime would only depend on the update strategy for the join table.

Regards.

--
nha / Lyon / France.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general