Search Postgresql Archives

Re: Problem search on text arrays, using the overlaps (&&) operator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

Le 2/07/09 2:07, John Cheng a écrit :
We use text[] on one of our tables. This text[] column allows us to
search for records that matches a keyword in a set of keywords. For
example, if we want to find records that has a keyword of "foo" or
"bar", we can use the condition:

   keywords&&  '{foo, bar}'::text[]

Another wau is to do this:

   (keywords&&  '{foo}::text[]' OR keywords&&  '{bar}::text[]')

I am noticing a big difference between the two ways. I'm trying to
find out if we need to re-write our queries to speed them up, or
perhaps I am just missing something about how to use text[].

[...]
For some reason, I am seeing a big difference in our real database. I
don't want to just rewrite all of our queries yet. I'm guessing the
data makes a big difference.  What would be a good way to examine the
data to figure out what's the best way to write our queries? Is there
any features in PostgreSQL that can help me improve the performance?

Any advice would be greatly appreciated!


With your exhaustive example statements based on table foo and cars, I performed some measures on my side (PostgreSQL 8.3.1 server). Here are some statistical results:

seq	rtm	delta	ratio	deco
---	---	-----	-----	----
s1cc	873	-1,74%	2//9	1+1
s1or	889	1,71%	2//9	1+1
s2cc	1322	8,53%	3//9	1+2
s2or	1209	-9,32%	3//9	1+2
s3cc	892	-2,61%	2//9	1+(.5*2)
s3or	915	2,54%	2//9	1+(.5*2)
s4cc	511	-3,00%	1//9	(.5*2)
s4or	526	2,91%	1//9	(.5*2)
s5cc	1635	2,13%	4//9	1+1+2
s5or	1600	-2,17%	4//9	1+1+2
---	---	-----	-----	----

seq	where clauses
---	-------------
s1cc	keywords && '{ford, toyota}'::text[]
s1or	keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc	keywords && '{ford, honda}'::text[]
s2or	keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc	keywords && '{honda, ferrari}'::text[]
s3or	keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc	keywords && '{ferrari, hummer}'::text[]
s4or	keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc	keywords && '{ford, toyota, porsche}'::text[]
s5or keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR keywords && '{porsche}'::text[]

legend
------
seq	sequence of 10 subsequent explain analyze per row
rtm	runtime mean (in milliseconds) of 10 subsequent measures
delta	difference percentage between cc and or sequences
cc	unique where clause with >1-size table (eg. {foo,bar})
or	multiple where clauses with 1-size text table (eg. {foo})
ratio	ratio between # of result rows and # of table rows
deco result row partition between constant text table values in where clause.

In the following, I refer to your condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrarily to your observation, for the "ford or toyata" case (sequence s1 developped into 2 subcases s1cc and s1or for both forms of condition), runtime mean is shorter for s1cc (arr&&{f,t}) than for s1or (arr&&{f} or arr&&{t}). But the difference percentage is only about 1,7% (ie. not enough relevant).

This empirical advantage of form arr&&{f,t} over form (arr&&{f} or arr&&{t}) is also observed for 2 cases (s3 and s4) out of 4 (s2 up to s5). The difference percentage looks more relevant (about 3%). The cases s3 and s4 differ from the others (s1, s2, and s5) by the fact that the sets of matching rows for each compared text table value intersect: all the rows matched by ferrari also match honda (strict inclusion not equality); all the rows matched by ferrari also match hummer and conversely (double inclusion here, ie. equality). In the other 3 cases, each compared text table value matches set of rows without intersecting the matching row set of the other(s) value(s). We may then assume that form arr&&{f,t} would fit better when there are lots of rows matched by several terms--however this cannot be generalized at this stage.

The reported data let us also guess some linear relationship between runtime and # of result rows. Here this relationship seems equally applicable with both forms arr&&{f,t} and (arr&&{f} or arr&&{t}).

Out of these measures and report, I notice that, regarding the query plan explanation of your queries over real data, the difference between actual runtimes reported for each case of condition forms is not so relevant with respect to the overall runtime of the queries. At bitmap heap scan on the table over which conditions are performed, when the last row is retrieved, actual runtime is respectively of:
- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f) or arr&&{b}): 1211.535 ms.
While quite close (difference percentage of about 5%), the difference is not really harmful with respect to the overall runtimes (resp. 13197 ms and 7768 ms), ie. in terms of part of overall runtimes resp. (1276/13197=) 10% and (1211/7768=) 16 %.

Even if overlap operator runtimes are interchanged between the 2 cases, ratios would be resp. (1211/13197=) 9% and (1276/7768=) 16%, ie. not so different from the current plan.

On the other hand, the hash join runtime part is very huge:
- for arr&&{f,b} case: (13170-1924=) 11246 ms;
- for (arr&&{f) or arr&&{b}) case: (7743-1850=) 5893 ms.
These values represent resp. (11246/13197=) 85% and (5893/7768=) 76% of overall runtimes, ie. clearly dominating the overall query plan.

In my opinion, analysis and optimization may be deepen over table indexes used for join planning. As your reported query plans show, the Where clauses are performed independantly from the table ml_lead; the reason is that all the attributes of the clauses belong to the table lead_reporting_data. Time may be reduced on join condition achievements.

Hoping this observation will contribute a little to your opinion.

Without any claim, I attached a document to this email for details on the measures I took with the overlap operator -- OpenDocument Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various" presents the detailed measures related to the data reported in this email.

Regards.

--
nha / Lyon / France.

Attachment: overlap-opr-measures.ods
Description: Binary data

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux