On 23 September 2011 14:29, <hamann.w@xxxxxxxxxxx> wrote:
>
> Alban Hertroys wrote:
>
>> So you're comparing a variable field value to a variable pattern - yeah,
>> that's going to hurt. There's no way you could index exactly that.
>>
>> Perhaps there's some way you can transform the problem so that you get
>> something indexable?
>> For example, if your match patterns follow a certain pattern by themselves,
>> you could add a column with the longest match pattern that would match the
>> string. Then you could just do a query for which records have the match
>> pattern (in that new column) that you're looking for and voila!
>>
>> Whether something like that is possible strongly depends on what kind of
>> match patterns you're using, of course.
>
> Hi Alban,
>
> I already did that - the test set is just all records from the real table
> (about a million entries) that match the common 'ABC' prefix

I think you misunderstood what I wrote. Notice the difference between
"which strings match the pattern" and "which records have the match
pattern (in that new column)" - the first is a regular expression match
(unindexable), while the second is a string equality match (indexable).

What I'm suggesting is to add a column, which for the string 'ABCDEFG'
would contain 'ABC%'. Data would look like:

SELECT str, pattern FROM tbl;
   str   | pattern
---------+---------
 ABCDEFG | ABC%
 ABCDEF  | ABC%
 BCDEFGH | BCD%
etc.

When you look for records that match the pattern 'ABC%', you would
normally perform a query like:

SELECT str FROM tbl WHERE str LIKE 'ABC%';

But with this new column, you would query:

SELECT str FROM tbl WHERE pattern = 'ABC%';

As I said, it depends a lot on your pattern needs whether this solution
would work at all for you. If you only ever use a few patterns, it will
work. If you use many different patterns or don't know beforehand which
patterns will be used, it won't work well at all.

> The main difference is: the fast query looks like
>
> explain select items.num, wantcode from items, n where code = wantcode;
>  Merge Join  (cost=53.56..1104.02 rows=39178 width=36)
>    Merge Cond: (("outer".code)::text = "inner".wantcode)
>    ->  Index Scan using itemsc on items  (cost=0.00..438.75 rows=9614 width=42)
>    ->  Sort  (cost=53.56..55.60 rows=815 width=32)
>          Sort Key: n.wantcode
>          ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32)

Is there an index on wantcode? If you have a million or more records, I
would expect an index scan for a measly 815 matches...

> and the slow ones look like this one:
>
>  Nested Loop  (cost=14.15..176478.01 rows=39178 width=36)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>
> So the database takes an entirely different approach to retrieving the
> entries.

Yes, because you're still using ~ there, with a pattern that's unknown at
query planning time. That will only be fast under some fairly rare
circumstances.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
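
P.S. Here is a minimal, self-contained sketch of the pattern-column idea
above; the DDL, the index name and the sample rows are placeholders made up
for illustration, not anything from your actual schema:

CREATE TABLE tbl (
    str     text NOT NULL,
    pattern text NOT NULL   -- pre-computed: the match pattern this row falls under
);

INSERT INTO tbl (str, pattern) VALUES
    ('ABCDEFG', 'ABC%'),
    ('ABCDEF',  'ABC%'),
    ('BCDEFGH', 'BCD%');

-- Equality on the pre-computed column is a plain btree lookup:
CREATE INDEX tbl_pattern_idx ON tbl (pattern);

SELECT str FROM tbl WHERE pattern = 'ABC%';

-- The same holds for a join against a table of wanted patterns: an equality
-- condition on tbl.pattern lets the planner use an index or merge join,
-- whereas "code ~ wantcode" has to re-evaluate the pattern for every row pair.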