On 2013-01-31, hamann.w@xxxxxxxxxxx <hamann.w@xxxxxxxxxxx> wrote: > > Pavel Stehlule wrote: > >>> >> Hi, >>> >> >>> >> I am trying to match items from 2 tables based on a common string. >>> >> One is a big table which has one column with entries like XY123, ABC44, = >>> etc >>> >> The table has an index on that column. >>> >> The second table is, typically, much smaller >>> >> >>> >> select .... from tab1, tab2 where tab1.code =3D tab2.code; >>> >> >>> >> This works fine and fast. >>> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D= >>> in the >>> >> big table and want them to match XY423, GF55 in the second table >>> >> >>> >> Variants I have tried >>> >> >>> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >>> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z= >>> ])'); >>> >> >>> > >>> > Have you tried the substring function? >>> > >>> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D >>> > tab2.code >>> > > > Hi Pavel, it was just by chance that a fixed size substring would match the > data at hand. It is more common to have a digit/letter (or vice versa) boundary > or a hyphen there > >>> > >>> >> both take an enormous time. In the better case that I can subset (e.g. a= >>> ll candidates in table 2 >>> >> share initial "AX") I get back to manageable times by adding >>> >> and tab1.code ~ '^AX' >>> >> into the recipe. Actual runtime with about a million entries in tab1 and= >>> 800 entries in tab2 >>> >> is about 40 seconds. >>> >>> any join where result is related to some function result can be very >>> slow, because estimation will be out and any repeated function >>> evaluation is just expensive. >>> > I see the problem since obviously every the ~ operator with a non-constant > pattern is constantly recompiling the pattern. > > I wonder whether it would be possible to invent a prefix-match operator that approaches > the performance of string equality. I noted in the past (not sure whether anything > has changed in regex matching) that a constant leading part of regex would improve > performance, i.e. use an index scan to select possible candidates. > you could write a set returning function that opens cursors on both tables using "ORDER BY code" and merges the results -- ⚂⚃ 100% natural -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general