Search Postgresql Archives

"like" vs "substring" again

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi list,

last week I asked a question about a query with several joins and a "like" operator which was really slow. When I replaced "like" with "substring" (which was possible because the comparison was simply "bla like '123%'") the query became extremely faster because the query optimizer came to another plan.

Gregory Stark gave me the hint (thanks, Gregory!) that the query optimizer made wrong assumptions about the selectivity of the "like". When I used "substring" the assumptions became better and so it chose a better (faster) plan. I then increased the statistics target of the column and the query with "like" became as fast as when I used "substring". So far, so good.

Now I have a similar problem: I have a query (which doesn't look too complicated to me) which takes about 4.5 hours on a 2 GHz Dual-Core Xeon machine. The query joins several tables and has two comparisons, one with "not like", the other with "substring(...) = ". When I use "like" and "not like" together or "substring(...) = " and "substring(...) <> " together, the query takes about 5 seconds. The plan is identical in both cases and different to the plan when I mix the comparisons. The most obvious difference is the number of rows the query optimizer expects to get from the table which is filtered: It expects 1 if I mix the comparison operators and 84 if I consistently use "like" or "substring". The real number of selected rows is 1667 (from a total of 2884 rows), so both estimations are rather wrong. Note that this is exactly the same column for which I increased the statistics target to 500 after last week's discussion ...

I then set up a test table with the problematic column and filled it with the same data. The test table looks as follows:

      Table "pg_temp_3.temp"
Column |     Type      | Modifiers
--------+---------------+-----------
test   | character(10) | not null

I set the statistics target to 1000 for this column and ran the following queries:

explain analyze select * from temp where test like '11%' and test not like '113%';
                                             QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on "temp"  (cost=0.00..62.26 rows=39 width=14) (actual time=0.012..1.229 rows=1678 loops=1)
  Filter: ((test ~~ '11%'::text) AND (test !~~ '113%'::text))
Total runtime: 1.655 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and substring(test from 1 for 3) <> '113';
                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Seq Scan on "temp"  (cost=0.00..91.10 rows=14 width=14) (actual time=0.020..3.282 rows=1678 loops=1)
  Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND ("substring"((test)::text, 1, 3) <> '113'::text))
Total runtime: 3.719 ms
(3 rows)

explain analyze select * from temp where substring(test from 1 for 2) = '11' and test not like '113%';
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on "temp"  (cost=0.00..76.68 rows=1 width=14) (actual time=0.018..2.469 rows=1678 loops=1)
  Filter: (("substring"((test)::text, 1, 2) = '11'::text) AND (test !~~ '113%'::text))
Total runtime: 2.914 ms
(3 rows)

As far as I understand, all queries are semantically identical and have the same result set. However, the query optimizer makes very different estimations about the number of rows the queries would return. All the estimations are far from reality, and at least the last one leads to fatal results when this "where" clause is part of a more complex query.

So I have the following questions:

  1. Why does the query optimizer not recognize that the expressions
     are equivalent?
  2. What can I do to improve the estimation of the query optimizer? I
     tried to create an index (with opclass "bpchar_pattern_ops") which
     was actually used in the first query, but did not improve the
     estimation or the execution speed.

Thanks again for any help!

Regards,
   Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux