Indexed leading substring searches - worked, now doesn't

Wes <wespvp@xxxxxxxxxxxx> · Thu, 03 Feb 2005 13:20:00 -0600

I know my leading substring searches used to be done via indexes.  We
specifically tested that.  Since the last time I tested it, the database has
probably been reloaded to fix a corruption problem.  Now all I can get is
sequential leading substring searches.  In the examples below, the database
was vacuumed last night.

The database is very large (currently about 100 GB, and will be 100's of
gigabytes), so performance is important.  This particular table in the
example has only about 2.8 million rows.

PostgreSQL version is 7.4.5.  Prior to the reload, it was 7.4.1.

The locale is showing up as en_US.iso885915.  As far as I know, it was
always this (default RedHat install), so I don't understand why it worked
before.  Did something change between 7.4.1 and 7.4.5?  I supposed it's
possible that I specified locale=C on the original database and don't
remember that...

I'm not going to have to "initdb --locale=C" and am I?  I looked at index
classes, and that doesn't appear to be something I want to do, due to
performance.  What kind of performance hit do you actually take by using an
index class?

Wes

Pg_controldata shows:

Maximum length of locale name:        128
LC_COLLATE:                           en_US.iso885915
LC_CTYPE:                             en_US.iso885915

narc=> \d addresses
             Table "public.addresses"
   Column    |          Type          | Modifiers
-------------+------------------------+-----------
 address_key | numeric(12,0)          | not null
 address     | character varying(255) | not null
Indexes:
    "addresses_pkey" primary key, btree (address_key)
    "addresses_i_address" btree (address)

narc=> select count(*) from addresses;
  count  
---------
 2829640
(1 row)

narc=> explain select * from addresses where address = 'blah';
                                      QUERY PLAN
----------------------------------------------------------------------------
----------
 Index Scan using addresses_i_address on addresses  (cost=0.00..2.81 rows=1
width=40)
   Index Cond: ((address)::text = 'blah'::text)
(2 rows)

narc=> explain select * from addresses where address like 'blah%';
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on addresses  (cost=0.00..61244.68 rows=2 width=40)
   Filter: ((address)::text ~~ 'blah%'::text)
(2 rows)

narc=> explain analyze select * from addresses where address like 'blah%';
                                                  QUERY PLAN
----------------------------------------------------------------------------
----------------------------------
 Seq Scan on addresses  (cost=0.00..61244.68 rows=2 width=40) (actual
time=1445.386..8913.435 rows=6 loops=1)
   Filter: ((address)::text ~~ 'blah%'::text)
 Total runtime: 8913.504 ms
(3 rows)

Something else doesn't make sense..  I did the same query a few minutes ago
and the cost was totally different:

narc=> explain select * from addresses where address like 'blah%';
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on addresses  (cost=100000000.00..100061244.67 rows=2 width=40)
   Filter: ((address)::text ~~ 'blah%'::text)
(2 rows)

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@xxxxxxxxxxxxxx)