Unexpected sequential scan on an indexed column

Hi, everyone.
 
Between the Postgres docs, forum posts, previously answered questions, and assorted blogs, I've read as much as I could about why others have had similar problems before turning to you for help, so I really hope this isn't some completely obvious oversight on my part (I am fairly new to DB programming, after all).
 
So, my Postgres version is: PostgreSQL 8.4.1, compiled by Visual C++ build 1400, 32-bit
 
The table used in this query is called "users", and it has the columns "userid" (the primary key) and "location".
The "location" column has a btree index.
The users table has 1 million rows; every row has the integer value -1 in the "location" column except for 2 rows, which have the value 76543.
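 
As a quick sanity check on that skew, a query like this should show 999,998 rows at -1 and 2 rows at 76543:
 
SELECT location, COUNT(*) FROM users GROUP BY location ORDER BY location;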
 
I've attached a file with SQL commands that will set up this condition (it's also included at the end of this message).
 
Then I run statement A:
SELECT userid FROM users, (VALUES (76543)) val (id) WHERE location = val.id;
 
and the correct 2 rows are returned, but it takes much longer than I would expect, given that the location column is indexed.
I know that if all I wanted were the results of this specific query, I could simply run statement B:
 
SELECT userid FROM users WHERE location = 76543;
 
and that works 100% of the time, at the speed I would expect. However, the example here is a simplification of significantly more complex statements involving more joins and such, where I'm trying to minimize round trips to the database. I've narrowed things down to the point where I believe that if I can make something like statement A perform well, the overall performance problem will be easy to fix.
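 
Just to sketch the more general shape (the ids other than 76543 are invented for illustration), imagine several values instead of one:
 
SELECT u.userid
FROM users u
JOIN (VALUES (76543), (76544), (76545)) val (id) ON u.location = val.id;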
 
So, when I run EXPLAIN ANALYZE for statement A, I get the following:
 
  Nested Loop  (cost=0.00..27906.01 rows=1000000 width=8) (actual time=15.670..5411.503 rows=2 loops=1)
   Join Filter: (users.location = "*VALUES*".column1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.007 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.028..2903.398 rows=1000000 loops=1)
 Total runtime: 5411.620 ms
(5 rows)
 
Note that I did run VACUUM ANALYZE before running EXPLAIN ANALYZE.
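 
For anyone reproducing this, a table-level refresh is enough:
 
VACUUM ANALYZE users;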
 
I was curious about why the index wasn't being used, so I forced its use with "SET enable_seqscan TO off" and then saw the following EXPLAIN ANALYZE output:
 
 Nested Loop  (cost=0.00..43889.37 rows=1000000 width=8) (actual time=5813.875..5813.897 rows=2 loops=1)
   Join Filter: (users.location = "*VALUES*".column1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1)
   ->  Index Scan using idx_users_location on users  (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.375..2967.249 rows=1000000 loops=1)
 Total runtime: 5814.029 ms
 
So, even when we use the index, the planner seems to force the query to scan through all rows, rather than stopping the scan once it knows that no more rows will be returned (which, given the presence of the index, it should be able to tell).
 
If I use an ORDER BY clause to force the table scan to happen in descending order by location, then SQL statement C performs wonderfully:
 
postgres=# explain analyze SELECT userid FROM users, (VALUES (76543)) val (id) WHERE location = val.id ORDER BY location DESC;
 
But that only helps because of the specific values used in this example and wouldn't work in general: with ORDER BY location ASC, performance is still really slow. So basically, the planner always seems to want a sequential scan of the entire index, without placing any restriction on it, and it only aborts the full index scan early if the ordering happens to put the matching rows first.
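 
For completeness, the ascending variant that stays slow is:
 
SELECT userid FROM users, (VALUES (76543)) val (id) WHERE location = val.id ORDER BY location ASC;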
 
Do you have any idea why this is not working as I expect? What I ultimately want is for a construct like the table I labeled "val" to come from a sub-select. Given the size of the pool these values are drawn from, I very rarely expect the sub-select to return more than 10 values, so I was hoping the database would do something like a bitmap scan after restricting the users table to the values in the small value table. Maybe it isn't doing so because of the lopsided distribution of location values in the users table, but I'm just not sure.
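 
Concretely, the eventual query will look something like this (the "groups" table and its columns are made up here purely to show the shape):
 
SELECT u.userid
FROM users u
WHERE u.location IN (SELECT g.locationid FROM groups g WHERE g.groupname = 'example');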
 
Any help is appreciated.
 
Thanks!
Eddy
-- Setup script for the test case described above.
CREATE TABLE users
(
  userid bigserial NOT NULL,
  location integer NOT NULL,
  CONSTRAINT pk_users PRIMARY KEY (userid)
)
WITH (
  OIDS=FALSE
);

-- Insert 1 million rows, all with location = -1.
INSERT INTO users (userid, location) SELECT -1*GENERATE_SERIES(1,1000000), -1;

CREATE INDEX idx_users_location
  ON users
  USING btree
  (location);

-- Flip exactly two rows to location = 76543.
UPDATE users SET location=76543 WHERE userid=-100092;
UPDATE users SET location=76543 WHERE userid=-997000;