Yeah, that was it. Thanks! I do have one more question at the bottom, though, if anyone has enough time to read through my analysis.
If I create the table as:
CREATE TABLE users
(
userid integer NOT NULL,
location integer NOT NULL,
CONSTRAINT pk_users PRIMARY KEY (userid)
)
WITH (
OIDS=FALSE
);
CREATE INDEX idx_users_location
ON users
USING btree
(location);
INSERT INTO users (userid,location) SELECT GENERATE_SERIES(1,1000000) , GENERATE_SERIES(1,1000000)/100000;
UPDATE users SET location=76543 WHERE userid=100092;
UPDATE users SET location=76543 WHERE userid=997000;
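To double-check the resulting distribution before running the queries, a quick aggregate like this can be used:
-- row count per distinct location value
SELECT location, COUNT(*) AS num_rows
FROM users
GROUP BY location
ORDER BY location;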
So now we have a dozen distinct location values (0 through 10 from the integer division, plus 76543), roughly evenly distributed, except that value 10 has only one row and value 76543 has two rows. If, after this setup, I do statement C:
explain analyze SELECT userid FROM users, (VALUES (76543), (892), (10)) val (id) WHERE location = val.id;
Nested Loop (cost=0.00..17277.21 rows=300000 width=4) (actual time=0.023..0.06 rows=3 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=1)
-> Index Scan using idx_users_location on users (cost=0.00..4509.06 rows=10000 width=8) (actual time=0.008..0.009 rows=1 loops=3)
Index Cond: (users.location = "*VALUES*".column1)
Total runtime: 0.078 ms
(5 rows)
and if I do statement D:
explain analyze SELECT userid FROM users WHERE location IN (VALUES (76543), (892), (10));
Nested Loop (cost=0.05..17277.24 rows=300000 width=4) (actual time=0.033..0.056 rows=3 loops=1)
-> HashAggregate (cost=0.05..0.08 rows=3 width=4) (actual time=0.012..0.015 rows=3 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.002..0.004 rows=3 loops=1)
-> Index Scan using idx_users_location on users (cost=0.00..4509.06 rows=100000 width=8) (actual time=0.007..0.009 rows=1 loops=3)
Index Cond: (users.location = "*VALUES*".column1)
Total runtime: 0.094 ms
(6 rows)
C has a slight edge over D here (I ran them both about 5 times, and C seems to be approx. 20% faster for this specific data set), so I think this will work pretty well. However, I'm still curious (for my own education) as to why something like the following has even more of an edge over the previous two alternatives. Statement E:
explain analyze SELECT userid FROM users WHERE location IN (76543, 892, 10);
Bitmap Heap Scan on users (cost=12.91..16.93 rows=1 width=4) (actual time=0.035..0.038 rows=3 loops=1)
Recheck Cond: (location = ANY ('{76543,892,10}'::integer[]))
-> Bitmap Index Scan on idx_users_location (cost=0.00..12.91 rows=1 width=0) (actual time=0.027..0.027 rows=3 loops=1)
Index Cond: (location = ANY ('{76543,892,10}'::integer[]))
Total runtime: 0.072 ms
(5 rows)
For C, the planner estimated 10 thousand rows. For D, the planner estimated 100 thousand rows, yet for E the planner estimated only 1 row, which is closest to reality. So, is there any way to specify a query with a sub-select that returns a small set of values so that the planner treats it similarly to how it treats statement E, or does statement E get its additional edge precisely from the fact that the restriction is defined by integer literals? If so, I think that's OK, because it seems like statements C or D will work well enough when the location distribution is realistic, but I'd like to be educated for the future :)
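For reference, the EXPLAIN output for E shows the literal IN list being rewritten as location = ANY ('{76543,892,10}'::integer[]), so writing the array form explicitly should plan the same way (just a sketch for comparison; I haven't re-timed it):
-- equivalent to statement E's internal form, with the constants still visible to the planner
explain analyze SELECT userid FROM users WHERE location = ANY (ARRAY[76543, 892, 10]);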
Thanks again!
Eddy
On Sun, Nov 15, 2009 at 3:59 PM, Eddy Escardo-Raffo <eescardo@xxxxxxxxxx> wrote:
Thanks, Tom. I had discarded the possibility of a data type mismatch already, which was your first guess, but was wondering if the lopsided distribution of location values would lead the planner to make a decision that is good on average but bad for this particular query, as you point out in your second guess.

I'll try populating the test users with a more evenly distributed location field, which will be more realistic anyway, and see if that works out better.

BTW, the -1 is not really a dummy value; it's just a value that we have been using in tests for "fake test location ID". I just started performance measurement for my application and so far had measured performance with every user being in the same default location, and things seemed to be going well, so I tried switching a couple of users to a different location to see what happened, and that made performance drop significantly. (Even more detail: my queries also limit results to approx. 10 rows, so the DB quickly found 10 rows matching location -1, but it took a while to discover there weren't more than 2 rows with the other value.)

Thanks!
Eddy

On Sun, Nov 15, 2009 at 3:33 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Eddy Escardo-Raffo <eescardo@xxxxxxxxxx> writes:
> The table used in this query is called "users", and it has columns "userid"
> (primary key) and "location".
> The "location" column is indexed.
> The users table has 1 million rows, and all rows have integer typed value
> '-1' for "location" column, except for 2 rows that have the integer value
> '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner
doesn't want to use an indexscan because it assumes there's a
significant probability that the search will be for -1 (in which case
the indexscan would be slower than a seqscan, as indeed your results
prove). Even though it could know in this particular case that the
comparison value isn't -1, I doubt that teaching it that would help your
real queries where it will probably be impossible to determine the
comparison values in advance.
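The per-value frequencies the planner works from can be inspected in the pg_stats view, for example:
-- most_common_vals/most_common_freqs are what make -1 look so likely
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'users' AND attname = 'location';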
I would suggest considering using NULL rather than inventing a dummy
value for unknown locations. The estimation heuristics will play a
lot nicer with that choice.
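A minimal sketch of that change (assuming the location column can be made nullable; if it is declared NOT NULL, as in the test table above, that constraint would have to be dropped first):
ALTER TABLE users ALTER COLUMN location DROP NOT NULL;
UPDATE users SET location = NULL WHERE location = -1;  -- unknown location instead of a dummy value
ANALYZE users;  -- refresh statistics so the planner sees the new distribution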
regards, tom lane