Re: Unexpected sequential scan on an indexed column

Eddy Escardo-Raffo <eescardo@xxxxxxxxxx> · Sun, 15 Nov 2009 15:59:31 -0800

Thanks, Tom. I had discarded the possibility of data type mismatch already, which was your first guess, but was wondering if the lopsided distribution of location values would lead the planner to make a decision that is good on average but bad for this particular query, as you point out in your second guess.

I'll try populating the test users with a more evenly distributed location field, which will be more realistic anyway, and see if that works out better.

BTW, the -1 is not really a dummy value, but it's just a value that we have been using in tests for "fake test location ID". I just started performance measurement for my application and so far had measured performance with every user being in the same default location and things seemed to be going well, so I tried to switch a couple users to a different location and see what happened, and that made performance drop significantly.

(even more detail: my queries also limit results to 10 approx, so DB quickly found 10 rows that match location -1, but it took a while to discover there weren't more than 2 rows with the other value).

Thanks!
Eddy

On Sun, Nov 15, 2009 at 3:33 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

Eddy Escardo-Raffo <eescardo@xxxxxxxxxx> writes:

> The table used in this query is called "users", and it has columns "userid"
> (primary key) and "location".
> The "location" column is indexed.

> The users table has 1 million rows, and all rows have integer typed value
> '-1' for  "location" column, except for 2 rows that have the integer value
> '76543'.

Oh, after poking at it a bit more, I realize the problem: the planner

doesn't want to use an indexscan because it assumes there's a
significant probability that the search will be for -1 (in which case
the indexscan would be slower than a seqscan, as indeed your results
prove).  Even though it could know in this particular case that the

comparison value isn't -1, I doubt that teaching it that would help your
real queries where it will probably be impossible to determine the
comparison values in advance.

I would suggest considering using NULL rather than inventing a dummy

value for unknown locations.  The estimation heuristics will play a
lot nicer with that choice.

                       regards, tom lane