Tyler Reese <jukey91@xxxxxxxxx> wrote:
> Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
>> Tyler Reese <jukey91@xxxxxxxxx> wrote:

>>> mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
>>>
>>>  Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
>>>    ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
>>>          Filter: (lower("CallingPartyNumber") = '9725551212'::text)
>>>  Total runtime: 30465.246 ms
>>
>> It thinks that it will only need to read 1/412th of the table to
>> find 100 matching rows, and using that index it will be able to
>> skip the sort.  Since there aren't 100 matching rows, it has to
>> read the whole table through the index.

> So, since it thinks it needs to read 1/412th of the table is the
> reason why the query planner chooses to use the primary key index
> instead of the callingpartynumber index, like it does in the
> first 3 cases?

The optimizer compares the estimated cost of reading all matching
rows (but only matching rows) out of order and then sorting them to
the estimated cost of reading the rows in order and filtering out
the ones that don't match (stopping when the limit is reached).
Since it thought a lot of rows would match, that made the sort look
more expensive and also made it look as though the ordered index
scan would not need to read a very large percentage of the table.

> I'm curious as to why it says "rows=41212".  Is that the estimate
> of the number of rows that meet the filter condition?  Where does
> that come from?

That's based on the distribution observed in the last random sample
taken when ANALYZE was run (as a command or by autovacuum).  When
the estimate is off by that much, either autovacuum is not
configured to be aggressive enough about analyzing or the sample
size was not large enough.

>> Raising the statistics target and running ANALYZE might allow
>> it to use a more accurate estimate, and thereby make a better
>> choice.
>
> I haven't heard of raising the statistics target, so I'll read up
> on that.

http://www.postgresql.org/docs/9.2/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
http://www.postgresql.org/docs/9.2/interactive/sql-altertable.html

> A few days ago, all 4 cases were responding equally fast.  I had
> been messing around with the postgres settings, and I went and
> dropped all of the indexes and recreated them just to see what
> would happen.  I wouldn't think that recreating the indexes would
> cause case 4 to go slow, but that's the symptom I am seeing now.
> Should I be running analyze on a table after it has been
> reindexed?

Only if you have indexes on expressions rather than simple column
names.  If you have an index on lower("CallingPartyNumber"), that
would qualify as an expression.

The normal reason to need fresh statistics is a change in the
distribution of values in a column, particularly after a bulk load.
Also, columns with a large number of distinct values tend to
benefit from a higher statistics target.  The downsides of higher
statistics targets are a longer time to ANALYZE and increased
planning time, so it's generally best to use the default except
where a particular problem has been observed, like in this case.
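Concretely, that might look something like the sketch below.  This
assumes the "cdr" table and "CallingPartyNumber" column from the
plan above, and the target of 1000 is only an illustration; pick a
value appropriate for your data and be prepared to back it off if
ANALYZE or planning gets too slow.

  -- Raise the per-column target for the base column (1000 is an
  -- arbitrary example value).
  ALTER TABLE "cdr" ALTER COLUMN "CallingPartyNumber" SET STATISTICS 1000;

  -- Statistics for the lower(...) expression index are gathered at
  -- the default target, so raising it for this session before
  -- running ANALYZE affects that sample as well.
  SET default_statistics_target = 1000;

  -- Re-sample the table; the planner only sees the new statistics
  -- once this completes.
  ANALYZE "cdr";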
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company