On Sun, Aug 18, 2013 at 10:33 PM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
Tyler Reese <jukey91@xxxxxxxxx> wrote:
> Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
>> Tyler Reese <jukey91@xxxxxxxxx> wrote:
>>> mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
>>>
>>> Limit (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
>>> -> Index Scan using cdr_pkey on cdr (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
>>> Filter: (lower("CallingPartyNumber") = '9725551212'::text)
>>> Total runtime: 30465.246 ms
>>
>> It thinks that it will only need to read 1/412th of the table to
>> find 100 matching rows, and using that index it will be able to
>> skip the sort. Since there aren't 100 matching rows, it has to
>> read the whole table through the index.
> So, since it thinks it needs to read 1/412th of the table is the
> reason why the query planner chooses to use the primary key index
> instead of the callingpartynumber index, like it does in the
> first 3 cases?

The optimizer compares the estimated cost of reading all matching
rows (but only matching rows) out of order and then sorting them to
the estimated cost of reading them in order and filtering out the
rows that don't match (and stopping when the limit is reached).
Since it thought a lot of rows would match, that made the sort look
more expensive and also made it look like it would not need to read
a very large percentage of the table.
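As a back-of-the-envelope check (plain arithmetic, not PostgreSQL's actual costing code), the Limit node's cost in the slow plan is just the full index-scan cost scaled by the fraction of expected matches needed to satisfy the limit:

```python
# Numbers taken from the slow EXPLAIN ANALYZE output above.
full_scan_cost = 30036152.32   # total estimated cost of the cdr_pkey index scan
estimated_matches = 41212      # planner's row estimate for the filter
limit = 100

# Assuming matches are spread evenly through the index, the scan is
# expected to stop after reading limit/estimated_matches of the table.
limit_cost = full_scan_cost * limit / estimated_matches
print(round(limit_cost, 2))   # 72882.05, the Limit node's estimated cost
```

With only 11 actual matches instead of 41212, the scan never hits the limit and ends up walking the whole index, which is why the estimate and the 30-second reality diverge so badly.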
> I'm curious as to why it says "rows=41212". Is that the estimate
> of the number of rows that meet the filter condition? Where does
> that come from?

That's based on the distribution observed in the last random sample
taken when ANALYZE was run (as a command or by autovacuum). When
there is an error in the estimate that bad, either autovacuum is
not configured to be aggressive enough in analyzing or the sample
size was not large enough.
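The sample size is tied directly to the statistics target: per the PostgreSQL documentation, ANALYZE examines roughly 300 × target rows, so raising the target enlarges the sample proportionally (a rule of thumb, not an exact guarantee):

```python
def analyze_sample_rows(statistics_target):
    # ANALYZE samples about 300 * statistics_target rows (per the
    # PostgreSQL docs for default_statistics_target).
    return 300 * statistics_target

print(analyze_sample_rows(100))    # default target of 100  -> 30000 rows sampled
print(analyze_sample_rows(1000))   # raised target of 1000 -> 300000 rows sampled
```

On a large table, a 30,000-row sample can easily miss a value that occurs only 11 times, leaving the planner to fall back on much cruder estimates.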
>> Raising the statistics target and running ANALYZE might allow
>> it to use a more accurate estimate, and thereby make a better
>> choice.
>
> I haven't heard of raising the statistics target, so I'll read up
> on that.

http://www.postgresql.org/docs/9.2/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER
http://www.postgresql.org/docs/9.2/interactive/sql-altertable.html

> A few days ago, all 4 cases were responding equally fast. I had
> been messing around with the postgres settings, and I went and
> dropped all of the indexes and recreated them just to see what
> would happen. I wouldn't think that recreating the indexes would
> cause case 4 to go slow, but that's the symptom I am seeing now.
> Should I be running analyze on a table after it has been
> reindexed?

Only if you have indexes on expressions rather than simple column
names. If you have an index on lower("CallingPartyNumber"), that
would qualify as an expression.
The normal reason to need to get fresh statistics is because of
changes in the distribution of values in a column, particularly
after a bulk load. Also, columns with a large number of distinct
values tend to benefit from a higher statistics target. The
downsides of higher statistics targets are a longer time to ANALYZE and
increased planning time; so it's generally best to use the default
except where a particular problem has been observed, like in this
case.
I performed ANALYZE on the table and now case 4 is fast again:
mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=958.72..958.97 rows=100 width=758) (actual time=0.251..0.310 rows=11 loops=1)
-> Sort (cost=958.72..959.32 rows=240 width=758) (actual time=0.244..0.266 rows=11 loops=1)
Sort Key: key
Sort Method: quicksort Memory: 30kB
-> Bitmap Heap Scan on cdr (cost=7.30..949.55 rows=240 width=758) (actual time=0.105..0.162 rows=11 loops=1)
Recheck Cond: (lower("CallingPartyNumber") = '9725551212'::text)
-> Bitmap Index Scan on callingpartynumber_lower (cost=0.00..7.24 rows=240 width=0) (actual time=0.085..0.085 rows=11 loops=1)
Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
Total runtime: 0.517 ms
(9 rows)
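For scale, comparing the measured runtimes of the two plans (simple arithmetic on the "Total runtime" lines above):

```python
slow_ms = 30465.246   # total runtime walking cdr_pkey with a filter
fast_ms = 0.517       # total runtime after ANALYZE, via the expression index
print(round(slow_ms / fast_ms))   # fresh statistics bought a ~59000x speedup
```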
Thanks for the help, Kevin.