On Mar 13, 2006, at 9:50 AM, Michael Fuhr wrote:
On Sun, Mar 12, 2006 at 11:36:23PM -0800, Casey Duncan wrote:
SELECT count(*) FROM webhits
WHERE path LIKE '/radio/tuner_%.swf' AND status = 200
AND date_recorded >= '3/10/2006'::TIMESTAMP
AND date_recorded < '3/11/2006'::TIMESTAMP;
[...]
Aggregate (cost=794775.08..794775.09 rows=1 width=0)
[...]
According to the planner it should take <15 minutes which is typical
in
practice.
The planner's cost estimate is in units of disk page fetches, not
time. The above estimate isn't 794775.09 ms (~13.25 min) but rather
794775.09 times the cost of a single page fetch, however much that
is. See "Using EXPLAIN" in the "Performance Tips" chapter of the
documentation.
http://www.postgresql.org/docs/8.1/interactive/performance-
tips.html#USING-EXPLAIN
Doh! I should've known that.
About half the times it runs, however, it never terminates
(even after days) and just spins consuming 99+% of CPU with no disk
activity. This query was never a problem in postgres versions < 8.1.2,
however the data has grown substantially since that time. I notice it
uses the recent in-memory bitmap feature, so I wondered if it was
exposing a bug.
If the problem happens half the time then you have a somewhat
repeatable test case. Do you get more consistent performance if
you set enable_bitmapscan to off? What's the query plan if you do
that?
Here's the plan with bitmap scans off:
QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
----------------------
Aggregate (cost=891363.71..891363.72 rows=1 width=0)
-> Index Scan using webhits_date_idx1 on webhits
(cost=0.00..891360.30 rows=1362 width=0)
Index Cond: ((date_recorded >= '2006-03-10
00:00:00'::timestamp without time zone) AND (date_recorded <
'2006-03-11 00:00:00'::timestamp without time zone))
Filter: (((path)::text ~~ '/radio/tuner_%.swf'::text) AND
(status = 200))
(4 rows)
The query runs to completion this way in about 40 minutes. I turned
bitmap scans back on and it hangs again (I ran it for about 5 hours).
If you narrow the search criteria so the query returns fewer rows,
do you still see the problem? Can you identify a "sour spot" where
the problem starts to happen?
I'll do that tomorrow, and let you know.
If I restart the postmaster, the query will complete in the expected
time.
Does the problem eventually start happening again? If so, after
how long? How did you determine that the restart is relevant? Do
you consistently see different (presumably better) performance after
a restart than if you don't restart?
This is a production box, so I can't restart it whenever I want and I
haven't yet reproduced it elsewhere -- the data base size makes that
cumbersome at best -- but once after it hung up, I restarted postgres
and the report ran to completion for a couple of days then started
hanging again. Today it seems to pretty consistently hang, I'll see if
I can restart it overnight and test it again.
-Casey