Hi all,
We've recently started having a problem where a query that normally executes in ~15ms starts to take upwards of 20s to complete. When the connection that ran query is returned to the connection pool, it appears as though a transaction is still in progress so the connection pool tries to cancel the transaction and close the connection. This fails and the connection is removed from the connection pool. At this point, the situation rapidly degrades and we run out of connections to the postgres server.
An inspection of the pg_stat_activity table shows that practically every connection is running the above-mentioned query and some of those queries have been active for many minutes! We've looked at the pg_locks table as well and the only exclusive locks are on transactions that are open. All other locks are AccessShareLocks. Also, as far as we can tell (from looking at the Hibernate stats), every db session that is opened is closed.
When this happens, if I kill one of the running postgres processes (just by picking the last process returned from "ps -ef | grep postgres"), the other queries will immediately finish and the system will respond. However, within 15 minutes, we'll be back in the same state as before. At that point, I've cycled Apache, Tomcat and Postgres and the system then seems to come back.
This problem appears to be unrelated to load and in fact, the majority of the time there is very little load on the site when this occurs. We've run load tests on our dev boxes but we've been unable to reproduce the problem. We're currently working on playing back the clicks on the site previous to the weird state the site gets in and at the same time, we were wondering if anyone has experienced a problem like this or has any suggestions.
The query in question is:
select distinct s.screening_id, f.film_id, f.title, s.period_start, f.runtime, c.value, v.short_name, s.parent_id,
stats.avg_rating, coalesce(stats.num_adds, 0) as num_adds, coalesce(stats.unique_visits, 0) as unique_visits,
f.*, s.*
from lte_screening s
inner join lte_film f on s.film_id = f.film_id
inner join lte_venue v on
s.venue_id = v.venue_id
inner join lte_film_classifier c on c.film_id = f.film_id
left join lte_film_stats stats on stats.context = :context and stats.film_id = s.film_id
where c.name = ? and s.period_start is not null and s.festival_id = ?
and s.period_start
between ? + ? and ? + ?
order by s.period_start, f.title;
And the result of explain analyze:
QUERY PLAN
Unique (cost=1117.42..1118.71 rows=11 width=866) (actual time=
18.306..18.386 rows=15 loops=1)
-> Sort (cost=1117.42..1117.44 rows=11 width=866) (actual time=18.300..18.316 rows=15 loops=1)
Sort Key: s.period_start, f.title, s.screening_id, f.film_id, f.runtime, c.value, v.short_name, s.parent_id,
stats.avg_rating, COALESCE(stats.num_adds, 0), COALESCE(stats.unique_visits, 0::bigint), f.film_id, f.sku, f.title, f."template", f.release_date, f.runtime, f."language", f.country, f.mpaa_rating, f.synopsis
, f."owner", f.ext_sales_rank, f.small_image_url, f.medium_image_url, f.large_image_url, f.detail_page, f.to_delete, f.coalesce_to, (subplan), (subplan), s.screening_id, s.period_start, s.period_end, s.ticket_price
, s.tickets_avail, s.tickets_sold, s."type", s.venue_id, s.festival_id, s.film_id, s.parent_id, s.ext_id, s.purchase_url, s.status, s.status_update_time
-> Nested Loop Left Join (cost=2.62..1117.23 rows=11 width=866) (actual time=2.656..17.773 rows=15 loops=1)
-> Nested Loop (cost=2.62..976.00 rows=11 width=846) (actual time=2.347..16.162 rows=15 loops=1)
-> Hash Join (cost=
2.62..929.09 rows=10 width=831) (actual time=2.217..15.480 rows=15 loops=1)
Hash Cond: ("outer".venue_id = "inner".venue_id)
-> Nested Loop (cost=0.00..926.32 rows=10 width=818) (actual time=1.915..14.974 rows=15 loops=1)
-> Seq Scan on lte_screening s (cost=0.00..886.67 rows=10 width=159) (actual time=
1.830..14.314 rows=15 loops=1)
Filter: ((period_start IS NOT NULL) AND (festival_id = 316372) AND (period_start >= '2006-05-19 05:00:00'::timestamp without time zone) AND (period_start <= '2006-05-20 04:59:59'::timestamp without time zone))
-> Index Scan using lte_film_pkey on lte_film f (cost=0.00..3.95 rows=1 width=659) (actual time=
0.026..0.028 rows=1 loops=15)
Index Cond: ("outer".film_id = f.film_id
)
-> Hash (cost=2.50..2.50 rows=50 width=21) (actual time=0.215..0.215 rows=0 loops=1)
-> Seq Scan on lte_venue v (cost=0.00..2.50 rows=50 width=21) (actual time=0.012..0.126
rows=52 loops=1)
-> Index Scan using idx_classifier_film on lte_film_classifier c (cost=0.00..4.67
rows=2 width=23) (actual time=0.026..0.028 rows=1 loops=15)
Index Cond: (c.film_id = "outer".film_id)
Filter: ((name)::text = 'FestivalCategory'::text)
-> Index Scan using lte_film_stats_pkey on lte_film_stats stats (cost=0.00..4.34 rows=1 width=28) (actual time=0.034..0.037 rows=1 loops=15)
Index Cond: ((stats.context = 316372) AND (stats.film_id = "outer".film_id))
SubPlan
-> Index Scan using idx_collateral_film on lte_film_collateral c (cost=0.00..4.24
rows=1 width=40) (actual time=0.009..0.011 rows=1 loops=15)
Index Cond: (film_id = $0)
Filter: ((name)::text = 'TVRating'::text)
-> Index Scan using idx_collateral_film on lte_film_collateral c (cost=
0.00..4.24 rows=1 width=40) (actual time=0.022..0.025 rows=1 loops=15)
Index Cond: (film_id = $0)
Filter: ((name)::text = 'IMDBId'::text)
Total runtime: 19.077 ms
Here is our setup:
We have 2 machines. The first is the web server and the db server and the second is just another web server:
Machine A
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz HyperThreaded Processor
- CentOS 4.3
- Linux moe 2.6.9-22.ELsmp #1 SMP Sat Oct 8 19:11:43 CDT 2005 i686 i686 i386 GNU/Linux
Machine B
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz Processor
- CentOS 4.3
- Linux larry 2.6.9-22.0.1.EL #1 Thu Oct 27 12:26:11 CDT 2005 i686 i686 i386 GNU/Linux
We're using the following software:
- Apache 2.0.52
- Tomcat 5.5.17
- Postgres 8.0.6
- JDK 1.5.0-Release 6
- Proxool
0.8.3
- Hibernate 3.1.3
Thanks in advance for any help,
Meetesh