Hi, (I have put the contents of this email at http://pastebin.com/6h49zrYE for better readibility) I have been circling around the following problem for a few hours and I am wondering if anyone has any suggestions about why the the second query is taking so much longer than the first. Both tables have similar row counts, all tables are partitions, though the queries are querying the partitions directly, not the parent tables. The 'DELIVERED' status comprises almost 100% of the rows in the notification_xxx tables. ANALYZE was run right before each query was issued. This test database is on EC2. PG 9.2.4, CentOS release 5.9, 70GB RAM. GUC changes listed at the bottom. Are there any other details I can supply? Note that the same behaviour occurred on 9.2.3, but I cannot say for sure for whether it occurred on 9.2.2. Dropping the random_page_cost to 1 helped the slow query (after it is cached, drops down to about 1.7s). Setting enable_mergejoin to false causes the slow query to choose a Nested Loop and shaves off about 30% of the execution time. -- Fast query, first run, uncached, ORDER BY nearly-unique tstamp_utc explain (analyze, buffers) SELECT e.id,e.device,e.engine,e.form,e.device_type_id,ncbs.delivery_tstamp_utc FROM event_20130404 e INNER JOIN notifications_20130404 ncbs ON (e.id = ncbs.event_id) WHERE TRUE AND e.date_utc = '2013-04-04' AND e.tstamp_utc BETWEEN '2013-04-04 10:00:00' AND '2013-04-04 18:00:00' AND e.org_id = 216471 AND ncbs.event_creation_tstamp_utc BETWEEN '2013-04-04 10:00:00' AND '2013-04-04 18:00:00' AND ncbs.status = 'DELIVERED' ORDER BY e.tstamp_utc desc offset 10000 limit 100; Limit (cost=29639.24..29935.63 rows=100 width=50) (actual time=159.103..167.637 rows=100 loops=1) Buffers: shared hit=46588 read=1055 I/O Timings: read=63.416 -> Nested Loop (cost=0.00..1882500.73 rows=635138 width=50) (actual time=0.240..159.693 rows=10100 loops=1) Buffers: shared hit=46588 read=1055 I/O Timings: read=63.416 -> Index Scan Backward using event_20130404_tstamp_utc_org_id_idx on event_20130404 e (cost=0.00..315412.34 rows=1876877 width=42) (actual time=0.129..61.981 rows=10100 loops=1) Index Cond: ((tstamp_utc >= '2013-04-04 10:00:00'::timestamp without time zone) AND (tstamp_utc <= '2013-04-04 18:00:00'::timestamp without time zone) AND (org_id = 216471)) Filter: (date_utc = '2013-04-04'::date) Buffers: shared hit=6309 read=833 I/O Timings: read=40.380 -> Index Scan using notifications_20130404_event_id_org_id_pk on notifications_20130404 ncbs (cost=0.00..0.82 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=10100) Index Cond: (event_id = e.id) Filter: ((event_creation_tstamp_utc >= '2013-04-04 10:00:00'::timestamp without time zone) AND (event_creation_tstamp_utc <= '2013-04-04 18:00:00'::timestamp without time zone) AND (status = 'DELIVERED'::text)) Buffers: shared hit=40279 read=222 I/O Timings: read=23.036 Total runtime: 170.436 ms (17 rows) -- Slow query, uncached, ORDER BY primary key explain (analyze,buffers) SELECT e.device,e.id,e.engine,e.form,e.device_type_id,ncbs.delivery_tstamp_utc FROM event_20130405 e INNER JOIN notifications_20130405 ncbs ON (e.id = ncbs.event_id) WHERE TRUE AND e.date_utc = '2013-04-05' AND e.tstamp_utc BETWEEN '2013-04-05 10:00:00' AND '2013-04-05 18:00:00' AND e.org_id = 216471 AND ncbs.event_creation_tstamp_utc BETWEEN '2013-04-05 10:00:00' AND '2013-04-05 18:00:00' AND ncbs.status = 'DELIVERED' ORDER BY e.id desc OFFSET 10000 LIMIT 100; Limit (cost=14305.61..14447.99 rows=100 width=42) (actual time=13949.028..13950.353 rows=100 loops=1) Buffers: shared hit=2215465 read=141796 written=12128 I/O Timings: read=11308.341 write=116.673 -> Merge Join (cost=67.31..887063.35 rows=622965 width=42) (actual time=13761.135..13933.879 rows=10100 loops=1) Merge Cond: (e.id = ncbs.event_id) Buffers: shared hit=2215465 read=141796 written=12128 I/O Timings: read=11308.341 write=116.673 -> Index Scan Backward using event_20130405_id_pk on event_20130405 e (cost=0.00..612732.34 rows=1889715 width=34) (actual time=2076.812..2111.274 rows=10100 loops=1) Filter: ((tstamp_utc >= '2013-04-05 10:00:00'::timestamp without time zone) AND (tstamp_utc <= '2013-04-05 18:00:00'::timestamp without time zone) AND (date_utc = '2013-04-05'::date) AND (org_id = 216471)) Rows Removed by Filter: 1621564 Buffers: shared hit=1176391 read=113344 written=11918 I/O Timings: read=774.769 write=113.095 -> Index Scan Backward using notifications_20130405_event_id_org_id_pk on notifications_20130405 ncbs (cost=0.00..258079.61 rows=2135847 width=16) (actual time=11684.312..11784.738 rows=11653 loops=1) Filter: ((event_creation_tstamp_utc >= '2013-04-05 10:00:00'::timestamp without time zone) AND (event_creation_tstamp_utc <= '2013-04-05 18:00:00'::timestamp without time zone) AND (status = 'DELIVERED'::text)) Rows Removed by Filter: 1620011 Buffers: shared hit=1039074 read=28452 written=210 I/O Timings: read=10533.572 write=3.578 Total runtime: 13950.458 ms (18 rows) select * from admin.changed_guc_settings ; name | current_setting ---------------------------------+--------------------------------------------------------------------------------------------------------------- version | PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit autovacuum_analyze_scale_factor | 0.01 autovacuum_naptime | 20s autovacuum_vacuum_cost_delay | 5ms autovacuum_vacuum_cost_limit | 1000 autovacuum_vacuum_scale_factor | 0.01 autovacuum_vacuum_threshold | 50 bgwriter_lru_maxpages | 400 checkpoint_completion_target | 0.8 checkpoint_segments | 64 client_encoding | UTF8 commit_delay | 10 effective_cache_size | 54GB effective_io_concurrency | 4 lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 log_autovacuum_min_duration | 0 log_filename | postgresql-%Y-%m-%d.log log_line_prefix | %t [%p]: [%l-1] (user=%u) (rhost=%h) (db=%d) log_lock_waits | on log_min_duration_statement | 100ms log_temp_files | 0 logging_collector | on maintenance_work_mem | 6GB max_connections | 200 max_locks_per_transaction | 100 max_stack_depth | 6MB pg_stat_statements.max | 1000 pg_stat_statements.track | all random_page_cost | 1 search_path | "$user",public,admin server_encoding | UTF8 shared_buffers | 6GB shared_preload_libraries | pg_stat_statements track_activity_query_size | 6144 track_functions | pl track_io_timing | on vacuum_cost_limit | 2000 wal_buffers | 16MB wal_keep_segments | 100 wal_level | minimal work_mem | 900MB test01=# \d event_20130406 Table "public.event_20130406" Column | Type | Modifiers -----------------------+-----------------------------+------------------------------------------------------------------- id | bigint | not null default nextval('event_id_seq'::regclass) org_id | integer | not null date_utc | date | not null default (timezone('UTC'::text, clock_timestamp()))::date tstamp_utc | timestamp without time zone | not null default timezone('UTC'::text, clock_timestamp()) device | text | not null provider | text | not null form | text | engine | text | user_service_provider | text | sender | text | properties | json | delivery_windows | json | device_type_id | smallint | Indexes: "event_20130406_id_pk" PRIMARY KEY, btree (id) WITH (fillfactor=100) "event_20130406_date_utc_idx" btree (date_utc) WITH (fillfactor=100) "event_20130406_device_type_id_idx" btree (device_type_id) WITH (fillfactor=100) "event_20130406_org_id_idx" btree (org_id) WITH (fillfactor=100) "event_20130406_tstamp_utc_org_id_idx" btree (tstamp_utc, org_id) WITH (fillfactor=100) Check constraints: "event_20130406_date_utc_check" CHECK (date_utc = '2013-04-06'::date) test01=# \d notification_counts_by_status_20130406 Table "public.notification_counts_by_status_20130406" Column | Type | Modifiers ---------------------------+-----------------------------+----------- event_id | bigint | not null org_id | integer | not null status | text | not null tstamp_utc | timestamp without time zone | not null event_creation_tstamp_utc | timestamp without time zone | not null delivery_tstamp_utc | timestamp without time zone | Indexes: "notification_counts_by_status_20130406_event_id_org_id_pk" PRIMARY KEY, btree (event_id, org_id) "ncbs_20120406_delivery_tstamp_utc_pidx" btree (delivery_tstamp_utc) WHERE delivery_tstamp_utc IS NOT NULL "notification_counts_by_status_20130406_ectu_idx" btree (event_creation_tstamp_utc) WITH (fillfactor=66) "notification_counts_by_status_20130406_org_id_idx" btree (org_id) WITH (fillfactor=98) "notification_counts_by_status_20130406_status_idx" btree (status) WITH (fillfactor=66) Check constraints: "ncbs_20130406_ectu_ck" CHECK (event_creation_tstamp_utc >= '2013-04-06 00:00:00'::timestamp without time zone AND event_creation_tstamp_utc < '2013-04-07 00:00:00'::timestamp without time zone) Inherits: notification_counts_by_status -- Row counts, in case that is useful test01=# select count(*) from event_20130404: 6,354,940 test01=# select count(*) from event_20130404 where tstamp_utc BETWEEN '2013-04-04 10:00:00' AND '2013-04-04 18:00:00'; 2,157,002 test01=# select count(*) from event_20130405 where tstamp_utc BETWEEN '2013-04-05 10:00:00' AND '2013-04-05 18:00:00'; 2,160,009 test01=# select count(*) from event_20130405; 6,479,278 test01=# select count(*) from notification_counts_by_status_20130404 where event_creation_tstamp_utc BETWEEN '2013-04-04 10:00:00' AND '2013-04-04 18:00:00'; 2,157,000 test01=# select count(*) from notification_counts_by_status_20130405 where event_creation_tstamp_utc BETWEEN '2013-04-05 10:00:00' AND '2013-04-05 18:00:00'; 2,160,009 -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance