Hi, I'm running Postgres 9.1.1 on an EC2 m1.xlarge instance. The machine is a dedicated master with 2 streaming replication nodes; it has 16GB of RAM and 4 cores.

We're starting to see slow queries, and slow COMMITs in particular are happening more and more frequently. The slow queries are against seemingly well-indexed tables. I have log_min_duration_statement = 150ms.

Slow commits look like:

2011-11-14 17:47:11 UTC pid:14366 (44/0-0) LOG: duration: 3062.784 ms statement: COMMIT
2011-11-14 17:47:11 UTC pid:14604 (48/0-0) LOG: duration: 2593.351 ms statement: COMMIT

These slow COMMITs are against tables that receive a large number of UPDATEs and are growing fairly rapidly.

And slow queries look like:

2011-11-14 17:50:20 UTC pid:6519 (16/0-0) LOG: duration: 1694.456 ms statement: SELECT "facebook_wall_posts".* FROM "facebook_wall_posts" WHERE "facebook_wall_posts"."token" = '984c44e75975b224b38197cf8f8fc76a' LIMIT 1

Query plan: http://explain.depesz.com/s/wbm

The # of rows in facebook_wall_posts is 5841.

Structure of facebook_wall_posts:

                         Table "public.facebook_wall_posts"
       Column       |            Type             |                             Modifiers
--------------------+-----------------------------+------------------------------------------------------------------
 id                 | integer                     | not null default nextval('facebook_wall_posts_id_seq'::regclass)
 album_id           | integer                     | not null
 token              | character varying(32)       | not null
 fb_recipient_id    | character varying(64)       | not null
 post_id            | character varying(100)      | not null
 fb_post_created_at | timestamp without time zone |
 data               | text                        |
 created_at         | timestamp without time zone |
 updated_at         | timestamp without time zone |
 fb_post_deleted_at | timestamp without time zone |
Indexes:
    "facebook_wall_posts_pkey" PRIMARY KEY, btree (id)
    "index_facebook_wall_posts_on_token" UNIQUE, btree (token)
    "index_facebook_wall_posts_on_album_id" btree (album_id)

And another slow query:

2011-11-14 17:52:44 UTC pid:14912 (58/0-0) LOG: duration: 979.740 ms statement: SELECT "facebook_friends"."friend_id" FROM "facebook_friends" WHERE "facebook_friends"."user_id" = 9134671

Query plan: http://explain.depesz.com/s/x1q

The # of rows in facebook_friends is 27075088.

Structure of facebook_friends:

                      Table "public.facebook_friends"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+---------------------------------------------------------------
 id         | integer                     | not null default nextval('facebook_friends_id_seq'::regclass)
 user_id    | integer                     | not null
 friend_id  | integer                     | not null
 created_at | timestamp without time zone |
Indexes:
    "facebook_friends_pkey" PRIMARY KEY, btree (id)
    "index_facebook_friends_on_user_id_and_friend_id" UNIQUE, btree (user_id, friend_id)

We have autovacuum enabled and running, and yesterday I also ran a manual vacuum on the database.

Autovacuum settings:

 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.1
 autovacuum_analyze_threshold    | 50
 autovacuum_freeze_max_age       | 200000000
 autovacuum_max_workers          | 3
 autovacuum_naptime              | 60
 autovacuum_vacuum_cost_delay    | 20
 autovacuum_vacuum_cost_limit    | -1
 autovacuum_vacuum_scale_factor  | 0.2
 autovacuum_vacuum_threshold     | 50

Other postgresql.conf settings:

shared_buffers = 3584MB
wal_buffers = 16MB
checkpoint_segments = 32
max_wal_senders = 10
checkpoint_completion_target = 0.9
wal_keep_segments = 1024
maintenance_work_mem = 256MB
work_mem = 88MB
effective_cache_size = 10GB

The PGDATA dir is a RAID 10 across 4 local ("ephemeral" in EC2 speak) drives.
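In case table bloat is relevant here (that token lookup is against a table with only 5841 rows), I can also post dead-tuple counts and last-autovacuum times for the hot tables; I'd collect them with something along these lines (just a sketch against the standard 9.1 pg_stat_user_tables view):

-- top 10 tables by dead tuples, with the last (auto)vacuum timestamps
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;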
I ran some dd tests and got the following output:

--- WRITING ---
root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile bs=8k count=4000000 && sync"
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s

real    11m52.199s
user    0m2.720s
sys     0m45.330s

--- READING ---
root@sql03:/data# time dd of=/dev/zero if=/data/tmp/bigfile bs=8k
4000000+0 records in
4000000+0 records out
32768000000 bytes (33 GB) copied, 155.429 s, 211 MB/s

real    2m35.434s
user    0m2.400s
sys     0m33.160s

I have enabled log_checkpoints, and here is a recent sample from the log:

2011-11-14 17:38:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15121 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=270.101 s, sync=2.989 s, total=273.112 s; sync files=60, longest=1.484 s, average=0.049 s
2011-11-14 17:39:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:43:49 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync files=82, longest=2.943 s, average=0.050 s
2011-11-14 17:44:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:48:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 14159 buffers (3.1%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=269.818 s, sync=2.119 s, total=271.948 s; sync files=71, longest=1.192 s, average=0.029 s
2011-11-14 17:49:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:53:47 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 11337 buffers (2.5%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=269.901 s, sync=2.508 s, total=272.419 s; sync files=71, longest=1.867 s, average=0.035 s
2011-11-14 17:54:15 UTC pid:3965 (-0) LOG: checkpoint starting: time
2011-11-14 17:58:48 UTC pid:3965 (-0) LOG: checkpoint complete: wrote 15706 buffers (3.4%); 0 transaction log file(s) added, 0 removed, 7 recycled; write=270.104 s, sync=3.612 s, total=273.727 s; sync files=67, longest=3.051 s, average=0.053 s
2011-11-14 17:59:15 UTC pid:3965 (-0) LOG: checkpoint starting: time

I've also been collecting random samples from pg_stat_bgwriter: https://gist.github.com/4faec2ca9a79ede281e1

So, given all this information (if you need more, just let me know): is there something fundamentally wrong or misconfigured? Do I have an I/O issue?

Thanks for any insight.
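P.S. By "random samples" from pg_stat_bgwriter I just mean periodic SELECTs against the view, roughly like the sketch below (the now() column is only there to timestamp each sample for correlation with the checkpoint log; the rest are the standard 9.1 pg_stat_bgwriter columns):

-- one sample row; run periodically (e.g. from cron) and append to a log
SELECT now() AS sampled_at,
       checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend,
       buffers_backend_fsync,
       buffers_alloc
  FROM pg_stat_bgwriter;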