Hi everyone,

I'm running 8.4.2 on a CentOS machine, and postgres recently died with signal 6 because the pg_xlog partition (33GB) filled up on 7/4/10 at 10:34:23 (perfect timing, as I was hiking in the mountains in the remotest parts of our country). I did some digging and found the following:

-- The current state of /db/data/pg_xlog (after postgres died on 7/4/10) shows 2056 files, all created between 7/4/10 10:01 and 7/4/10 10:34. A quick bash-kung-fu check shows these files were created at rates ranging from 3 per minute (10:22) to 106 per minute (10:12).

-- wal_buffers=8MB; commit_siblings=200; commit_delay=0; checkpoint_segments=16. All other WAL-related parameters are either defaults or commented out.

-- syslog shows that on 7/2/10, from 10:16:11 to 10:21:51, messages like the following occurred 29 times:

> Jul 2 10:16:11 db4.sac postgres[20526]: [56-1] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]LOG: checkpoints are occurring too frequently (4 seconds apart)
> Jul 2 10:16:11 db4.sac postgres[20526]: [56-2] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]HINT: Consider increasing the configuration parameter "checkpoint_segments".
-- On 7/4/10, from 09:09:02 to 09:10:08, the same type of message occurred 21 times in syslog.

-- This message did not occur at all on 7/3/10.

-- Immediately preceding the "checkpoints are occurring too frequently" syslog entries are autovacuum entries:

> Jul 2 10:16:04 db4.sac postgres[11357]: [7-1] 2010-07-02 10:16:04.576 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_event": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-2] pages: 0 removed, 25 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-3] tuples: 676 removed, 955 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-4] system usage: CPU 0.01s/0.00u sec elapsed 0.02 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [8-1] 2010-07-02 10:16:04.580 PDT [user=,db= PID:11357 XID:197431667]LOG: automatic analyze of table "tii._sac_global.sl_event" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-1] 2010-07-02 10:16:04.965 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_confirm": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-2] pages: 0 removed, 154 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-3] tuples: 1834 removed, 8385 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-4] system usage: CPU 0.32s/0.04u sec elapsed 0.37 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-1] 2010-07-04 09:08:56.107 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.city": index scans: 0
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-2] pages: 0 removed, 151 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-3] tuples: 0 removed, 20223 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-1] 2010-07-04 09:08:56.118 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.country": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-2] pages: 0 removed, 2 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-3] tuples: 77 removed, 185 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-1] 2010-07-04 09:08:56.335 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.gm3_clipboard": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-2] pages: 0 removed, 897 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-3] tuples: 2 removed, 121594 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-4] system usage: CPU 0.07s/0.08u sec elapsed 0.20 sec

...snip...

> Jul 4 09:10:25 db4.sac postgres[22066]: [23-1] 2010-07-04 09:10:25.921 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_assignment": index scans: 1
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-2] pages: 0 removed, 995 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-3] tuples: 323 removed, 83964 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-4] system usage: CPU 0.01s/0.09u sec elapsed 0.52 sec
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-1] 2010-07-04 09:10:25.978 PDT [user=,db= PID:22073 XID:0]LOG: automatic vacuum of table "tii.public.pm_question_type": index scans: 0
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-2] pages: 0 removed, 1 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-3] tuples: 0 removed, 2 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-1] 2010-07-04 09:10:26.301 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_association_rule": index scans: 1
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-2] pages: 0 removed, 286 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-3] tuples: 46 removed, 51321 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-4] system usage: CPU 0.00s/0.03u sec elapsed 0.34 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-1] 2010-07-04 09:10:26.328 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.role": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-3] tuples: 0 removed, 5 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-1] 2010-07-04 09:10:26.373 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_review_type": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-3] tuples: 0 removed, 4 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-1] 2010-07-04 09:10:26.428 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.permission": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-3] tuples: 0 removed, 8 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec

-- log_min_duration_statement=1000, so there is a chance that a volume of very fast write traffic is being masked, but I'm doubtful, as this is our low-usage season (we run a web service for the education sector).

-- Another tidbit: I am using Slony 2.0.3 for replication, and my sync_interval is 500. I'm doubtful that Slony traffic is responsible, though, because my other nodes use the exact same config and hardware, and none of them had this issue.

-- My last administrative action on this machine was a SIGHUP on 6/29/10 to push changes to pg_hba.conf. Just now, I diff-ed my postgresql.conf against the copy I keep in svn: no changes at all.
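For reference, the bash-kung-fu check I mentioned was roughly the following (a sketch, not exact: it assumes GNU ls and uses each file's mtime as a stand-in for creation time, which is close enough for freshly written 16MB WAL segments; adjust the path for your cluster):

```shell
# Count pg_xlog files grouped by the minute they were last written.
# NF >= 7 skips the "total" line from ls -l; $6 is the timestamp field
# because the +FORMAT below contains no spaces.
count_per_minute() {
  ls -l --time-style=+%Y-%m-%d_%H:%M "$1" 2>/dev/null \
    | awk 'NF >= 7 { print $6 }' \
    | sort | uniq -c
}

count_per_minute /db/data/pg_xlog
```

Each output line is "<count> <minute>", which is how I got the 3-per-minute to 106-per-minute spread.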
This leads me to believe there was a sudden flurry of write activity, and checkpoints couldn't complete fast enough for old WAL segments to be recycled, so pg_xlog filled the disk. (For scale, if I'm reading the 8.4 docs right: with checkpoint_segments=16 and the default checkpoint_completion_target=0.5, pg_xlog should normally hold at most about (2 + 0.5) * 16 + 1 = 41 segments, roughly 650MB at 16MB each. 2056 segments is about 32GB, which is essentially the whole partition.)

I'm wondering if anyone else out there can give me some insight or comments on my assessment: is it accurate? Any input would be helpful, and I'll make whatever architectural changes are necessary to keep this from happening again.

Your help is much appreciated!

--Richard

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance