Hi everyone,

I'm running 8.4.2 on a CentOS machine, and postgres recently died with signal 6 because the pg_xlog partition (33GB) filled up on 7/4/10 at 10:34:23 (perfect timing, as I was hiking in the mountains in the remotest parts of our country). I did some digging and found the following:

-- The current state of /db/data/pg_xlog (after postgres died on 7/4/10) shows 2056 files, all created between 7/4/10 10:01 and 7/4/10 10:34. A quick bash-kung-fu check shows these files were created at rates ranging from 3 per minute (10:22) to 106 per minute (10:12).

-- wal_buffers=8MB; commit_siblings=200; commit_delay=0; checkpoint_segments=16. All other WAL-related parameters are either defaults or commented out.

-- syslog shows that on 7/2/10, from 10:16:11 to 10:21:51, messages like the following occurred 29 times:

> Jul 2 10:16:11 db4.sac postgres[20526]: [56-1] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]LOG: checkpoints are occurring too frequently (4 seconds apart)
> Jul 2 10:16:11 db4.sac postgres[20526]: [56-2] 2010-07-02 10:16:11.117 PDT [user=,db= PID:20526 XID:0]HINT: Consider increasing the configuration parameter "checkpoint_segments".
-- On 7/4/10, from 09:09:02 to 09:10:08, the same type of message occurred 21 times in syslog.

-- This message did not occur at all on 7/3/10.

-- Immediately preceding the "checkpoints are occurring too frequently" syslog entries are autovacuum entries:

> Jul 2 10:16:04 db4.sac postgres[11357]: [7-1] 2010-07-02 10:16:04.576 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_event": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-2] pages: 0 removed, 25 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-3] tuples: 676 removed, 955 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [7-4] system usage: CPU 0.01s/0.00u sec elapsed 0.02 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [8-1] 2010-07-02 10:16:04.580 PDT [user=,db= PID:11357 XID:197431667]LOG: automatic analyze of table "tii._sac_global.sl_event" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-1] 2010-07-02 10:16:04.965 PDT [user=,db= PID:11357 XID:0]LOG: automatic vacuum of table "tii._sac_global.sl_confirm": index scans: 1
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-2] pages: 0 removed, 154 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-3] tuples: 1834 removed, 8385 remain
> Jul 2 10:16:04 db4.sac postgres[11357]: [9-4] system usage: CPU 0.32s/0.04u sec elapsed 0.37 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-1] 2010-07-04 09:08:56.107 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.city": index scans: 0
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-2] pages: 0 removed, 151 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-3] tuples: 0 removed, 20223 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [13-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-1] 2010-07-04 09:08:56.118 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.country": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-2] pages: 0 removed, 2 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-3] tuples: 77 removed, 185 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [14-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-1] 2010-07-04 09:08:56.335 PDT [user=,db= PID:21798 XID:0]LOG: automatic vacuum of table "tii.public.gm3_clipboard": index scans: 1
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-2] pages: 0 removed, 897 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-3] tuples: 2 removed, 121594 remain
> Jul 4 09:08:56 db4.sac postgres[21798]: [15-4] system usage: CPU 0.07s/0.08u sec elapsed 0.20 sec

...snip...

> Jul 4 09:10:25 db4.sac postgres[22066]: [23-1] 2010-07-04 09:10:25.921 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_assignment": index scans: 1
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-2] pages: 0 removed, 995 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-3] tuples: 323 removed, 83964 remain
> Jul 4 09:10:25 db4.sac postgres[22066]: [23-4] system usage: CPU 0.01s/0.09u sec elapsed 0.52 sec
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-1] 2010-07-04 09:10:25.978 PDT [user=,db= PID:22073 XID:0]LOG: automatic vacuum of table "tii.public.pm_question_type": index scans: 0
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-2] pages: 0 removed, 1 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-3] tuples: 0 removed, 2 remain
> Jul 4 09:10:25 db4.sac postgres[22073]: [7-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-1] 2010-07-04 09:10:26.301 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_association_rule": index scans: 1
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-2] pages: 0 removed, 286 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-3] tuples: 46 removed, 51321 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [24-4] system usage: CPU 0.00s/0.03u sec elapsed 0.34 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-1] 2010-07-04 09:10:26.328 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.role": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-3] tuples: 0 removed, 5 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [25-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-1] 2010-07-04 09:10:26.373 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.pm_review_type": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-3] tuples: 0 removed, 4 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [26-4] system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-1] 2010-07-04 09:10:26.428 PDT [user=,db= PID:22066 XID:0]LOG: automatic vacuum of table "tii.public.permission": index scans: 0
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-2] pages: 0 removed, 1 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-3] tuples: 0 removed, 8 remain
> Jul 4 09:10:26 db4.sac postgres[22066]: [27-4] system usage: CPU 0.00s/0.00u sec elapsed 0.01 sec

-- log_min_duration_statement=1000, so there is a chance that a volume of very fast write traffic is being masked, but I'm doubtful, as this is our low-usage season (we run a web service for the education sector).

-- Another tidbit: I am using Slony 2.0.3 for replication, and my sync_interval is 500. I'm doubtful that Slony traffic is responsible, though, because my other nodes use the exact same config and hardware, and none of them had this issue.

-- My last administrative action on this machine was a SIGHUP on 6/29/10 to push changes to pg_hba.conf. Just now, I diff-ed my postgresql.conf against the copy I keep in svn: no changes at all.
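For reference, the bash-kung-fu check I mentioned was roughly the following (a sketch, not exact: it assumes GNU ls and uses each file's mtime as a stand-in for creation time, which is close enough for freshly written 16MB WAL segments; adjust the path for your cluster):

```shell
# Count pg_xlog files grouped by the minute they were last written.
# NF >= 7 skips the "total" line from ls -l; $6 is the timestamp field
# because the +FORMAT below contains no spaces.
count_per_minute() {
  ls -l --time-style=+%Y-%m-%d_%H:%M "$1" 2>/dev/null \
    | awk 'NF >= 7 { print $6 }' \
    | sort | uniq -c
}

count_per_minute /db/data/pg_xlog
```

Each output line is "<count> <minute>", which is how I got the 3-per-minute to 106-per-minute spread.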
This leads me to believe there was a sudden flurry of write activity, and checkpoints couldn't complete fast enough for old WAL segments to be recycled, so pg_xlog filled the disk. (For scale, if I'm reading the 8.4 docs right: with checkpoint_segments=16 and the default checkpoint_completion_target=0.5, pg_xlog should normally hold at most about (2 + 0.5) * 16 + 1 = 41 segments, roughly 650MB at 16MB each. 2056 segments is about 32GB, which is essentially the whole partition.)

I'm wondering if anyone else out there can give me some insight or comments on my assessment: is it accurate? Any input would be helpful, and I'll make whatever architectural changes are necessary to keep this from happening again.

Your help is much appreciated!

--Richard

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance