Server hanging with high CPU wait -- Checkpoint issue?

George Woodring <george.woodring@xxxxxxxxxx> · Mon, 3 Feb 2014 14:16:55 -0500

I am running 9.3.2 on CentOS 6.5
We have a new server we are migrating to that seems to lock up for up to 30sec at a time with all of the processes in wait.  I am seeing a weird output with the checkpoint logs and am wondering if I need to adjust something with the checkpointer config.

My checkpoint settings are:

checkpoint_completion_target = 0.9
checkpoint_timeout = 12min
checkpoint_segments = 50		# in logfile segments, min 1, 16MB each
checkpoint_warning = 300s		# 0 disables

When the box is running will, the checkpoint logs look like the following, checkpoints restarting every 12 minutes and it completing in about 10 minutes.

Feb  3 11:00:09 sponge07b postgres[2754]: [2651-1] LOG:  checkpoint starting: time
Feb  3 11:10:57 sponge07b postgres[2754]: [2652-1] LOG:  checkpoint complete: wrote 32352 buffers (6.2%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=647.749 s, sync=0.110 s, total=647.908 s; sync files=424, longest=0.017 s, average=0.000 s
Feb  3 11:12:09 sponge07b postgres[2754]: [2653-1] LOG:  checkpoint starting: time
Feb  3 11:22:57 sponge07b postgres[2754]: [2654-1] LOG:  checkpoint complete: wrote 33459 buffers (6.4%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=647.737 s, sync=0.146 s, total=647.967 s; sync files=441, longest=0.017 s, average=0.000 s
Feb  3 11:24:09 sponge07b postgres[2754]: [2655-1] LOG:  checkpoint starting: time
Feb  3 11:34:57 sponge07b postgres[2754]: [2656-1] LOG:  checkpoint complete: wrote 33397 buffers (6.4%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=647.630 s, sync=0.085 s, total=647.788 s; sync files=448, longest=0.009 s, average=0.000 s
Feb  3 11:36:09 sponge07b postgres[2754]: [2657-1] LOG:  checkpoint starting: time
Feb  3 11:46:57 sponge07b postgres[2754]: [2658-1] LOG:  checkpoint complete: wrote 32569 buffers (6.2%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=647.633 s, sync=0.118 s, total=647.806 s; sync files=458, longest=0.014 s, average=0.000 s
Feb  3 11:48:09 sponge07b postgres[2754]: [2659-1] LOG:  checkpoint starting: time
Feb  3 11:58:57 sponge07b postgres[2754]: [2660-1] LOG:  checkpoint complete: wrote 34068 buffers (6.5%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=647.739 s, sync=0.129 s, total=647.911 s; sync files=428, longest=0.019 s, average=0.000 s

When we are having the issue just now, the checkpoint logs look like the following.  Checkpoints running every 12 minutes, but the checkpoint is written after 2-3 minutes

Feb  3 12:24:09 sponge07b postgres[2754]: [2665-1] LOG:  checkpoint starting: time
Feb  3 12:27:43 sponge07b postgres[2754]: [2666-1] LOG:  checkpoint complete: wrote 10127 buffers (1.9%); 0 transaction log file(s) added, 4 removed, 28 recycled; write=197.510 s, sync=15.975 s, total=213.746 s; sync files=440, longest=9.089 s, average=0.022 s
Feb  3 12:36:09 sponge07b postgres[2754]: [2667-1] LOG:  checkpoint starting: time
Feb  3 12:39:01 sponge07b postgres[2754]: [2668-1] LOG:  checkpoint complete: wrote 7540 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=168.468 s, sync=3.899 s, total=172.425 s; sync files=450, longest=2.261 s, average=0.008 s
Feb  3 12:48:09 sponge07b postgres[2754]: [2669-1] LOG:  checkpoint starting: time
Feb  3 12:53:57 sponge07b postgres[2754]: [2670-1] LOG:  checkpoint complete: wrote 9990 buffers (1.9%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=340.182 s, sync=7.443 s, total=347.684 s; sync files=472, longest=2.590 s, average=0.015 s
Feb  3 13:00:09 sponge07b postgres[2754]: [2671-1] LOG:  checkpoint starting: time
Feb  3 13:04:07 sponge07b postgres[2754]: [2672-1] LOG:  checkpoint complete: wrote 5769 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=234.841 s, sync=3.164 s, total=238.060 s; sync files=444, longest=2.323 s, average=0.007 s
Feb  3 13:12:09 sponge07b postgres[2754]: [2673-1] LOG:  checkpoint starting: time
Feb  3 13:14:01 sponge07b postgres[2754]: [2674-1] LOG:  checkpoint complete: wrote 5097 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=108.781 s, sync=3.100 s, total=112.037 s; sync files=430, longest=2.365 s, average=0.007 s
Feb  3 13:24:10 sponge07b postgres[2754]: [2675-1] LOG:  checkpoint starting: time
Feb  3 13:29:18 sponge07b postgres[2754]: [2676-1] LOG:  checkpoint complete: wrote 3276 buffers (0.6%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=303.894 s, sync=0.383 s, total=308.004 s; sync files=464, longest=0.274 s, average=0.000 s
Feb  3 13:36:10 sponge07b postgres[2754]: [2677-1] LOG:  checkpoint starting: time
Feb  3 13:38:56 sponge07b postgres[2754]: [2678-1] LOG:  checkpoint complete: wrote 5443 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=166.440 s, sync=0.108 s, total=166.605 s; sync files=426, longest=0.008 s, average=0.000 s

Could our issue be having to do a lot of cleanup before the next checkpoint start causing the box to go into the CPU wait state?

Thanks,
George

iGLASS Networks

www.iglass.net