Re: checkpoint spikes

Janning <ml@xxxxxxxxxxxx> · Thu, 10 Jun 2010 18:49:45 +0200

Hi again,

nobody answered my question :-(, so i did some research. I was convinced to 
set:

checkpoint_segments = 16
checkpoint_timeout = 60min

and 
echo 2 > /proc/sys/vm/dirty_background_ratio

this helped a lot. Our freeze time was reduced from 10 seconds to 5 seconds.
But this is still way too long. I know the best fix would be good disks, but at 
the moment it is not affordable.

Behalf of my research I have lots of questions concerning Checkpoints:

1. With raising checkpoint_timeout, is there any downgrade other than slower 
after-crash recovery?

2. Is there a way to calculate the after-crash recovery time with a certain 
checkpoint_timeout? How long would be approx. for a checkpoint_timeout of 60 
minutes?

3. Is it sane to set checkpoint_timeout to 120min or even to 600min?

4. We have checkpoint_completion_target set to 0.9. The checkpoint is always 
ready much earlier. Is it ought to be like this? I thought the work would be 
spread across 90% of 60 minutes? Instead it takes only 10 Minutes for the 
checkpoint to be finished.

5. Does anybody know if I can set dirty_background_ratio to 0.5? As we have 12 
GB RAM and rather slow disks 0,5% would result in a maximum of 61MB dirty 
pages. 

Your help is very appreciated?

kind regards
Janning

PS: Do I need to post this question on pgsql-perfomance? If so, please let me 
know.

On Wednesday 09 June 2010 18:53:23 Janning wrote:
> Hi,
>
> we currently encounter an increasing load on our website. With the
> increasing load we see some problems on our database. so we checked what
> happens and we saw spikes in our load when checkpoints are about to finish.
>
> Our configuration:
>
> max_connections = 125
> ssl = false
> shared_buffers = 500MB
> work_mem = 15MB
> maintenance_work_mem = 250MB
> synchronous_commit = off
> full_page_writes = off
>
> checkpoint_segments = 10
> checkpoint_timeout = 10min
> checkpoint_completion_target = 0.9
>
> random_page_cost = 2
> effective_cache_size = 5000MB
>
> autovacuum = off
>
> (we put autovacuum to off because we suspected it).
>
> all the other parameters are set to default (beside log parameters and so
> on).
>
> Our machine has 12 GB RAM, i7-975 CPU and a SW-Raid-1 for datadir and
> another one for the rest of the server (including postgresql logfiles).
> Disk are "Barracuda 7200.11 SATA 3Gb/s 1.5-TB" and we are running debian
> lenny.
>
> these are our checkpoint log statements:
>
> 2010-06-09 17:24:27 CEST [6375]: [1-1] LOG:  checkpoint starting: time
> 2010-06-09 17:28:09 CEST [6375]: [2-1] LOG:  checkpoint complete: wrote
> 2861 buffers (4.5%); 0 transaction log file(s) added, 0 removed, 1
> recycled; write=193.057 s, sync=29.259 s, total=222.353 s
> 2010-06-09 17:34:27 CEST [6375]: [3-1] LOG:  checkpoint starting: time
> 2010-06-09 17:39:09 CEST [6375]: [4-1] LOG:  checkpoint complete: wrote
> 3247 buffers (5.1%); 0 transaction log file(s) added, 0 removed, 2
> recycled; write=255.255 s, sync=26.911 s, total=282.177 s
> 2010-06-09 17:44:27 CEST [6375]: [5-1] LOG:  checkpoint starting: time
> 2010-06-09 17:49:41 CEST [6375]: [6-1] LOG:  checkpoint complete: wrote
> 2746 buffers (4.3%); 0 transaction log file(s) added, 0 removed, 2
> recycled; write=280.743 s, sync=33.392 s, total=314.147 s
> 2010-06-09 17:54:27 CEST [6375]: [7-1] LOG:  checkpoint starting: time
> 2010-06-09 17:58:59 CEST [6375]: [8-1] LOG:  checkpoint complete: wrote
> 3118 buffers (4.9%); 0 transaction log file(s) added, 0 removed, 1
> recycled; write=253.293 s, sync=18.585 s, total=271.892 s
> 2010-06-09 18:04:27 CEST [6375]: [9-1] LOG:  checkpoint starting: time
> 2010-06-09 18:08:46 CEST [6375]: [10-1] LOG:  checkpoint complete: wrote
> 2695 buffers (4.2%); 0 transaction log file(s) added, 0 removed, 2
> recycled; write=225.173 s, sync=33.789 s, total=258.972 s
> 2010-06-09 18:14:27 CEST [6375]: [11-1] LOG:  checkpoint starting: time
> 2010-06-09 18:18:30 CEST [6375]: [12-1] LOG:  checkpoint complete: wrote
> 2868 buffers (4.5%); 0 transaction log file(s) added, 0 removed, 2
> recycled; write=215.561 s, sync=27.701 s, total=243.271 s
>
> What we saw is a rather long sync time. And exactly at this time, our
> responses become slow and the server load increases.
>
> this is from pg_stat_bgwriter:
>
>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean |
> maxwritten_clean | buffers_backend | buffers_alloc
> -------------------+-----------------+--------------------+---------------+
>------------------+-----------------+--------------- 3495 |               0
> |            9070242 |      10798927 | 14421 |         6412707 |    
> 208340755
>
>
> We tried to increase checkpoint_timeout to 20 minutes but it was getting
> worse:
>
> 2010-06-09 18:34:27 CEST [6375]: [13-1] LOG:  checkpoint starting: time
> 2010-06-09 18:42:49 CEST [6375]: [14-1] LOG:  checkpoint complete: wrote
> 2956 buffers (4.6%); 0 transaction log file(s) added, 0 removed, 1
> recycled; write=448.265 s, sync=54.087 s, total=502.377 s
>
> here we see a 54 sec sync time and a much higher load on sync time compared
> to the 10 minutes checkpoint_timeout.
>
> do you have any hints for us how to tune our configuration to avoid spikes?
>
> kind regards
> Janning

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general