bgwriter autotuning might be unnecessarily penalizing bursty workloads

Josh Berkus <josh@xxxxxxxxxxxx> · Wed, 17 Jul 2013 15:46:23 -0700

Greg, All:

So, I've been running some stats over some customer workloads who are
having issues with checkpoint spikes, and I'm finding that the bgwriter
is almost entirely ineffective for them:

pct_checkpoints_req     | 33.0
avg_frequency_min       | 2.30
avg_write_time_s        | 112.17
avg_sync_time_s         | 1.33
mb_written              | 2387369.6
mb_written_per_min      | 61.01
mb_per_checkpoint       | 82.27
pct_checkpoint_buffers  | 58.6
pct_bgwriter_buffers    | 0.3
pct_backend_buffers     | 41.1
bgwriter_halt_freq      | 0.06
bgwriter_halt_potential | 70.11
buffer_allocation_ratio | 1.466

(query for the above is below)

The key metric I'm looking at is that the bgwriter only took care of
0.3% of buffers.  Yet average write throughput is around 1mb/s, and the
bgwriter is capable of flushing 4mb/s, if it's waking up every 200ms.

Of course, our first conclusion is that the writes are very bursty and
the bgwriter is frequently hitting lru_maxpages.  In fact, that seems to
be the case, per bgwriter_halt_potential above (this is a measurement of
the % of the time the bgwriter halted vs if all buffer writes were done
in one continuous session where it couldn't keep up).  And from the raw
pg_stat_bgwriter:

maxwritten_clean      | 6950

So we're halting because we hit lru_maxpages a *lot*, which is keeping
the bgwriter from keeping up.  What gives?

Well, digging into the docs, one thing I noticed was this note:

"It then sleeps for bgwriter_delay milliseconds, and repeats. When there
are no dirty buffers in the buffer pool, though, it goes into a longer
sleep regardless of bgwriter_delay."

combined with:

"The number of dirty buffers written in each round is based on the
number of new buffers that have been needed by server processes during
recent rounds"

... so Greg built in the bgwriter autotuner with ramp-up/down behavior,
where it sleeps longer and writes less if it hasn't been busy lately.
But given the stats I'm looking at, I'm wondering if that isn't too much
of a double-whammy for people with bursty workloads.

That is, if you have several seconds of inactivity followed by a big
write, then the bgwriter will wake up slowly (which, btw, is not
manually tunable), and then write very little when it does wake up, at
least in the first round.

Of course, I may be misinterpreting the data in front of me ... I'm
currently running a week-long test of raising lru_maxpages and
decreasing bgwriter_delay to see how it affects things ... but I wanted
to discuss it on-list.

Bgwriter stats query follows:

with bgstats as (
    select checkpoints_timed,
        checkpoints_req,
        checkpoints_timed + checkpoints_req as checkpoints,
        checkpoint_sync_time,
        checkpoint_write_time,
        buffers_checkpoint,
        buffers_clean,
        maxwritten_clean,
        buffers_backend,
        buffers_backend_fsync,
        buffers_alloc,
        buffers_checkpoint + buffers_clean + buffers_backend as
total_buffers,
        round(extract('epoch' from now() - stats_reset)/60)::numeric as
min_since_reset,
        lru.setting::numeric as bgwriter_maxpages,
        delay.setting::numeric as bgwriter_delay
    from pg_stat_bgwriter
        cross join pg_settings as lru
        cross join pg_settings as delay
    where lru.name = 'bgwriter_lru_maxpages'
        and delay.name = 'bgwriter_delay'
)
select
    round(checkpoints_req*100/checkpoints,1) as pct_checkpoints_req,
    round(min_since_reset/checkpoints,2) as avg_frequency_min,
    round(checkpoint_write_time::numeric/(checkpoints*1000),2) as
avg_write_time_s,
    round(checkpoint_sync_time::numeric/(checkpoints*1000),2) as
avg_sync_time_s,
    round(total_buffers/128::numeric,1) as mb_written,
    round(total_buffers/(128 * min_since_reset),2) as mb_written_per_min,
    round(buffers_checkpoint/(checkpoints*128::numeric),2) as
mb_per_checkpoint,
    round(buffers_checkpoint*100/total_buffers::numeric,1) as
pct_checkpoint_buffers,
    round(buffers_clean*100/total_buffers::numeric,1) as
pct_bgwriter_buffers,
    round(buffers_backend*100/total_buffers::numeric,1) as
pct_backend_buffers,

round(maxwritten_clean*100::numeric/(min_since_reset*60000/bgwriter_delay),2)
as bgwriter_halt_freq,

round(maxwritten_clean*100::numeric/(buffers_clean/bgwriter_maxpages),2)
as bgwriter_halt_potential,
    round(buffers_alloc::numeric/total_buffers,3) as buffer_allocation_ratio
from bgstats;

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance