Re: Replaying 48 WAL files takes 80 minutes

Heikki Linnakangas <hlinnakangas@xxxxxxxxxx> · Tue, 30 Oct 2012 12:07:48 +0200

On 30.10.2012 10:50, Albe Laurenz wrote:
> Why does WAL replay read much more than it writes?
> I thought that pretty much every block read during WAL
> replay would also get dirtied and hence written out.

Not necessarily. If a block is modified and written out of the buffer 
cache before the next checkpoint, the latest version of the block is 
already on disk. On replay, the redo routine reads the block, sees from 
the page's LSN that the change has already been applied, and does 
nothing.
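
To illustrate why reads can outnumber writes, here is a toy model in 
Python. It is only a sketch, not the actual PostgreSQL redo code: Page, 
replay and demo are hypothetical names, and the WAL is reduced to 
(lsn, block_no, new_value) tuples. The one thing it borrows from the 
real thing is the idea that a record is skipped when the page's LSN 
shows the change is already there.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0    # LSN of the last WAL record applied to this page
    value: int = 0  # stand-in for the page contents


def replay(wal, disk):
    """Replay (lsn, block_no, new_value) records against `disk`.

    Returns (blocks_read, blocks_dirtied). Every block touched by the
    WAL has to be read, but only blocks that are older than some record
    get modified and need to be written back.
    """
    blocks_read = set()
    blocks_dirtied = set()
    for lsn, block_no, new_value in wal:
        page = disk[block_no]      # redo has to look at the block
        blocks_read.add(block_no)
        if page.lsn >= lsn:
            continue               # change already on disk: do nothing
        page.lsn = lsn
        page.value = new_value
        blocks_dirtied.add(block_no)
    return len(blocks_read), len(blocks_dirtied)


def demo():
    # Blocks 1 and 3 were flushed after their last change; block 2 was not.
    disk = {1: Page(lsn=30, value=3),
            2: Page(lsn=10, value=1),
            3: Page(lsn=40, value=9)}
    wal = [(10, 2, 1), (20, 1, 2), (30, 1, 3), (40, 3, 9), (50, 2, 5)]
    return replay(wal, disk)
```

In demo(), three blocks are read but only block 2 is dirtied, because 
the other two were already up to date on disk.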

> I wonder why the performance is good in the first few seconds.
> Why should exactly the pages that I need in the beginning
> happen to be in cache?

This is probably because of full_page_writes=on. When replay has a full 
page image of a block, it doesn't need to read the old contents from 
disk. It can just blindly write the image to disk. Writing a block to 
disk also puts that block in the OS cache, so replay effectively warms 
the cache from the WAL. Hence at the beginning of replay, you just 
write a lot of full page images to the OS cache, which is fast, and you 
only start reading from disk after you've filled up the OS cache.

If this theory is true, you should see a pattern in the I/O stats: in 
the first few seconds there is no disk I/O, but the CPU is 100% busy 
while it reads from WAL and writes out the pages to the OS cache. After 
the OS cache fills up with dirty pages (up to the vm.dirty_ratio limit, 
on Linux), you will start to see a lot of writes. As the replay 
progresses, you will see more and more reads, as you start to get cache 
misses.
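
One way to check for that pattern on Linux is to sample the kernel's 
counters once a second while the replay runs. The following is a rough 
sketch, not a polished tool: it assumes the Dirty field of /proc/meminfo 
and the pgpgin/pgpgout counters of /proc/vmstat are available, and the 
helper names (parse_kv, snapshot, delta, watch) are made up for this 
example. Tools like vmstat or iostat show similar information.

```python
import time


def parse_kv(text):
    """Parse 'name value [unit]' lines as found in /proc/meminfo and /proc/vmstat."""
    out = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            out[parts[0]] = int(parts[1])
    return out


def snapshot(meminfo_text, vmstat_text):
    """Pick out the three counters of interest from the two files' contents."""
    mem = parse_kv(meminfo_text)
    vm = parse_kv(vmstat_text)
    return {"dirty_kb": mem.get("Dirty", 0),
            "pgpgin": vm.get("pgpgin", 0),
            "pgpgout": vm.get("pgpgout", 0)}


def delta(prev, cur):
    """Current amount of dirty memory, plus paged-in/out since `prev`."""
    return {"dirty_kb": cur["dirty_kb"],
            "pgpgin": cur["pgpgin"] - prev["pgpgin"],
            "pgpgout": cur["pgpgout"] - prev["pgpgout"]}


def watch(interval=1.0, samples=10):
    """Print one delta per interval; run this while the replay is going."""
    def read(path):
        with open(path) as f:
            return f.read()

    prev = snapshot(read("/proc/meminfo"), read("/proc/vmstat"))
    for _ in range(samples):
        time.sleep(interval)
        cur = snapshot(read("/proc/meminfo"), read("/proc/vmstat"))
        print(delta(prev, cur))
        prev = cur
```

If the theory holds, the first samples from watch() should show 
dirty_kb climbing with pgpgin and pgpgout near zero, then pgpgout 
rising once writeback kicks in, and pgpgin rising later as reads start 
missing the cache.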

- Heikki

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)