On 30.10.2012 10:50, Albe Laurenz wrote:
Why does WAL replay read much more than it writes? I thought that pretty much every block read during WAL replay would also get dirtied and hence written out.
Not necessarily. If a block is modified and written out of the buffer cache before the next checkpoint, the latest version of the block is already on disk. On replay, the redo routine reads the block, sees that the change has already been applied, and does nothing.
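To make the read-without-write case concrete, here is a minimal toy model in Python. It is only a sketch, not PostgreSQL code: the names Page, WalRecord and replay are made up for illustration, and it assumes the redo routine decides whether a change was already applied by comparing the LSN stored on the page with the LSN of the WAL record.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0    # LSN of the last change applied to this page (assumed model)
    value: int = 0  # stand-in for the page contents


@dataclass
class WalRecord:
    lsn: int    # position of this record in the WAL
    block: int  # block the record modifies
    delta: int  # stand-in for the change itself


def replay(records, disk):
    """Replay records against disk (block number -> Page).

    Returns (blocks_read, blocks_dirtied).
    """
    read_blocks = set()
    dirtied = set()
    for rec in records:
        page = disk[rec.block]
        read_blocks.add(rec.block)  # the block must be read to inspect its LSN
        if page.lsn >= rec.lsn:
            continue  # change is already on disk: nothing to do, nothing to write
        page.value += rec.delta
        page.lsn = rec.lsn
        dirtied.add(rec.block)
    return len(read_blocks), len(dirtied)


# Block 0 was flushed after its last change (LSN 30); block 1 was not.
disk = {0: Page(lsn=30, value=7), 1: Page()}
wal = [WalRecord(10, 0, 3), WalRecord(20, 1, 5), WalRecord(30, 0, 4)]
print(replay(wal, disk))  # two blocks read, only one dirtied
```

In this model both blocks are read, but only block 1 is dirtied, which is how replay can read more than it writes.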
I wonder why the performance is good in the first few seconds. Why should exactly the pages that I need in the beginning happen to be in cache?
This is probably because of full_page_writes=on. When replay has a full page image of a block, it doesn't need to read the old contents from disk. It can just blindly write the image to disk. Writing a block to disk also puts that block in the OS cache, so this also efficiently warms the cache from the WAL. Hence at the beginning of replay, you just write a lot of full page images to the OS cache, which is fast, and you only start reading from disk after you've filled up the OS cache.

If this theory is true, you should see a pattern in the I/O stats: in the first seconds there is no I/O, but the CPU is 100% busy while it reads from WAL and writes out the pages to the OS cache. After the OS cache fills up with dirty pages (up to dirty_ratio, on Linux), you will start to see a lot of writes. As the replay progresses, you will see more and more reads, as you start to get cache misses.
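The effect of full page images on reads can be sketched with a small simulation. This is a toy model under stated assumptions, not a description of the real kernel or of PostgreSQL: the OS cache is modelled as a plain LRU of cache_blocks entries, each WAL record is reduced to a (block, has_full_page_image) pair, and the function name replay_reads is made up.

```python
from collections import OrderedDict


def replay_reads(records, cache_blocks):
    """Count disk reads needed to replay records.

    records: iterable of (block, has_full_page_image) pairs.
    cache_blocks: capacity of the modelled OS cache (simple LRU, an assumption).
    """
    cache = OrderedDict()
    reads = 0
    for block, has_fpi in records:
        if block in cache:
            cache.move_to_end(block)  # cache hit: no disk read
            continue
        if not has_fpi:
            reads += 1  # cache miss without an image: old contents come from disk
        # With a full page image the block is written blindly, which also
        # places it in the cache without any read.
        cache[block] = True
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return reads


wal = [(0, True), (1, True), (0, False), (2, True), (1, False), (0, False)]
print(replay_reads(wal, cache_blocks=3))  # everything fits in cache
print(replay_reads(wal, cache_blocks=2))  # evictions force reads later on
```

With room for all three blocks, no reads are needed; with room for only two, the later records miss the cache and have to read from disk, which matches the pattern of reads appearing only as replay progresses.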
- Heikki