Heikki Linnakangas wrote:
>> Why does WAL replay read much more than it writes?
>> I thought that pretty much every block read during WAL
>> replay would also get dirtied and hence written out.
>
> Not necessarily. If a block is modified and written out of the buffer
> cache before next checkpoint, the latest version of the block is already
> on disk. On replay, the redo routine reads the block, sees that the
> change was applied, and does nothing.

True. Could that account for 1000 times more reads than writes?

>> I wonder why the performance is good in the first few seconds.
>> Why should exactly the pages that I need in the beginning
>> happen to be in cache?
>
> This is probably because of full_page_writes=on. When replay has a full
> page image of a block, it doesn't need to read the old contents from
> disk. It can just blindly write the image to disk. Writing a block to
> disk also puts that block in the OS cache, so this also efficiently
> warms the cache from the WAL. Hence in the beginning of replay, you just
> write a lot of full page images to the OS cache, which is fast, and you
> only start reading from disk after you've filled up the OS cache. If
> this theory is true, you should see a pattern in the I/O stats, where in
> the first seconds there is no I/O, but the CPU is 100% busy while it
> reads from WAL and writes out the pages to the OS cache. After the OS
> cache fills up with the dirty pages (up to dirty_ratio, on Linux), you
> will start to see a lot of writes. As the replay progresses, you will
> see more and more reads, as you start to get cache misses.

That makes sense to me.
Unfortunately I don't have statistics in the required resolution
to verify that.

Thanks for the explanations.

Yours,
Laurenz Albe

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
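
For anyone reading this thread later, the two effects described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not PostgreSQL code: `WalRecord`, `ReplaySim` and their fields are hypothetical names, every page is reduced to a single LSN, and the cache never evicts anything. It shows that a record whose change is already on disk costs a read but no write, and that a full page image costs a write but no read.

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    lsn: int                       # position of this record in the WAL
    block: int                     # block the record modifies
    full_page_image: bool = False  # True if the record carries the whole page


class ReplaySim:
    """Toy model of WAL replay; each page is represented only by its LSN."""

    def __init__(self, disk_lsn):
        self.disk_lsn = dict(disk_lsn)  # block -> LSN of the on-disk version
        self.cache = {}                 # block -> LSN of the cached version
        self.dirty = set()              # blocks that will have to be written
        self.reads = 0                  # number of reads from disk

    def replay(self, rec):
        if rec.full_page_image:
            # The image replaces the page, so the old contents are never read.
            self.cache[rec.block] = rec.lsn
            self.dirty.add(rec.block)
            return
        if rec.block not in self.cache:
            # Cache miss: the old page has to be read before it can be checked.
            self.reads += 1
            self.cache[rec.block] = self.disk_lsn.get(rec.block, 0)
        if self.cache[rec.block] >= rec.lsn:
            # The change is already in the page: nothing to do, nothing dirtied.
            return
        self.cache[rec.block] = rec.lsn
        self.dirty.add(rec.block)


if __name__ == "__main__":
    # Blocks 1-4 were flushed at LSN 100, after the changes logged below.
    sim = ReplaySim({1: 100, 2: 100, 3: 100, 4: 100})
    for rec in [WalRecord(10, 1), WalRecord(20, 2), WalRecord(30, 3),
                WalRecord(40, 4), WalRecord(50, 5, full_page_image=True)]:
        sim.replay(rec)
    print("reads:", sim.reads, "dirty blocks:", sorted(sim.dirty))
```

In this run the four ordinary records each cause a read and dirty nothing, while the full page image for block 5 dirties a page without any read. Whether that mechanism can explain a ratio as large as 1000 to 1 depends on the workload, and the model says nothing about that.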