Re: Decreasing WAL size effects

Aidan Van Dyk <aidan@xxxxxxxxxxx> · Fri, 31 Oct 2008 15:11:48 -0400

* Greg Smith <gsmith@xxxxxxxxxxxxx> [081001 00:00]:

> The overhead of clearing out the whole thing is just large enough that it 
> can be disruptive on systems generating lots of WAL traffic, so you don't 
> want the main database processes bothering with that.  A related fact is  
> that there is a noticeable slowdown to clients that need a segment switch  
> on a newly initialized and fast system that has to create all its WAL  
> segments, compared to one that has been active long enough to only be  
> recycling them.  That's why this sort of thing has been getting pushed  
> into the archive_command path; nothing performance-sensitive that can 
> slow down clients is happening there, so long as your server is powerful 
> enough to handle that in parallel with everything else going on.

> Now, it would be possible to have that less sensitive archive code path  
> zero things out, but you'd need to introduce a way to note when it's been 
> done (so you don't do it for a segment twice) and a way to turn it off so 
> everybody doesn't go through that overhead (which probably means another  
> GUC).  That's a bit much trouble to go through just for a feature with a  
> fairly limited use-case that can easily live outside of the engine  
> altogether.

Remember that the place where this benefit is big is on a generally idle
server. Is it possible to make the "time based WAL switch" zero the tail?  You
don't even need to fsync it for durability (although you may want to, in the
hope of preventing a larger fsync delay on the next commit).

<timid experience=none>
How about something like the attached?  It's been spun quickly, and has passed
the regression tests and some simple hand tests on REL8_3_STABLE.  It seems like
HEAD can't initdb on my machine (quad Opteron with SW RAID1); I tried a few
revisions from the last few days, and initdb dies on them all...

I'm no expert in the PG code; I just grepped around what looked like reasonable
functions in xlog.c until I (hopefully) figured out the basic flow of switching
to new xlog segments.  I *think* I'm using openLogFile and openLogOff
correctly.
</timid>

With archiving enabled, an archive_timeout of 30s, and a few manual
pg_start_backup/pg_stop_backup calls, you can see it *really* does make the
segments compressible...

Its output looks like:
	Archiving 000000010000000000000002
	Archiving 000000010000000000000003
	Archiving 000000010000000000000004
	Archiving 000000010000000000000005
	Archiving 000000010000000000000006
	LOG:  checkpoints are occurring too frequently (10 seconds apart)
	HINT:  Consider increasing the configuration parameter "checkpoint_segments".
	Archiving 000000010000000000000007
	Archiving 000000010000000000000008
	Archiving 000000010000000000000009
	LOG:  checkpoints are occurring too frequently (7 seconds apart)
	HINT:  Consider increasing the configuration parameter "checkpoint_segments".
	Archiving 00000001000000000000000A
	Archiving 00000001000000000000000B
	Archiving 00000001000000000000000C
	LOG:  checkpoints are occurring too frequently (6 seconds apart)
	HINT:  Consider increasing the configuration parameter "checkpoint_segments".
	Archiving 00000001000000000000000D
	LOG:  ZEROING xlog file 0 segment 14 from 12615680 - 16777216 [4161536 bytes]
	STATEMENT:  SELECT pg_stop_backup();
	Archiving 00000001000000000000000E
	Archiving 00000001000000000000000E.00C07098.backup
	LOG:  ZEROING xlog file 0 segment 15 from 8192 - 16777216 [16769024 bytes]
	STATEMENT:  SELECT pg_stop_backup();
	Archiving 00000001000000000000000F
	Archiving 00000001000000000000000F.00000C60.backup
	LOG:  ZEROING xlog file 0 segment 16 from 8192 - 16777216 [16769024 bytes]
	STATEMENT:  SELECT pg_stop_backup();
	Archiving 000000010000000000000010.00000F58.backup
	Archiving 000000010000000000000010
	LOG:  ZEROING xlog file 0 segment 17 from 8192 - 16777216 [16769024 bytes]
	STATEMENT:  SELECT pg_stop_backup();
	Archiving 000000010000000000000011
	Archiving 000000010000000000000011.00000020.backup
	LOG:  ZEROING xlog file 0 segment 18 from 6815744 - 16777216 [9961472 bytes]
	Archiving 000000010000000000000012
	LOG:  ZEROING xlog file 0 segment 19 from 8192 - 16777216 [16769024 bytes]
	Archiving 000000010000000000000013
	LOG:  ZEROING xlog file 0 segment 20 from 16384 - 16777216 [16760832 bytes]
	Archiving 000000010000000000000014
	LOG:  ZEROING xlog file 0 segment 23 from 8192 - 16777216 [16769024 bytes]
	STATEMENT:  SELECT pg_switch_xlog();
	Archiving 000000010000000000000017
	LOG:  ZEROING xlog file 0 segment 24 from 8192 - 16777216 [16769024 bytes]
	Archiving 000000010000000000000018
	LOG:  ZEROING xlog file 0 segment 25 from 8192 - 16777216 [16769024 bytes]
	Archiving 000000010000000000000019

You can see that when DB activity was heavy enough to fill an xlog segment
before the timeout (or an interactive forced switch), it didn't zero anything.  It
only zeroed on a timeout switch, or a forced switch (pg_switch_xlog/pg_stop_backup).
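
The byte counts in the ZEROING lines are just the segment size minus the
write offset at the time of the switch.  A minimal C sketch of that arithmetic
and of the switch condition described above follows.  SEG_SIZE, tail_bytes and
should_zero_tail are hypothetical names used for illustration only; they do
not appear in the patch or in xlog.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 16 MB, the segment size that appears in the log output (16777216 bytes). */
#define SEG_SIZE ((uint32_t) 16 * 1024 * 1024)

/* Bytes between the current write offset and the end of the segment. */
static uint32_t
tail_bytes(uint32_t write_off)
{
	assert(write_off <= SEG_SIZE);
	return SEG_SIZE - write_off;
}

/*
 * Hypothetical predicate: zero the tail only when the switch was forced
 * (timeout or pg_switch_xlog/pg_stop_backup) and a tail actually remains.
 * A segment that filled up on its own has nothing left to zero.
 */
static bool
should_zero_tail(bool forced_switch, uint32_t write_off)
{
	return forced_switch && tail_bytes(write_off) > 0;
}
```

For segment 14 in the log, tail_bytes(12615680) gives the 4161536 bytes
reported there.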

And compressed xlog segments:
	-rw-r--r-- 1 mountie mountie   18477 2008-10-31 14:44 000000010000000000000010.gz
	-rw-r--r-- 1 mountie mountie   16394 2008-10-31 14:44 000000010000000000000011.gz
	-rw-r--r-- 1 mountie mountie 2721615 2008-10-31 14:52 000000010000000000000012.gz
	-rw-r--r-- 1 mountie mountie   16588 2008-10-31 14:52 000000010000000000000013.gz
	-rw-r--r-- 1 mountie mountie   19230 2008-10-31 14:52 000000010000000000000014.gz
	-rw-r--r-- 1 mountie mountie 4920063 2008-10-31 14:52 000000010000000000000015.gz
	-rw-r--r-- 1 mountie mountie 5024705 2008-10-31 14:52 000000010000000000000016.gz
	-rw-r--r-- 1 mountie mountie   18082 2008-10-31 14:52 000000010000000000000017.gz
	-rw-r--r-- 1 mountie mountie   18477 2008-10-31 14:52 000000010000000000000018.gz
	-rw-r--r-- 1 mountie mountie   16394 2008-10-31 14:52 000000010000000000000019.gz
	-rw-r--r-- 1 mountie mountie 2721615 2008-10-31 15:02 00000001000000000000001A.gz
	-rw-r--r-- 1 mountie mountie   16588 2008-10-31 15:02 00000001000000000000001B.gz
	-rw-r--r-- 1 mountie mountie   19230 2008-10-31 15:02 00000001000000000000001C.gz
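
To give a rough feel for why a zeroed tail shrinks so much, here is a toy
run-length size estimate in C.  It is only a stand-in for gzip, which uses a
different algorithm, and rle_size is a hypothetical helper written for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Size in bytes of a naive run-length encoding of buf: every run of up to
 * 255 equal bytes costs two output bytes (count, value).  Not gzip; just
 * enough to show a long run of zeros collapsing to almost nothing.
 */
static size_t
rle_size(const unsigned char *buf, size_t len)
{
	size_t		out = 0;
	size_t		i = 0;

	assert(buf != NULL || len == 0);
	while (i < len)
	{
		size_t		run = 1;

		/* extend the run while the next byte matches and the cap allows */
		while (i + run < len && run < 255 && buf[i + run] == buf[i])
			run++;
		out += 2;
		i += run;
	}
	return out;
}
```

With this encoding a buffer whose bytes all differ from their neighbours
doubles in size, while a zeroed region costs two bytes per 255 input bytes.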

And yes, even the non-zeroed segments compress well here, because
my test load is pretty simple:
	CREATE TABLE TEST
	(
	 a numeric,
	 b numeric,
	 c numeric,
	 i bigint not null
	);


	INSERT INTO test (a,b,c,i)
	  SELECT random(),random(),random(),s FROM generate_series(1,1000000) s;


a.


-- 
Aidan Van Dyk                                             Create like a god,
aidan@xxxxxxxxxxx                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
commit 3916c54126ffade0baad4609467393d9a1b53e37
Author: Aidan Van Dyk <aidan@xxxxxxxxxxx>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero xlog tail on a forced switch
    
    If XLogWrite is called with xlog_switch, an XLog switch has been forced, either
    by a timeout based switch (archive_timeout), or an interactive forced xlog
    switch (pg_switch_xlog/pg_stop_backup).  In those cases, we assume we can
    afford a little extra IO bandwidth to make xlogs so much more compressible.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8bc46da..a8d945d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1548,6 +1548,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 */
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
+				/*
+				 * If we've had an xlog switch forced, then we want to zero
+				 * out the rest of the segment.  We zero it out here because at the
+				 * force switch time, IO bandwidth isn't a problem.
+				 *   -- AIDAN
+				 */
+				if (xlog_switch)
+				{
+					char buf[1024];
+					uint32 left = (XLogSegSize - openLogOff);
+					ereport(LOG,
+						(errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+								openLogId, openLogSeg,
+								openLogOff, XLogSegSize, left)
+						 ));
+					memset(buf, 0, sizeof(buf));
+					while (left > 0)
+					{
+						size_t len = (left > sizeof(buf)) ? sizeof(buf) : left;
+						write(openLogFile, buf, len);
+						left -= len;
+					}
+				}
+
 				issue_xlog_fsync();
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
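
The zeroing loop in the WIP patch ignores the return value of write().  A
standalone sketch of the same loop that handles short writes and EINTR might
look like the following.  It assumes plain POSIX I/O on a file descriptor
already positioned at the start of the tail; zero_fill_tail is a hypothetical
helper, not a function in xlog.c, and real backend code would presumably
report failures through ereport() rather than a return code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `left` zero bytes to fd, starting at its
 * current file position.  Returns 0 on success, -1 on error (errno set).
 */
static int
zero_fill_tail(int fd, uint32_t left)
{
	char		buf[8192];

	assert(fd >= 0);
	memset(buf, 0, sizeof(buf));
	while (left > 0)
	{
		size_t		len = (left > sizeof(buf)) ? sizeof(buf) : left;
		ssize_t		written = write(fd, buf, len);

		if (written < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted before any data moved; retry */
			return -1;
		}
		if (written == 0)
		{
			/* assumption: no progress and no errno means out of space */
			errno = ENOSPC;
			return -1;
		}
		left -= (uint32_t) written;	/* a short write just loops again */
	}
	return 0;
}
```

In the patch's context the call would correspond to something like
zero_fill_tail(openLogFile, XLogSegSize - openLogOff), with the caller
deciding how to report a failure.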
 