* Greg Smith <gsmith@xxxxxxxxxxxxx> [081001 00:00]:

> The overhead of clearing out the whole thing is just large enough that it
> can be disruptive on systems generating lots of WAL traffic, so you don't
> want the main database processes bothering with that. A related fact is
> that there is a noticable slowdown to clients that need a segment switch
> on a newly initialized and fast system that has to create all its WAL
> segments, compared to one that has been active long enough to only be
> recycling them. That's why this sort of thing has been getting pushed
> into the archive_command path; nothing performance-sensitive that can
> slow down clients is happening there, so long as your server is powerful
> enough to handle that in parallel with everything else going on.
>
> Now, it would be possible to have that less sensitive archive code path
> zero things out, but you'd need to introduce a way to note when it's been
> done (so you don't do it for a segment twice) and a way to turn it off so
> everybody doesn't go through that overhead (which probably means another
> GUC). That's a bit much trouble to go through just for a feature with a
> fairly limited use-case that can easily live outside of the engine
> altogether.

Remember that the place where this benefit is big is on a generally idle
server. Is it possible to make the "time based WAL switch" zero the tail?
You don't even need to fsync it for durability (although you may want to,
hopefully preventing a larger fsync delay on the next commit).

<timid experience=none>

How about something like the attached? It's been spun quickly, and has
passed regression tests and some simple hand tests on REL8_3_STABLE. It
seems like HEAD can't initdb on my machine (quad opteron with SW raid1); I
tried a few revisions in the last few days, and initdb dies on them all...

I'm no expert in the PG code; I just grepped around for what looked like
reasonable functions in xlog.c until I (hopefully) figured out the basic
flow of switching to new xlog segments. I *think* I'm using openLogFile and
openLogOff correctly.

</timid>

With archiving enabled, an archive_timeout of 30s, and a few manual
pg_start_backup/pg_stop_backup calls, you can see it *really* does make
things compressible. Its output looks like:

    Archiving 000000010000000000000002
    Archiving 000000010000000000000003
    Archiving 000000010000000000000004
    Archiving 000000010000000000000005
    Archiving 000000010000000000000006
    LOG:  checkpoints are occurring too frequently (10 seconds apart)
    HINT:  Consider increasing the configuration parameter "checkpoint_segments".
    Archiving 000000010000000000000007
    Archiving 000000010000000000000008
    Archiving 000000010000000000000009
    LOG:  checkpoints are occurring too frequently (7 seconds apart)
    HINT:  Consider increasing the configuration parameter "checkpoint_segments".
    Archiving 00000001000000000000000A
    Archiving 00000001000000000000000B
    Archiving 00000001000000000000000C
    LOG:  checkpoints are occurring too frequently (6 seconds apart)
    HINT:  Consider increasing the configuration parameter "checkpoint_segments".
    Archiving 00000001000000000000000D
    LOG:  ZEROING xlog file 0 segment 14 from 12615680 - 16777216 [4161536 bytes]
    STATEMENT:  SELECT pg_stop_backup();
    Archiving 00000001000000000000000E
    Archiving 00000001000000000000000E.00C07098.backup
    LOG:  ZEROING xlog file 0 segment 15 from 8192 - 16777216 [16769024 bytes]
    STATEMENT:  SELECT pg_stop_backup();
    Archiving 00000001000000000000000F
    Archiving 00000001000000000000000F.00000C60.backup
    LOG:  ZEROING xlog file 0 segment 16 from 8192 - 16777216 [16769024 bytes]
    STATEMENT:  SELECT pg_stop_backup();
    Archiving 000000010000000000000010.00000F58.backup
    Archiving 000000010000000000000010
    LOG:  ZEROING xlog file 0 segment 17 from 8192 - 16777216 [16769024 bytes]
    STATEMENT:  SELECT pg_stop_backup();
    Archiving 000000010000000000000011
    Archiving 000000010000000000000011.00000020.backup
    LOG:  ZEROING xlog file 0 segment 18 from 6815744 - 16777216 [9961472 bytes]
    Archiving 000000010000000000000012
    LOG:  ZEROING xlog file 0 segment 19 from 8192 - 16777216 [16769024 bytes]
    Archiving 000000010000000000000013
    LOG:  ZEROING xlog file 0 segment 20 from 16384 - 16777216 [16760832 bytes]
    Archiving 000000010000000000000014
    LOG:  ZEROING xlog file 0 segment 23 from 8192 - 16777216 [16769024 bytes]
    STATEMENT:  SELECT pg_switch_xlog();
    Archiving 000000010000000000000017
    LOG:  ZEROING xlog file 0 segment 24 from 8192 - 16777216 [16769024 bytes]
    Archiving 000000010000000000000018
    LOG:  ZEROING xlog file 0 segment 25 from 8192 - 16777216 [16769024 bytes]
    Archiving 000000010000000000000019

You can see that when DB activity was heavy enough to fill an xlog segment
before the timeout (or an interactive forced switch), it didn't zero
anything. It only zeroed on a timeout switch or a forced switch
(pg_switch_xlog/pg_stop_backup).

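The byte counts in the ZEROING lines are simply the distance from the current write offset to the end of the segment. A minimal sketch of that arithmetic follows; SEG_SIZE and tail_bytes are illustrative names (not PostgreSQL identifiers), and the segment size is assumed to be the 16777216 shown in the log:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 MB segments, matching the 16777216 in the log output. */
#define SEG_SIZE 16777216u

/*
 * Hypothetical helper: number of bytes between the current write offset
 * and the end of the segment, i.e. the tail that would be zeroed.
 */
static uint32_t
tail_bytes(uint32_t write_off)
{
    /* An offset past the end of the segment would be a caller bug. */
    assert(write_off <= SEG_SIZE);
    return SEG_SIZE - write_off;
}
```

For the first ZEROING line above, tail_bytes(12615680) gives 4161536, which matches the logged byte count; an offset of 8192 gives 16769024, the nearly empty segment case.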
And compressed xlog segments:

    -rw-r--r-- 1 mountie mountie   18477 2008-10-31 14:44 000000010000000000000010.gz
    -rw-r--r-- 1 mountie mountie   16394 2008-10-31 14:44 000000010000000000000011.gz
    -rw-r--r-- 1 mountie mountie 2721615 2008-10-31 14:52 000000010000000000000012.gz
    -rw-r--r-- 1 mountie mountie   16588 2008-10-31 14:52 000000010000000000000013.gz
    -rw-r--r-- 1 mountie mountie   19230 2008-10-31 14:52 000000010000000000000014.gz
    -rw-r--r-- 1 mountie mountie 4920063 2008-10-31 14:52 000000010000000000000015.gz
    -rw-r--r-- 1 mountie mountie 5024705 2008-10-31 14:52 000000010000000000000016.gz
    -rw-r--r-- 1 mountie mountie   18082 2008-10-31 14:52 000000010000000000000017.gz
    -rw-r--r-- 1 mountie mountie   18477 2008-10-31 14:52 000000010000000000000018.gz
    -rw-r--r-- 1 mountie mountie   16394 2008-10-31 14:52 000000010000000000000019.gz
    -rw-r--r-- 1 mountie mountie 2721615 2008-10-31 15:02 00000001000000000000001A.gz
    -rw-r--r-- 1 mountie mountie   16588 2008-10-31 15:02 00000001000000000000001B.gz
    -rw-r--r-- 1 mountie mountie   19230 2008-10-31 15:02 00000001000000000000001C.gz

And yes, even the non-zeroed segments compress well here, because my test
load is pretty simple:

    CREATE TABLE TEST
    (
        a numeric,
        b numeric,
        c numeric,
        i bigint not null
    );

    INSERT INTO test (a,b,c,i)
        SELECT random(),random(),random(),s FROM generate_series(1,1000000) s;

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@xxxxxxxxxxx                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

commit 3916c54126ffade0baad4609467393d9a1b53e37
Author: Aidan Van Dyk <aidan@xxxxxxxxxxx>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero xlog tail on a forced switch

    If XLogWrite is called with xlog_switch, an XLog switch has been forced,
    either by a timeout based switch (archive_timeout), or an interactive
    forced xlog switch (pg_switch_xlog/pg_stop_backup).  In those cases, we
    assume we can afford a little extra IO bandwidth to make xlogs so much
    more compressible.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8bc46da..a8d945d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1548,6 +1548,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 */
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
+				/*
+				 * If we've had an xlog switch forced, then we want to zero
+				 * out the rest of the segment.  We zero it out here because at the
+				 * force switch time, IO bandwidth isn't a problem.
+				 *   -- AIDAN
+				 */
+				if (xlog_switch)
+				{
+					char buf[1024];
+					uint32 left = (XLogSegSize - openLogOff);
+					ereport(LOG,
+						(errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+								openLogId, openLogSeg,
+								openLogOff, XLogSegSize, left)
+						));
+					memset(buf, 0, sizeof(buf));
+					while (left > 0)
+					{
+						size_t len = (left > sizeof(buf)) ? sizeof(buf) : left;
+						write(openLogFile, buf, len);
+						left -= len;
+					}
+				}
+
 				issue_xlog_fsync();
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
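
The zero-fill loop in the patch ignores the return value of write(), so a short write or an error would go unnoticed while the loop still subtracts the full length. A more defensive variant might look like the sketch below. It is a standalone illustration, not PostgreSQL code: zero_fill is a hypothetical helper, the 8192-byte buffer is an arbitrary choice, and, like the patch, it assumes the file position already sits at the start of the tail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a PostgreSQL function): write `left` zero bytes
 * to fd starting at its current file position.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
zero_fill(int fd, size_t left)
{
    char buf[8192];

    memset(buf, 0, sizeof(buf));
    while (left > 0)
    {
        size_t  chunk = (left > sizeof(buf)) ? sizeof(buf) : left;
        ssize_t written = write(fd, buf, chunk);

        if (written < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written: retry */
            return -1;          /* real error; errno was set by write() */
        }
        if (written == 0)
        {
            /* No progress at all; assumption: treat it as out of space. */
            errno = ENOSPC;
            return -1;
        }
        /* Short writes are fine: only count what actually went out. */
        left -= (size_t) written;
    }
    return 0;
}
```

In the patch's context the call would be roughly zero_fill(openLogFile, XLogSegSize - openLogOff), with a failure reported through ereport rather than ignored. Whether an error there should be fatal or merely logged is a design decision this sketch does not settle.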