On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@xxxxxxxxx> wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@xxxxxxxxx> wrote:
>>>> The double write buffer is one of the few areas where InnoDB does
>>>> more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of writes to secondary indexes, and, recently, tunable
>>>> page-level compression.
>>>
>>> Baron Schwartz was talking to me about this at Surge. I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere. How does it keep
>>> dirty pages in memory longer? Details on the other things you
>>> mention would be interesting to hear, too.
>>
>> For checkpoint behavior:
>> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>>
>> I would think that, in the best case, "sharp" checkpoints with a
>> large checkpoint_completion_target would behave much like a fuzzy
>> checkpoint.
>
> Well, under that definition of a fuzzy checkpoint, our checkpoints
> are fuzzy even with checkpoint_completion_target=0.
>
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints. IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left. Except that doesn't quite
> work, because a page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value > X. So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the
> current LSN.
>
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now. There
> might be some benefit in sorting the writes that we do, so that we
> can spread out the fsyncs: write all the blocks for a given file,
> fsync, and then repeat for each underlying data file that has at
> least one dirty block. But that's completely orthogonal to (and
> would actually be hindered by) the approach described in the
> preceding paragraph.

I wish I could answer your questions better. I am a power user who
does not fully understand InnoDB internals, and there are not all
that many folks who do (given how well it works, there is not all
that much need).

>> Insert buffering (for InnoDB 1.1+ it evidently also buffers delete
>> and purge operations):
>> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html
>
> We do something a bit like this for GiST indices. It would be
> interesting to see if it also has a benefit for btree indices.
>
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of the IO needed for secondary indexes.
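
To give a sense of why the savings can be that large, here is a toy
sketch of the buffering idea as I understand it. It is simplified and
the names are made up; it is not InnoDB's actual code. The point is
that changes destined for secondary-index pages are queued in memory
and applied in batches, so one physical read of a page can absorb
many logical inserts:

    #include <stdio.h>

    #define MAX_BUFFERED 1024

    struct buffered_entry {
        int index_page;     /* which secondary-index page the entry lands on */
        int key;            /* the key being inserted */
    };

    static struct buffered_entry ibuf[MAX_BUFFERED];
    static int ibuf_count = 0;

    /* Instead of reading the index page and inserting immediately
     * (one random read per insert), just remember the change. */
    static void buffered_insert(int index_page, int key)
    {
        if (ibuf_count < MAX_BUFFERED) {
            ibuf[ibuf_count].index_page = index_page;
            ibuf[ibuf_count].key = key;
            ibuf_count++;
        }
    }

    /* When the page is finally read (by a query, or by a background
     * merge), apply every buffered entry for it in one pass. */
    static void merge_page(int index_page)
    {
        int applied = 0, kept = 0, i;

        for (i = 0; i < ibuf_count; i++) {
            if (ibuf[i].index_page == index_page)
                applied++;              /* real code: insert key into page */
            else
                ibuf[kept++] = ibuf[i]; /* keep entries for other pages */
        }
        ibuf_count = kept;
        printf("page %d: %d inserts merged with a single read\n",
               index_page, applied);
    }

    int main(void)
    {
        int i;

        /* 100 inserts scattered over 4 index pages... */
        for (i = 0; i < 100; i++)
            buffered_insert(i % 4, i);

        /* ...cost 4 page reads instead of up to 100. */
        for (i = 0; i < 4; i++)
            merge_page(i);
        return 0;
    }

On a restore, where nearly every insert would otherwise touch a cold
index page, that batching is where the 92% came from.
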
>> Compression:
>> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>>
>> For many workloads, 50% compression results in a negligible impact
>> on performance. For certain workloads compression can help
>> performance. Please note that InnoDB also has a non-tunable
>> TOAST-like feature.
>
> Interesting. I am surprised this works well. It seems that this only
> works for pages that can be compressed by >=50%, which seems like it
> could result in a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me. (See the P.S.
for a sketch of the compress-or-fall-back logic as I understand it.)

>>>> Given that InnoDB is not shipping its logs across the wire, I
>>>> don't think many users would really care if it used the double
>>>> write buffer or full page writes approach to the redo log (other
>>>> than the fact that the log files would be bigger). PG on the other
>>>> hand *is* pushing its logs over the wire...
>>>
>>> So how is InnoDB doing replication? Is there a second log just for
>>> that?
>>
>> The other log is the "binary log", and it is one of the biggest
>> problems with MySQL. Running MySQL in such a way that the binary log
>> stays in sync with the InnoDB redo log has a very significant impact
>> on performance.
>> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
>> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
>> (check out the pretty graph)
>
> Hmm. That seems kinda painful. Having to ship full page images over
> the wire doesn't seem so bad by comparison, though I'm not very happy
> about having to do that either.

The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition, it is simple enough that
I can have a junior DBA manage it. I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...

--
Rob Wultsch
wultsch@xxxxxxxxx
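
P.S. On the "CPU wasted on failed attempts" point: as I understand the
docs, with 16K pages and an 8K on-disk block size the engine only
finds out whether a page fits after zlib has run, so incompressible
data pays the full deflate cost and the result is thrown away. A toy
stand-alone illustration (mine, not InnoDB's actual code; the sizes
are just the common defaults; build with cc sketch.c -lz):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    #define PAGE_SIZE  16384   /* uncompressed page size */
    #define BLOCK_SIZE  8192   /* on-disk block: page must shrink by >=50% */

    /* Try to fit one page into BLOCK_SIZE. The CPU cost of deflate is
     * paid whether or not the result fits; a failed attempt is pure
     * waste (real engines then split the page and retry). */
    static int page_fits(const unsigned char *page)
    {
        unsigned char out[PAGE_SIZE + 1024];    /* >= compressBound(16K) */
        uLongf out_len = sizeof(out);

        if (compress(out, &out_len, page, PAGE_SIZE) != Z_OK)
            return 0;
        printf("deflated 16384 -> %5lu bytes: %s\n",
               (unsigned long) out_len,
               out_len <= BLOCK_SIZE ? "fits" : "too big, work discarded");
        return out_len <= BLOCK_SIZE;
    }

    int main(void)
    {
        unsigned char page[PAGE_SIZE];
        int i;

        memset(page, 'x', sizeof(page));    /* repetitive data: compresses */
        page_fits(page);

        for (i = 0; i < PAGE_SIZE; i++)     /* pseudo-random data: doesn't */
            page[i] = (unsigned char) rand();
        page_fits(page);

        return 0;
    }

Even when an attempt fails, the saved IO on the pages that do
compress dwarfs the deflate cost on my workloads, which is why the
trade works for me.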