On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch@xxxxxxxxx> wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch@xxxxxxxxx> wrote:
>>>> The double write buffer is one of the few areas where InnoDB does
>>>> more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of writes to secondary indexes, and, recently, tunable
>>>> page-level compression.
>>>
>>> Baron Schwartz was talking to me about this at Surge. I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere. How does it keep
>>> dirty pages in memory longer? Details on the other things you
>>> mention would be interesting to hear, too.
>>
>> For checkpoint behavior:
>> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>>
>> I would think that, in the best case, "sharp" checkpoints with a
>> large checkpoint_completion_target would behave much like a fuzzy
>> checkpoint.
>
> Well, under that definition of a fuzzy checkpoint, our checkpoints
> are fuzzy even with checkpoint_completion_target=0.
>
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints. IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left. Except that doesn't quite
> work, because a page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value > X. So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the
> current LSN.
>
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now. There
> might be some benefit in sorting the writes that we do, so that we
> can spread out the fsyncs: write all the blocks for a given file,
> fsync, and then repeat for each underlying data file that has at
> least one dirty block. But that's completely orthogonal to (and
> would actually be hindered by) the approach described in the
> preceding paragraph.

I wish I could answer your questions better. I am a power user who
does not fully understand InnoDB internals, and there are not all
that many folks who do (given how well it works, there is not all
that much need).

>> Insert buffering (for InnoDB 1.1+ it evidently also buffers delete
>> and purge operations):
>> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html
>
> We do something a bit like this for GiST indices. It would be
> interesting to see if it also has a benefit for btree indices.
>
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of the IO needed for secondary indexes.
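
To give a sense of why the savings can be that large, here is a toy
sketch of the buffering idea as I understand it. It is simplified and
the names are made up; it is not InnoDB's actual code. The point is
that changes destined for secondary-index pages are queued in memory
and applied in batches, so one physical read of a page can absorb
many logical inserts:

    #include <stdio.h>

    #define MAX_BUFFERED 1024

    struct buffered_entry {
        int index_page;     /* which secondary-index page the entry lands on */
        int key;            /* the key being inserted */
    };

    static struct buffered_entry ibuf[MAX_BUFFERED];
    static int ibuf_count = 0;

    /* Instead of reading the index page and inserting immediately
     * (one random read per insert), just remember the change. */
    static void buffered_insert(int index_page, int key)
    {
        if (ibuf_count < MAX_BUFFERED) {
            ibuf[ibuf_count].index_page = index_page;
            ibuf[ibuf_count].key = key;
            ibuf_count++;
        }
    }

    /* When the page is finally read (by a query, or by a background
     * merge), apply every buffered entry for it in one pass. */
    static void merge_page(int index_page)
    {
        int applied = 0, kept = 0, i;

        for (i = 0; i < ibuf_count; i++) {
            if (ibuf[i].index_page == index_page)
                applied++;              /* real code: insert key into page */
            else
                ibuf[kept++] = ibuf[i]; /* keep entries for other pages */
        }
        ibuf_count = kept;
        printf("page %d: %d inserts merged with a single read\n",
               index_page, applied);
    }

    int main(void)
    {
        int i;

        /* 100 inserts scattered over 4 index pages... */
        for (i = 0; i < 100; i++)
            buffered_insert(i % 4, i);

        /* ...cost 4 page reads instead of up to 100. */
        for (i = 0; i < 4; i++)
            merge_page(i);
        return 0;
    }

On a restore, where nearly every insert would otherwise touch a cold
index page, that batching is where the 92% came from.
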
>> Compression:
>> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>>
>> For many workloads, 50% compression results in a negligible impact
>> on performance. For certain workloads compression can help
>> performance. Please note that InnoDB also has a non-tunable
>> TOAST-like feature.
>
> Interesting. I am surprised this works well. It seems that this only
> works for pages that can be compressed by >=50%, which seems like it
> could result in a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me. (See the P.S.
for a sketch of the compress-or-fall-back logic as I understand it.)

>>>> Given that InnoDB is not shipping its logs across the wire, I
>>>> don't think many users would really care if it used the double
>>>> write buffer or full page writes approach to the redo log (other
>>>> than the fact that the log files would be bigger). PG on the other
>>>> hand *is* pushing its logs over the wire...
>>>
>>> So how is InnoDB doing replication? Is there a second log just for
>>> that?
>>
>> The other log is the "binary log", and it is one of the biggest
>> problems with MySQL. Running MySQL in such a way that the binary log
>> stays in sync with the InnoDB redo log has a very significant impact
>> on performance.
>> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
>> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
>> (check out the pretty graph)
>
> Hmm. That seems kinda painful. Having to ship full page images over
> the wire doesn't seem so bad by comparison, though I'm not very happy
> about having to do that either.

The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition, it is simple enough that
I can have a junior DBA manage it. I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...

--
Rob Wultsch
wultsch@xxxxxxxxx
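
P.S. On the "CPU wasted on failed attempts" point: as I understand the
docs, with 16K pages and an 8K on-disk block size the engine only
finds out whether a page fits after zlib has run, so incompressible
data pays the full deflate cost and the result is thrown away. A toy
stand-alone illustration (mine, not InnoDB's actual code; the sizes
are just the common defaults; build with cc sketch.c -lz):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    #define PAGE_SIZE  16384   /* uncompressed page size */
    #define BLOCK_SIZE  8192   /* on-disk block: page must shrink by >=50% */

    /* Try to fit one page into BLOCK_SIZE. The CPU cost of deflate is
     * paid whether or not the result fits; a failed attempt is pure
     * waste (real engines then split the page and retry). */
    static int page_fits(const unsigned char *page)
    {
        unsigned char out[PAGE_SIZE + 1024];    /* >= compressBound(16K) */
        uLongf out_len = sizeof(out);

        if (compress(out, &out_len, page, PAGE_SIZE) != Z_OK)
            return 0;
        printf("deflated 16384 -> %5lu bytes: %s\n",
               (unsigned long) out_len,
               out_len <= BLOCK_SIZE ? "fits" : "too big, work discarded");
        return out_len <= BLOCK_SIZE;
    }

    int main(void)
    {
        unsigned char page[PAGE_SIZE];
        int i;

        memset(page, 'x', sizeof(page));    /* repetitive data: compresses */
        page_fits(page);

        for (i = 0; i < PAGE_SIZE; i++)     /* pseudo-random data: doesn't */
            page[i] = (unsigned char) rand();
        page_fits(page);

        return 0;
    }

Even when an attempt fails, the saved IO on the pages that do
compress dwarfs the deflate cost on my workloads, which is why the
trade works for me.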