Re: shared_buffers 8GB maximum

Vitaliy Garnashevich <vgarnashevich@xxxxxxxxx> · Mon, 19 Feb 2018 14:37:43 +0200



            Yes.  I don't know the exact reason, but reading a
              buffer from OS cache is quite a bit more expensive than
              just pinning a buffer already in the buffer_pool, about 5
              times more expensive the last time I tested it, which was
              before Meltdown.  (And just pinning a buffer which is
              already in the cache is already pretty expensive--about 15
              times as expensive as reading the next tuple from an
              already-pinned buffer).
          
        
    Thanks for the numbers. Just out of curiosity, do you happen to know
    how much more expensive compared to that a read from disk is? And
    also, how much the pinning can be slowed down, when having to
    iterate using the clock-sweep method over large shared_buffers?

    
            I don't think that there is any reason to think that
              buffers_clean > buffers_checkpoint is a problem.  In
              fact, you could argue that it is the way it was designed
              to work.  Although the background writer does need to tell
              the checkpointer about every file it dirties, so it can be
              fsynced at the end of the checkpoint.  The overhead of
              this was minimal in my testing.
            

    The reason why I mentioned buffers_clean is because I was assuming
    that under "healthy" conditions, most writes should be done by
    checkpointer, because, as it was already mentioned, that's the most
    efficient way of writing (no duplicate writes of the same buffer,
    write optimizations etc.). I was thinking about bgwriter as a way of
    reducing latency by avoiding the case when a backend has to write
    buffers by itself. So that would mean that big
      numbers in buffers_clean and buffers_backend
    compared
        to buffers_checkpoint, would mean that a lot of writes
      are done not by checkpointer, and thus probably less efficiently
      than they could be. That might have resulted in IO writes being
      more random, and more IO writes done in general, because same
      buffer can be written multiple times between checkpoints.

    
              But
                buffers_backend > buffers_checkpoint could be a
                problem, especially if they are also much larger than
                buffers_clean.  But the wrinkle here is that if you do
                bulk inserts or bulk updates (what about vacuums?), the
                backends by design write their own dirty buffers.  So if
                you do those kinds of things, buffers_backend being
                large doesn't indicate much.  There was a patch
                someplace a while ago to separate the counters of
                backend-intentional writes from backend-no-choice
                writes, but it never went anywhere.  

            
      We do daily manual vacuuming. Knowing what part of total writes is
      accounted for them indeed would be nice.

      
      When looking at buffers_checkpoint/buffers_clean/buffers_backend,
      I was saving the numbers with several hours interval, knowing that
      there are no vacuums running at that time, and calculated the
      difference.

      
            It is not clear to me that this is the best way to
              measure health.  Did your response time go down?  Did your
              throughput go up?
          
        
    We have mixed type of DB usage. There is OLTP-like part with many
    small read/write transactions. Predictable latency does not matter
    in that case, but throughput does, because that is basically a
    background data loading job. Then there is an OLAP-like part when
    heavier report queries are being run. Then there are more background
    jobs which are a combination of both, which at first run long
    queries and then do lots of small inserts, thus pre-calculating some
    data for bigger reports.

    
    After increasing shared_buffers 8GB -> 64GB, there was 7%
    improvement in run time of the background pre-calculating job
    (measured by running several times in a row, and caches are hot).

    
    When we configured hugepages for the bigger shared_buffers, the
    additional improvement was around 3%.

    
    Regards,

    Vitaliy