Faster, more low-level methods of having hot standby / secondary read-only servers?

I came across this DBA.StackExchange question https://dba.stackexchange.com/questions/316715/in-postgresql-repmgr-master-slave-environment-is-it-necessary-to-have-same-h-w-c/316756, which reminded me of my experience with, and disappointment in, the hot-standby design.

Say you have millions of users, mostly querying, of whom perhaps 1000 also create change transactions. It would be really useful to have not just one but 10 standbys load-balancing all the queries through PgPool-II (or whenever that actually gets integrated into mainline PostgreSQL -- is anyone thinking about this? After all, PgPool-II is essentially 75% PostgreSQL code, is it not?)

The problem I found was that WAL replay requires a lot of resources, to the point where whatever work the master did to turn the transaction(s) into WAL content seems insignificant compared to the work left just applying that WAL to the data files. I know this because I tried running the standby for a very insert-heavy workload on lesser hardware, hoping it would be enough to just "keep up", but it was not. I needed the same hardware configuration as the master (I use AWS, so it is easy to try out different ones).

What I then did to catch up: despite the files having grown quite big, it was pretty fast to just rsync the data files themselves, and I was quickly back in sync and could continue processing the WAL updates.
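For the record, the catch-up step I mean looks roughly like this. This is a hedged sketch, not a tested procedure: the host name and path (primary.example.com, /pgdata) are placeholders, and it assumes PostgreSQL 15+, where pg_backup_start()/pg_backup_stop() must run in the same session.

```shell
pg_ctl -D /pgdata stop -m fast        # stop the lagging standby

# Put the primary into backup mode so the copied files are recoverable,
# and let rsync's delta transfer move only the changed blocks.
psql -h primary.example.com -d postgres <<'SQL'
SELECT pg_backup_start('catchup', true);
\! rsync -a --delete --exclude='pg_wal/*' --exclude='postmaster.pid' primary.example.com:/pgdata/ /pgdata/
SELECT labelfile FROM pg_backup_stop();
SQL
# NOTE: a real procedure must also write the labelfile returned above into
# /pgdata/backup_label (or simply use pg_basebackup, which does all this).

touch /pgdata/standby.signal          # keep the node a standby
pg_ctl -D /pgdata start               # resume WAL replay from here
```

The point is that rsync only transfers changed file regions, so the "resync" cost scales with the delta, not with the database size.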

I think I asked here (a year or two ago) to confirm this, and the conclusion was that, sadly, you do not gain that much free server power by using one master and several secondaries, because the secondaries are kept so busy applying the incoming WAL from the master that they have very little spare capacity left for query traffic.

But this gave me an idea. Two ideas actually.

  1. Use a shared device, also known as a "cluster" filesystem, the point being that the standby operates on the same physical drive. That could cause contention from increased head-seek activity, but that is hardly an issue these days with SSDs. Ultimately this pushes the problem down to the hardware, where you have a similar trade-off: adding a standby still increases some load on the master and ties up some of the standby's resources just keeping up. But this lower-level approach seems far less CPU-intensive than applying WAL.
  2. Why not use rsync-based replication of the actual data files rather than the WAL? Such a standby might not be immediately ready to take over when the master goes down, but an rsync-based delta applied to the standby, plus the last small stretch of WAL, should be able to bring it to a reliable state, just like the pure WAL-shipping method.

This would seem particularly useful if most of the query activity is not so time-critical that it must be up to the second with the master. You can always have a little lag with WAL-based replication anyway, so your results from a standby might be a few minutes behind what you would get by querying the master. Strictly consistent transactions would have to be applied on the master in any case, and some querying is involved there too.

Let's take an airline fare finder with a booking feature as a use case. Most of the activity is users finding their best itinerary, comparing fares, convenience, maybe even available seats. Then they start putting a reservation together, but we know there will be attrition, where, say, 50% of the reservation work is abandoned. That means we would let users build their reservation, and only when they are getting ready to pay would we move the transaction to the master, re-check all the preconditions (e.g., is seat 18C still available?) and lock the resources, ready to issue the reservation once the payment is confirmed, all in one transaction.

If the standbys can be 30 seconds behind the master, they could just as well be 3 minutes behind, or 30 minutes. The less they are allowed to fall behind, the more resources they must spend tracking the master. This can be tuned for real-world use cases. But clearly it would be beneficial to have a lightweight replication mechanism that is good enough and does not consume the massive resources that WAL-replay-based replication requires.

Am I way off-base?

Can I come up with a poor-man's implementation of this? For example, say I have 3 standby servers (the more standbys I have, the more relevant this gets, since every standby duplicates the heavy work of applying the WAL to the data files). I would allow each standby to fall behind by up to n minutes, say 10. In the 8th minute I would take it down briefly, rsync the data files from the master, and start it back up. That process might take just 2 of the 10 minutes (I think these are somewhat realistic numbers). My 3 standbys would rotate through this, so 2 of the 3 are always up while 1 might be briefly down.
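The rotation can be staggered so the down-windows never overlap. A tiny sketch of the arithmetic, using the numbers from the text (10-minute cycle, roughly 2-minute resync):

```shell
# Stagger 3 standbys across a 10-minute cycle so their ~2-minute resync
# windows never overlap: the offsets land at minutes 0, 3, and 6.
CYCLE=10      # minutes between resyncs (the maximum allowed lag)
DOWN=2        # minutes a resync is expected to take
STANDBYS=3

for i in 0 1 2; do
  start=$(( i * CYCLE / STANDBYS ))
  echo "standby$i resyncs during minutes $start-$(( start + DOWN )) of each cycle"
done
```

With windows 0-2, 3-5, and 6-8, at most one standby is ever resyncing, so 2 of the 3 are always serving queries.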

This could be improved further with some OS and SAN support, all the way down to RAID mirrors. But practically, I could mount the same filesystem (e.g., BSD UFS) on the master with -o rw and on the standby with -o ro. Even without UFS having any "cluster" support, I can do that. I would see a few inconsistencies, but PostgreSQL is quite tolerant of small inconsistencies and can usually repair them. (I would not dare try anything like this with an Oracle system.)
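Concretely, I mean something like the following, where /dev/da0p2 is a placeholder for a SAN LUN visible to both hosts:

```shell
# Hypothetical shared-LUN layout. UFS has no cluster support, so the
# read-only mount will see the writer's in-flight changes unsynchronized --
# this is exactly the "small inconsistencies" gamble described above.
mount -t ufs -o rw /dev/da0p2 /pgdata    # on the primary host
mount -t ufs -o ro /dev/da0p2 /pgdata    # on the standby host
```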

Another tool I could imagine using is BSD UFS snapshot support. I could take a snapshot that is consistent from the filesystem perspective. PostgreSQL writers could interface with that: flush their buffers, issue a sync, then trigger a UFS snapshot. Any standby reading its data files from that snapshot would then have not only filesystem-level consistency but even database-level consistency. With replication via a filesystem mounted from multiple PostgreSQL servers, this should work well while consuming a minimal amount of server resources, and it should not lead to much actual disk I/O contention (seeks) either.
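A rough FreeBSD sketch of that sequence; device names and paths are placeholders, and the idea that CHECKPOINT plus sync leaves the snapshot in a near-consistent database state is an assumption of the scheme, not a guarantee PostgreSQL makes for external snapshots:

```shell
psql -c "CHECKPOINT"                          # flush PostgreSQL's dirty buffers
sync                                          # flush filesystem buffers
mksnap_ffs /pgdata /pgdata/.snap/pgsnap       # create the UFS snapshot

# Expose the frozen snapshot as a read-only device for the standby.
mdconfig -a -t vnode -o readonly -f /pgdata/.snap/pgsnap -u 1
mount -o ro /dev/md1 /standby-pgdata          # standby reads the frozen image
```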

And even the disk contention could possibly be resolved by letting a RAID mirror sync to the same snapshot point and then splitting it off, directing the read activity of the standby server (querying like crazy) only at that mirror, while batching changes back to the RAID master so they can be applied with very low overhead.

Anyone thinking about these things?

regards,
-Gunther

