Faster, more low-level methods of having hot standby / secondary read-only servers?

I came across this DBA.StackExchange question https://dba.stackexchange.com/questions/316715/in-postgresql-repmgr-master-slave-environment-is-it-necessary-to-have-same-h-w-c/316756, which reminded me of my experience with, and disappointment in, the hot-standby design.

Say you have millions of users, mostly querying, of whom perhaps 1000 also create change transactions. It would be really useful to have not just one but 10 standbys load-balancing all the queries through PgPool-II (or whenever that actually gets integrated into mainline PostgreSQL -- is anyone thinking about this? After all, PgPool-II is essentially 75% PostgreSQL code, is it not?)

The problem I found was that WAL replay requires a lot of resources, to the point where whatever work the master did to turn the transaction(s) into WAL content seems insignificant compared to the work left just applying that WAL to the data files. I know this because I tried running the standby for a very insert-heavy workload on lesser hardware, hoping it would be enough to just "keep up", but it was not. I needed the same hardware configuration as the master (I use AWS, so it is easy to try out different ones).

What I then did to catch up: despite the files having grown quite big, it was pretty fast to just rsync the data files themselves, and I was quickly back in sync and could continue processing the WAL updates.
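For the record, the catch-up step I mean looks roughly like this. This is a hedged sketch, not a tested procedure: the host name and path (primary.example.com, /pgdata) are placeholders, and it assumes PostgreSQL 15+, where pg_backup_start()/pg_backup_stop() must run in the same session.

```shell
pg_ctl -D /pgdata stop -m fast        # stop the lagging standby

# Put the primary into backup mode so the copied files are recoverable,
# and let rsync's delta transfer move only the changed blocks.
psql -h primary.example.com -d postgres <<'SQL'
SELECT pg_backup_start('catchup', true);
\! rsync -a --delete --exclude='pg_wal/*' --exclude='postmaster.pid' primary.example.com:/pgdata/ /pgdata/
SELECT labelfile FROM pg_backup_stop();
SQL
# NOTE: a real procedure must also write the labelfile returned above into
# /pgdata/backup_label (or simply use pg_basebackup, which does all this).

touch /pgdata/standby.signal          # keep the node a standby
pg_ctl -D /pgdata start               # resume WAL replay from here
```

The point is that rsync only transfers changed file regions, so the "resync" cost scales with the delta, not with the database size.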

I think I asked here (a year or two ago) to confirm this, and the conclusion was that, sadly, you do not gain that much free server power by using one master and several secondaries, because the secondaries are kept so busy applying the incoming WAL from the master that they have very little spare capacity left for query traffic.

But this gave me an idea. Two ideas actually.

  1. Use a shared device, also known as a "cluster" filesystem, the point being that the standby operates on the same physical drive. That could cause contention from increased head-seek activity, but that is hardly an issue these days with SSDs. Ultimately this pushes the problem down to the hardware, where you have a similar trade-off: adding a standby still increases some load on the master and ties up some of the standby's resources just keeping up. But this lower-level approach seems far less CPU-intensive than applying WAL.
  2. Why not use rsync-based replication of the actual data files rather than the WAL? Such a standby might not be immediately ready to take over when the master goes down, but an rsync-based delta applied to the standby, plus the last small stretch of WAL, should be able to bring it to a reliable state, just like the pure WAL-shipping method.

This would seem particularly useful if most of the query activity is not so time-critical that it must be up to the second with the master. You can always have a little lag with WAL-based replication anyway, so your results from a standby might be a few minutes behind what you would get by querying the master. Strictly consistent transactions would have to be applied on the master in any case, and some querying is involved there too.

Let's take an airline fare finder with a booking feature as a use case. Most of the activity is users finding their best itinerary, comparing fares, convenience, maybe even available seats. Then they start putting a reservation together, but we know there will be attrition, where, say, 50% of the reservation work is abandoned. That means we would let users build their reservation, and only when they are getting ready to pay would we move the transaction to the master, re-check all the preconditions (e.g., is seat 18C still available?) and lock the resources, ready to issue the reservation once the payment is confirmed, all in one transaction.

If the standbys can be 30 seconds behind the master, they could just as well be 3 minutes behind, or 30 minutes. The less they are allowed to fall behind, the more resources they must spend tracking the master. This can be tuned for real-world use cases. But clearly it would be beneficial to have a lightweight replication mechanism that is good enough and does not consume the massive resources that WAL-replay-based replication requires.

Am I way off-base?

Can I come up with a poor-man's implementation of this? For example, say I have 3 standby servers (the more standbys I have, the more relevant this gets, since every standby duplicates the heavy work of applying the WAL to the data files). I would allow each standby to fall behind by up to n minutes, say 10. In the 8th minute I would take it down briefly, rsync the data files from the master, and start it back up. That process might take just 2 of the 10 minutes (I think these are somewhat realistic numbers). My 3 standbys would rotate through this, so 2 of the 3 are always up while 1 might be briefly down.
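The rotation can be staggered so the down-windows never overlap. A tiny sketch of the arithmetic, using the numbers from the text (10-minute cycle, roughly 2-minute resync):

```shell
# Stagger 3 standbys across a 10-minute cycle so their ~2-minute resync
# windows never overlap: the offsets land at minutes 0, 3, and 6.
CYCLE=10      # minutes between resyncs (the maximum allowed lag)
DOWN=2        # minutes a resync is expected to take
STANDBYS=3

for i in 0 1 2; do
  start=$(( i * CYCLE / STANDBYS ))
  echo "standby$i resyncs during minutes $start-$(( start + DOWN )) of each cycle"
done
```

With windows 0-2, 3-5, and 6-8, at most one standby is ever resyncing, so 2 of the 3 are always serving queries.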

This could be improved further with some OS and SAN support, all the way down to RAID mirrors. But practically, I could mount the same filesystem (e.g., BSD UFS) on the master with -o rw and on the standby with -o ro. Even without UFS having any "cluster" support, I can do that. I would see a few inconsistencies, but PostgreSQL is quite tolerant of small inconsistencies and can usually repair them. (I would not dare try anything like this with an Oracle system.)
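Concretely, I mean something like the following, where /dev/da0p2 is a placeholder for a SAN LUN visible to both hosts:

```shell
# Hypothetical shared-LUN layout. UFS has no cluster support, so the
# read-only mount will see the writer's in-flight changes unsynchronized --
# this is exactly the "small inconsistencies" gamble described above.
mount -t ufs -o rw /dev/da0p2 /pgdata    # on the primary host
mount -t ufs -o ro /dev/da0p2 /pgdata    # on the standby host
```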

Another tool I could imagine using is BSD UFS snapshot support. I could take a snapshot that is consistent from the filesystem perspective. PostgreSQL writers could interface with that: flush their buffers, issue a sync, then trigger a UFS snapshot. Any standby reading its data files from that snapshot would then have not only filesystem-level consistency but even database-level consistency. With replication via a filesystem mounted from multiple PostgreSQL servers, this should work well while consuming a minimal amount of server resources, and it should not lead to much actual disk I/O contention (seeks) either.
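A rough FreeBSD sketch of that sequence; device names and paths are placeholders, and the idea that CHECKPOINT plus sync leaves the snapshot in a near-consistent database state is an assumption of the scheme, not a guarantee PostgreSQL makes for external snapshots:

```shell
psql -c "CHECKPOINT"                          # flush PostgreSQL's dirty buffers
sync                                          # flush filesystem buffers
mksnap_ffs /pgdata /pgdata/.snap/pgsnap       # create the UFS snapshot

# Expose the frozen snapshot as a read-only device for the standby.
mdconfig -a -t vnode -o readonly -f /pgdata/.snap/pgsnap -u 1
mount -o ro /dev/md1 /standby-pgdata          # standby reads the frozen image
```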

And even the disk contention could possibly be resolved by letting a RAID mirror sync to the same snapshot point and then splitting it off, directing the read activity of the standby server (querying like crazy) only at that mirror, while batching changes back to the RAID master so they can be applied with very low overhead.

Anyone thinking about these things?

regards,
-Gunther

