On 8/10/2010 3:28 PM, Karl Denninger wrote:
> Brad Nicholson wrote:
>> On 8/10/2010 2:38 PM, Karl Denninger wrote:
>>> Scott Marlowe wrote:
>>>> On Tue, Aug 10, 2010 at 12:13 PM, Karl Denninger <karl@xxxxxxxxxxxxx> wrote:
>>>>> ANY disk that says "write is complete" when it really is not is
>>>>> entirely unsuitable for ANY real database use. It is simply a
>>>>> matter of time
>>>>
>>>> What about read only slaves where there's a master with 100+
>>>> spinning hard drives "getting it right" and you need a half dozen
>>>> or so read slaves? I can imagine that being ok, as long as you
>>>> don't restart a server after a crash without checking on it.
>>>
>>> A read-only slave isn't read-only, is it?
>>
>> A valid case is a Slony replica if used for query offloading (not
>> for DR). It's considered a read-only subscriber from the perspective
>> of Slony as only Slony can modify the data (although you are
>> technically correct, it is not read only - controlled write may be
>> more accurate).
>
> CAREFUL with that model and beliefs. Specifically, the following will
> hose you without warning:

What will hose you is assuming that your data will be okay in the case
of a failure, which is a very bad assumption to make in the case of
unreliable SSDs. You are assuming I am implying that these should be
treated like reliable media - I am not.

In case of failure, you need to assume data loss until proven
otherwise. If there is a problem, rebuild.

> When the slave restarts it will not know that the transaction was
> lost. Neither will the master, since it was told that it was
> committed. Slony will happily go on its way and replicate forward,
> without any indication of a problem - except that on the slave, there
> are one or more transactions that are **missing**.

Correct.

> Some time later you issue an update that goes to the slave, but the
> change previously lost causes the slave commit to violate referential
> integrity. SLONY will fail to propagate that change and all behind
> it - it effectively locks at that point in time.

It will lock data flow to that subscriber, but not to others.

> You can recover from this by dropping the slave from replication and
> re-inserting it, but that forces a full-table copy of everything in
> the replication set. The bad news is that the queries to the slave in
> question may have been returning erroneous data for some unknown
> period of time prior to the lockup in replication (which hopefully
> you detect reasonably quickly - you ARE watching SLONY queue depth
> with some automated process, right?)

There are ways around that - run two subscribers and redirect your
queries on failure. Don't bring up the failed replica until it is
verified or rebuilt.

> I can both cause this in the lab and have had it happen in the field.
> It's a nasty little problem that bit me on a series of disks that
> claimed to have write caching off, but in fact did not. I was very
> happy that the data on the master was good at that point, as if I had
> needed to failover to the slave (thinking it was a "good" copy) I
> would have been in SERIOUS trouble.

It's very easy to cause those sorts of problems.

What I am saying is that the technology can have a use, if you are
aware of the sharp edges, and can both work around them and live with
them. Everything you are citing is correct, but it implies that they
are being blindly thrown in without understanding the risks and
mitigating them.

I'm also not suggesting that this is a configuration I would endorse,
but it could potentially save a lot of money in certain use cases.

-- 
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
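
For anyone wanting to automate the two points discussed above (watching
Slony queue depth, and keeping queries away from a subscriber that is
stalled or has not yet been verified after a crash), here is a minimal
sketch of the decision logic. It assumes the lag figures come from
Slony-I's sl_status view on the origin node. The schema name
"_mycluster", the thresholds, and the names SubscriberLag and
healthy_subscribers are all made up for illustration. Note that lag
monitoring only catches the stall in replication; it cannot detect the
silently lost transaction that precedes it, which is why a crashed
replica stays quarantined until it is verified or rebuilt.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed query, to be run on the origin node with whatever driver you use.
# Slony-I keeps its objects in a schema named "_" + cluster name;
# "_mycluster" here is a placeholder.
LAG_QUERY = """
SELECT st_received, st_lag_num_events, st_lag_time
FROM _mycluster.sl_status
"""


@dataclass
class SubscriberLag:
    """One row of LAG_QUERY (hypothetical container, not a Slony type)."""
    node_id: int            # st_received: the subscriber node
    lag_events: int         # st_lag_num_events: events not yet confirmed
    lag_time: timedelta     # st_lag_time: age of the last confirmed event


def healthy_subscribers(rows,
                        max_events=500,
                        max_lag=timedelta(minutes=5),
                        quarantined=frozenset()):
    """Return node ids that are safe to send read queries to.

    A node is excluded if it is quarantined (crashed and not yet
    verified or rebuilt) or if its replication lag exceeds either
    threshold. The thresholds are illustrative defaults only.
    """
    ok = []
    for row in rows:
        if row.node_id in quarantined:
            continue
        if row.lag_events > max_events or row.lag_time > max_lag:
            continue
        ok.append(row.node_id)
    return ok


if __name__ == "__main__":
    sample = [
        SubscriberLag(2, 10, timedelta(seconds=5)),
        SubscriberLag(3, 9000, timedelta(hours=1)),   # replication stalled
    ]
    # Node 3 is dropped from the read pool because of its lag.
    print(healthy_subscribers(sample))
```

With two subscribers, the read pool shrinks to the surviving node when
one stalls or is quarantined, and an empty result is the signal to fall
back to the master or raise an alert.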