9.0 replication: observation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



About a month ago I upgraded five Linux (CentOS 4.4) DB boxes (for load balancing) from 8.3.0 to 9.0.1 in order to use replication, which I needed for a new web/DB application. The replication configuration went well, and all the four "hot_standby" servers are (streaming) replicating just fine from the primary DB server. There is one (separate) web server that uses "round-robin" load balancing for queries to the five DB servers. Note that the primary DB server and one "hot standby" server are reasonably fast dual-core 2.8GHz boxes; the other three vary in performance, but are slower, and one is several years old with a single 1GHz processor, but still gives reasonable query performance. This last box is the subject of this message:

The PostgreSQL installation on each box consists of several "logical" databases: Most of them are either updated or replaced in their entirety from large US Gov't database downloads, at the beginning either of every week or every month. Each update/replacement file is 400-700MB (ASCII); the resulting PostgreSQL .../data directory is about 7GB.

However, one of the logical databases is relatively small (currently 25MB ASCII equivalent but growing), and receives a new transaction (INSERT) about every 5 seconds.

All this has been working fine until this morning (the first of the month!), when I noticed that connections/queries to the slowest DB server were sometimes timing out, and otherwise serving up old data. A query to the this box revealed that the most recent data in the small logical DB (the one that receives INSERTs about every 5 seconds) was about three hours old. At first I thought that PostgreSQL on the slowest box was hung, or had lost its connection to the master. However, a further check showed that PostgreSQL on this box was apparently I/O bound, and the small DB was slowly being updated, albeit not as fast as new transactions were coming in to the primary DB server. All the other boxes showed current data in the small DB.

A check of other logs showed that the primary had just completed a massive UPDATE of one of its tables (plus INDEX rebuilding) a few minutes before (that takes 30 minutes on the primary). So, I temporarily removed the slow DB server from the web server's round-robin "ring", and went and took a bath. When I got out of the bath I was clean, and the slow DB server had caught up and was now serving current data.

I find that impressive, to be three hours out of date and still catch up. That's a testament to the quality of the design, code & testing of replication (and a first release at that). I have over forty years of experience in software development (mostly embedded systems), and I don't hand out software complements easily, and for good reason: I think most of the software out there is CRAP.

Now, I have to figure out a way to automatically deal with the slow DB server during those massive replacements (ie, disabling queries until the slow box is current). Since the "round-robin" implementation is in PHP on the web server, that shouldn't be too difficult.

-- Dean


--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux