About a month ago I upgraded five Linux (CentOS 4.4) DB boxes (for load
balancing) from 8.3.0 to 9.0.1 in order to use replication, which I
needed for a new web/DB application. The replication configuration went
well, and all the four "hot_standby" servers are (streaming) replicating
just fine from the primary DB server. There is one (separate) web
server that uses "round-robin" load balancing for queries to the five DB
servers. Note that the primary DB server and one "hot standby" server
are reasonably fast dual-core 2.8GHz boxes; the other three vary in
performance, but are slower, and one is several years old with a single
1GHz processor, but still gives reasonable query performance. This last
box is the subject of this message:
The PostgreSQL installation on each box consists of several "logical"
databases: Most of them are either updated or replaced in their
entirety from large US Gov't database downloads, at the beginning either
of every week or every month. Each update/replacement file is 400-700MB
(ASCII); the resulting PostgreSQL .../data directory is about 7GB.
However, one of the logical databases is relatively small (currently
25MB ASCII equivalent but growing), and receives a new transaction
(INSERT) about every 5 seconds.
All this has been working fine until this morning (the first of the
month!), when I noticed that connections/queries to the slowest DB
server were sometimes timing out, and otherwise serving up old data. A
query to the this box revealed that the most recent data in the small
logical DB (the one that receives INSERTs about every 5 seconds) was
about three hours old. At first I thought that PostgreSQL on the
slowest box was hung, or had lost its connection to the master.
However, a further check showed that PostgreSQL on this box was
apparently I/O bound, and the small DB was slowly being updated, albeit
not as fast as new transactions were coming in to the primary DB
server. All the other boxes showed current data in the small DB.
A check of other logs showed that the primary had just completed a
massive UPDATE of one of its tables (plus INDEX rebuilding) a few
minutes before (that takes 30 minutes on the primary). So, I
temporarily removed the slow DB server from the web server's round-robin
"ring", and went and took a bath. When I got out of the bath I was
clean, and the slow DB server had caught up and was now serving current
data.
I find that impressive, to be three hours out of date and still catch
up. That's a testament to the quality of the design, code & testing of
replication (and a first release at that). I have over forty years of
experience in software development (mostly embedded systems), and I
don't hand out software complements easily, and for good reason: I
think most of the software out there is CRAP.
Now, I have to figure out a way to automatically deal with the slow DB
server during those massive replacements (ie, disabling queries until
the slow box is current). Since the "round-robin" implementation is in
PHP on the web server, that shouldn't be too difficult.
-- Dean
--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin