David Sickmiller wrote:
I'm running 2.0rc1 with the 2.6.27 kernel. I have a 2-node cluster.
GlusterFS runs on both nodes, and MySQL runs on the active node. If the
active node fails or is put on standby, MySQL fires up on the other
node. Unlike MySQL Replication with its slave lag, I know my data
changes are durable in the event of a server failure. Most people use
DRBD for this, but I'm hoping to enjoy GlusterFS's benefits of handling
split-brain situations at the file level instead of the volume level,
future scalability avenues, and general ease of use. Hopefully DRBD
doesn't have unmatchable performance advantages I'm overlooking.
Note that DRBD resync is more efficient - it only resyncs dirty blocks,
which can be much faster for big databases. Gluster will copy the whole
file.
I'm going to report my testing in order, because the changes were
cumulative. I used server-side io-threads from the start. Before I
started recording the speed, I discovered that running in single process
mode was dramatically faster. At that time, I also configured
read-subvolume to use the local server. At this point I started measuring:
* Printing schema: 18s
* Compressed export: 2m45s
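For reference, that kind of single-process setup can be expressed in one
volfile per node, roughly like the sketch below; the hostnames, paths and
thread count are assumptions rather than the exact configuration used:

  # local storage brick
  volume posix
    type storage/posix
    option directory /data/glusterfs       # assumed export path
  end-volume

  volume locks
    type features/locks
    subvolumes posix
  end-volume

  # server-side io-threads
  volume brick
    type performance/io-threads
    option thread-count 8                  # assumed value
    subvolumes locks
  end-volume

  # export the local brick to the peer node
  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.brick.allow *         # tighten this in production
    subvolumes brick
  end-volume

  # connection to the peer node's exported brick
  volume remote
    type protocol/client
    option transport-type tcp
    option remote-host node2               # assumed peer hostname
    option remote-subvolume brick
  end-volume

  # replicate across local and remote, preferring local reads;
  # NB: both nodes must list the same physical brick first (lock server),
  # which is the issue discussed further down
  volume replicate
    type cluster/replicate
    option read-subvolume brick            # read from the local copy
    subvolumes brick remote
  end-volume

The same glusterfs process then mounts the replicate volume over FUSE, so
there is no separate client process talking to a local server over TCP.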
For a benchmark, I moved MySQL's datafiles to the local ext3 disk (but
kept writing the export to GlusterFS). It was 10-100X faster!
* Printing schema: 0.2s
* Compressed export: 28s
Did you flush the caches in between the runs? What is your network
connection between the nodes?
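If the page cache is the question, one general way to get cold-cache
numbers between runs (nothing GlusterFS-specific) is to drop it on both
nodes before each test:

  sync                                 # flush dirty pages first
  echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes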
There were no appreciable changes from installing fuse-2.7.4glfs11, using
Booster, or running blockdev to increase readahead from 256 to 16384.
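The readahead change mentioned above would have been done with something
along these lines (the device name is a placeholder):

  blockdev --getra /dev/sda        # current readahead, in 512-byte sectors
  blockdev --setra 16384 /dev/sda  # raise it to 16384 sectors (8 MB)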
Adding the io-cache client-side translator didn't affect printing the
schema but cut the export in half:
* Compressed export: 1m10s
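A client-side io-cache translator stacked on top of replicate looks
roughly like this (the cache size is an assumed value, not the one used):

  volume iocache
    type performance/io-cache
    option cache-size 64MB          # assumed value
    subvolumes replicate
  end-volume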
Going off on a tangent, I shut down the remote node. This increased the
performance by an order of magnitude:
* Printing schema: 2s
* Compressed export: 24s
What is the ping time between the servers? Have you measured the
throughput between the servers with something like ftp on big files? Is
it the writes or the reads that slow down? Try dumping to an ext3
filesystem from gluster.
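A quick way to get those latency and throughput numbers, with placeholder
hostnames (netcat option syntax varies between variants):

  ping -c 10 node2                     # round-trip latency between the nodes

  # raw TCP throughput, no disks involved:
  nc -l -p 9999 > /dev/null                         # on node2
  dd if=/dev/zero bs=1M count=1024 | nc node2 9999  # on node1, pushes 1 GB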
I resumed testing with both servers running. Switching the I/O
scheduler to deadline had no appreciable effect. Neither did adding
client-side io-threads or server-side write-behind. Surprisingly, I
found that changing read-subvolume to the remote server had only a minor
penalty.
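For completeness, these are the kinds of changes that were being tried;
the device name and option values are assumptions, and the volfile
fragments are sketches rather than the exact configuration:

  echo deadline > /sys/block/sda/queue/scheduler   # switch the elevator

and, in the volfiles, translators along these lines:

  volume iot-client
    type performance/io-threads      # client-side io-threads
    option thread-count 4            # assumed value
    subvolumes iocache
  end-volume

  volume wb
    type performance/write-behind    # server-side write-behind
    option flush-behind on           # assumed option
    subvolumes brick
  end-volume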
Are you using single process client/server on each node, or separate
client and server processes on both nodes?
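For reference, the distinction being asked about (paths and names are
placeholders): separate processes means each node runs an export daemon
plus a client mount, while single-process means one volfile containing
both halves and one process doing both jobs:

  # separate server and client processes
  glusterfsd -f /etc/glusterfs/server.vol
  glusterfs  -f /etc/glusterfs/client.vol /mnt/glusterfs

  # single process: exports the local brick and mounts the replicated
  # volume from the same volfile
  glusterfs -f /etc/glusterfs/single.vol --volume-name replicate /mnt/glusterfs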
Then I noticed that the remote server was listed first in the volfile,
which means that it gets used as the lock server. Swapping the order
in the volfile on one server seemed to cause split-brain errors -- does
the order need to be the same on both servers?
Yes, the first server listed is the lock server. If you list them in a
different order on each node, locking will break. The listed order is
also the lock server fail-over order.
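In volfile terms the lock server is the first subvolume listed under the
replicate volume, so both nodes have to agree on which physical brick
that is; with assumed names node1-brick and node2-brick:

  volume replicate
    type cluster/replicate
    # node1-brick is first on *both* nodes, so node1 is the lock server;
    # if node1 dies, locking fails over to node2-brick
    subvolumes node1-brick node2-brick
  end-volume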
When I changed both servers' volfiles to use the active MySQL server as
the lock server, there was a dramatic performance increase, to roughly
the 2s/24s speed I saw with one server down. (I lost the exact stats.)
In summary, running in single process mode, client-side io-cache, and a
local lock server were the changes that made a significant difference.
That makes sense, especially the local lock server. The time it takes to
take a lock locally in memory is going to be orders of magnitude faster
than the ping time, even on gigabit ethernet.
Since I'm only going to have one server writing to the filesystem at a
time, I could mount it read-only (or not at all) on the other server.
Would that mean I could safely set data-lock-server-count=0 and
entry-lock-server-count=0 because I can be confident that there won't be
any conflicting writes? I don't want to take unnecessary risks, but it
seems like unnecessary overhead for my use case.
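For reference, what is being proposed would sit on the replicate volume
like this (whether it is safe here is exactly the question being asked):

  volume replicate
    type cluster/replicate
    option data-lock-server-count 0    # don't lock for file data writes
    option entry-lock-server-count 0   # don't lock for directory entry changes
    subvolumes node1-brick node2-brick
  end-volume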
Hmm... If the 1st server fails, the lock server role will fail over to
the next one, and you then fire up MySQL there. I thought you said it was only
the 2nd server that suffers the penalty. Since the 2nd server will fail
over locking from the 1st if the 1st fails, the performance should be
the same after fail-over. You'll still have the active server being the
lock server.
Gordan