On Fri, 25 Apr 2008, Gareth Bult wrote:
Well, here's the thing. I've tried to apply Gluster in 8 different "real world" scenarios, and each time I've failed, either because of bugs or because "this simply isn't what GlusterFS is designed for".
[...]
Suggesting that I'm either not tuning it properly or should be using an alternative filesystem is, I'm afraid, a bit of a cop-out. There are real problems here, and saying "yes, but Gluster is only designed to work in specific instances" is frankly a bit daft. If that were the case, then instead of a heavy sales pitch on the website along the lines of "Gluster is wonderful and does everything", it should say "Gluster will do x, y and z, only".
The impression I got from the site is that it isn't yet very mature, but is usable. IMO, it stops way short of the "Gluster is wonderful and does everything" claim.
Now, Zope is a long-standing web-based application server that I've been using for nearly 10 years; telling me it's "excessive" really doesn't fly. Trying to back up a Gluster AFR with rsync runs into similar problems when you have lots of small files - it takes far longer than it should.
How many nodes have you got? Have you tried running it with RHCS+GFS in an otherwise similar setup? If so, how did the performance compare?
Moving to the other end of the scale, AFR can't cope with large files either .. handling of sparse files doesn't work properly, and self-heal has no concept of repairing part of a file .. so sticking a 20GB file on a GlusterFS volume is just asking for trouble, as every time you restart a Gluster server (or every time one crashes) it'll crucify your network.
I thought about this, and there isn't really a way to do anything about it unless you relax the constraints. You could do an rsync-type rolling-checksum block sync, but that would both take up more CPU time and leave theoretical scope for the file ending up not identical on both ends. Whether that minute possibility of corruption (a change the hashing algorithm doesn't pick up) is a reasonable trade-off, I don't know. Perhaps if such a thing were implemented it should be made optional.
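To make the idea concrete, here is a minimal sketch (illustrative Python, not GlusterFS code; the block size and checksum choices are arbitrary, and a real rsync-style implementation would use a rolling window rather than fixed offsets) of re-copying only the blocks whose checksums differ:

# Illustrative sketch only, not GlusterFS code: a fixed-offset block
# comparison. Only blocks whose checksums differ get re-copied, so healing
# a 20GB file after a restart need not push all 20GB over the wire.
import hashlib
import os
import zlib

BLOCK = 128 * 1024   # assumed block size

def block_signatures(path):
    """Weak (adler32) + strong (md5) checksum for every block of a file."""
    sigs = []
    with open(path, "rb") as f:
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            sigs.append((zlib.adler32(data), hashlib.md5(data).digest()))
    return sigs

def sync(src_path, dst_path):
    """Copy only the blocks of src that differ from dst (dst must exist)."""
    src_sigs = block_signatures(src_path)
    dst_sigs = block_signatures(dst_path)
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        for i, sig in enumerate(src_sigs):
            if i < len(dst_sigs) and dst_sigs[i] == sig:
                continue                     # block already up to date
            src.seek(i * BLOCK)
            dst.seek(i * BLOCK)
            dst.write(src.read(BLOCK))       # transfer only the dirty block
        dst.truncate(os.path.getsize(src_path))

The corruption risk is exactly the one mentioned above: a changed block that happens to produce the same checksum pair would never be re-copied, which is the price paid for not copying the whole file unconditionally.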
Now, a couple of points: a. With regard to metadata, given two volumes mirrored via AFR, please can you explain to me why it's OK to do a data read operation against one node only, but not a metadata read operation .. and what would break if you read metadata from only one volume?
What breaks is the fact that the file may have been deleted or modified on the other node by the time you try to open it. A file's content is a feature of the file; whether the file is there and/or up to date is a feature of the metadata of the file and its parent directory. If you start loosening this, you might as well disconnect the nodes, run them in a deliberate split-brain configuration and resync periodically, with all the conflict and data loss that entails.
b. Looking back through the list, Gluster's non-caching mechanism for acquiring file-system information seems to be at the root of many of its performance issues. Is there no mileage in trying to address this issue?
How would you propose to obtain full POSIX locking/consistency without this? Look at the similar alternatives like DRBD + [GFS | OCFS2]. They either require shared storage (SAN) or a block-level replicated FS (DRBD). Split-braining in those cases is a non-option, and you need 100% functional fencing to forcefully disable the failed node, or you risk extensive corruption. GlusterFS, being file-based, works around the risk of trashing the entire FS on the block device. Having a shared/replicated block device works around part of the problem, because all the underlying data is replicated, but you'll find that GFS and OCFS2 also suffer similar performance penalties with lots of small files due to locking, especially at the directory level. If anything, the design of GlusterFS is better for that scenario.
Since in GFS there is no scope for split-brain operation, you can guarantee that everything that was written is what is accessible, so the main source of contention is the write locks. In GlusterFS the split-brain requirement is relaxed, but to compensate and still maintain FS consistency, the metadata has to be checked each time. If you need this relaxed further, then you have to move away from the POSIX locking requirements, which puts you out of the realm of GlusterFS use-cases and into a more WAN-oriented FS like Coda.
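To illustrate the difference (a conceptual sketch only, not the actual AFR code; the "version" number here stands in for the change accounting AFR really keeps in extended attributes), the metadata path has to ask every replica before it can trust an answer, while the data read can then be served by any one replica that passed that check:

# Conceptual sketch only, not GlusterFS source. The point is which
# operations must touch every subvolume and which can go to just one.

class Subvolume:
    """Stand-in for one replica (one server's brick); files maps
    path -> (version, data)."""
    def __init__(self, files=None):
        self.files = files or {}

    def stat(self, path):
        entry = self.files.get(path)        # None if missing on this replica
        return entry[0] if entry else None  # just the version number

    def read(self, path, offset, size):
        return self.files[path][1][offset:offset + size]

def afr_lookup(subvolumes, path):
    """Metadata path: has to ask *every* replica, otherwise a deletion or
    a newer copy on the replica we skipped would go unnoticed."""
    versions = [(sv, sv.stat(path)) for sv in subvolumes]  # one round trip each
    live = [(sv, v) for sv, v in versions if v is not None]
    if not live:
        raise IOError("No such file: " + path)
    newest = max(v for _, v in live)
    fresh = [sv for sv, v in live if v == newest]
    if len(fresh) != len(subvolumes):
        print("self-heal needed for " + path)              # replicas disagree
    return fresh

def afr_read(subvolumes, path, offset, size):
    """Data path: once lookup has decided which replicas are current, the
    actual read can be served by just one of them."""
    return afr_lookup(subvolumes, path)[0].read(path, offset, size)

Skipping the per-replica check in the lookup is precisely what would allow a deleted or stale copy to be served as if it were current.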
c. If I stop one of my two servers, AFR suddenly speeds up "a lot"! Would it be so bad if there were an additional option "subvolume-read-meta"? This would probably involve only a handful of additional lines of code, if that..?
How are your clients and servers organized? Are you using server-side AFR (servers replicating between themselves), or do you have the clients doing the AFR-ing? Do you have more clients than servers? Have you tried adjusting the timeout options to glusterfs (-a, -e)?
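For reference, a client-side AFR setup is roughly the volume spec below (a minimal 1.3-style sketch; the hostnames and volume names are made up, and the commented-out subvolume-read-meta line is only the hypothetical option suggested above, not something that exists):

# Minimal client-side AFR volume spec (GlusterFS 1.3-style); hostnames
# and volume names are made up for illustration.

volume remote1
  type protocol/client
  option transport-type tcp/client
  option remote-host server1
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp/client
  option remote-host server2
  option remote-subvolume brick
end-volume

volume afr0
  type cluster/afr
  subvolumes remote1 remote2
  # option subvolume-read-meta remote1
  # (the hypothetical option suggested above; it does not exist)
end-volume

In a setup like this, every client talks to both servers directly, so whether the AFR-ing happens on the clients or between the servers changes where the metadata round trips land.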
Gordan