On Thu, Dec 27, 2012 at 06:53:46PM -0500, John Mark Walker wrote:
> I invite all sorts of disagreeable comments, and I'm all for public
> discussion of things - as can be seen in this list's archives. But, for
> better or worse, we've chosen the approach that we have. Anyone who would
> like to challenge that approach is welcome to take up that discussion with
> our developers on gluster-devel. This list is for those who need help
> using glusterfs.
>
> I am sorry that you haven't been able to deploy glusterfs in production.
> Discussing how and why glusterfs works - or doesn't work - for particular
> use cases is welcome on this list. Starting off a discussion about how
> the entire approach is unworkable is kind of counter-productive and not
> exactly helpful to those of us who just want to use the thing.

For me, the biggest problems with glusterfs are not in its design, feature
set or performance; they are around what happens when something goes wrong.
As I perceive them, the issues are:

1. An almost total lack of error reporting, beyond incomprehensible entries
   in log files on a completely different machine, made very difficult to
   find because they are mixed in with all sorts of other incomprehensible
   log entries.

2. Incomplete documentation. This breaks down further as:

2a. A total lack of architecture and implementation documentation - such as
    what the translators are and how they work internally, what a GFID is,
    what xattrs are stored where and what they mean, and all the on-disk
    states you can expect to see during replication and healing.

    Without this level of documentation, it's impossible to interpret the
    log messages from (1) short of reverse-engineering the source code
    (which is also very minimalist when it comes to comments); and hence
    it's impossible to reason about what has happened when the system is
    misbehaving, and what would be the correct and safe intervention to
    make.

    glusterfs 2.x actually had fairly comprehensive internals
    documentation, but this has all been stripped out in 3.x to turn it
    into a "black box". Conversely, development on 3.x has diverged enough
    from 2.x to make the 2.x documentation unusable.

2b. An almost total lack of procedural documentation, such as "to replace a
    failed server with another one, follow these steps" (which in that case
    involves manually copying peer UUIDs from one server to another), or
    "if volume rebalance gets stuck, do this".

When you come across any of these issues you end up asking the list, and to
be fair the list generally responds promptly and helpfully - but that
approach doesn't scale, and doesn't necessarily help if you have a storage
problem at 3am.

For these reasons, I am holding back from deploying any of the more
interesting features of glusterfs, such as replicated volumes and
distributed volumes which might grow and need rebalancing. And without
those, I may as well go back to standard NFS and rsync.

And yes, I have raised a number of bug reports for specific issues, but
reporting a bug whenever you come across a problem in testing or production
is not the right answer. It seems to me that all these edge and error cases
and recovery procedures should already have been developed and tested *as a
matter of course*, for a service as critical as storage.

I'm not saying there is no error handling in glusterfs, because that's
clearly not true. What I'm saying is that any complex system is bound to
have states where processes cannot proceed without external assistance, and
these cases all need to be tested, and you need to have good error
reporting and good documentation.

I know I'm not the only person to have been affected, because there is a
steady stream of people on this list who are asking for help with how to
cope with replication and rebalancing failures.

Please don't consider the above as non-constructive. I count myself amongst
"those of us who just want to use the thing". But right now, I cannot
wholeheartedly recommend it to my colleagues, because I cannot confidently
say that I or they would be able to handle the failure scenarios I have
already experienced, or other ones which may occur in the future.

Regards,

Brian.