+1 for 2b. I am in the planning stages for an RHS 2.0 deployment and I too
have suggested a "cookbook"-style guide of step-by-step procedures to my
Red Hat Solution Architect. What can I do to get this moved up the
priority list?

Cheers,
Fred

On Wed, Jan 2, 2013 at 12:49 PM, Brian Candler <B.Candler at pobox.com> wrote:
> On Thu, Dec 27, 2012 at 06:53:46PM -0500, John Mark Walker wrote:
> > I invite all sorts of disagreeable comments, and I'm all for public
> > discussion of things - as can be seen in this list's archives. But, for
> > better or worse, we've chosen the approach that we have. Anyone who
> > would like to challenge that approach is welcome to take up that
> > discussion with our developers on gluster-devel. This list is for those
> > who need help using glusterfs.
> >
> > I am sorry that you haven't been able to deploy glusterfs in production.
> > Discussing how and why glusterfs works - or doesn't work - for
> > particular use cases is welcome on this list. Starting off a discussion
> > about how the entire approach is unworkable is kind of
> > counter-productive and not exactly helpful to those of us who just want
> > to use the thing.
>
> For me, the biggest problems with glusterfs are not in its design, feature
> set or performance; they are around what happens when something goes
> wrong. As I perceive them, the issues are:
>
> 1. An almost total lack of error reporting, beyond incomprehensible
> entries in log files on a completely different machine, made very
> difficult to find because they are mixed in with all sorts of other
> incomprehensible log entries.
>
> 2. Incomplete documentation. This breaks down further as:
>
> 2a. A total lack of architecture and implementation documentation - such
> as what the translators are and how they work internally, what a GFID is,
> what xattrs are stored where and what they mean, and all the on-disk
> states you can expect to see during replication and healing. Without this
> level of documentation, it's impossible to interpret the log messages from
> (1) short of reverse-engineering the source code (which is also very
> minimalist when it comes to comments); and hence it's impossible to reason
> about what has happened when the system is misbehaving, and what would be
> the correct and safe intervention to make.
>
> glusterfs 2.x actually had fairly comprehensive internals documentation,
> but this has all been stripped out in 3.x to turn it into a "black box".
> Conversely, development on 3.x has diverged enough from 2.x to make the
> 2.x documentation unusable.
>
> 2b. An almost total lack of procedural documentation, such as "to replace
> a failed server with another one, follow these steps" (which in that case
> involves manually copying peer UUIDs from one server to another), or "if
> volume rebalance gets stuck, do this". When you come across any of these
> issues you end up asking the list, and to be fair the list generally
> responds promptly and helpfully - but that approach doesn't scale, and
> doesn't necessarily help if you have a storage problem at 3am.
>
> For these reasons, I am holding back from deploying any of the more
> interesting features of glusterfs, such as replicated volumes and
> distributed volumes which might grow and need rebalancing. And without
> those, I may as well go back to standard NFS and rsync.
>
> And yes, I have raised a number of bug reports for specific issues, but
> reporting a bug whenever you come across a problem in testing or
> production is not the right answer.
> It seems to me that all these edge and error cases and recovery
> procedures should already have been developed and tested *as a matter of
> course*, for a service as critical as storage.
>
> I'm not saying there is no error handling in glusterfs, because that's
> clearly not true. What I'm saying is that any complex system is bound to
> have states where processes cannot proceed without external assistance,
> and these cases all need to be tested, and you need to have good error
> reporting and good documentation.
>
> I know I'm not the only person to have been affected, because there is a
> steady stream of people on this list who are asking for help with how to
> cope with replication and rebalancing failures.
>
> Please don't consider the above as non-constructive. I count myself
> amongst "those of us who just want to use the thing". But right now, I
> cannot wholeheartedly recommend it to my colleagues, because I cannot
> confidently say that I or they would be able to handle the failure
> scenarios I have already experienced, or other ones which may occur in
> the future.
>
> Regards,
>
> Brian.
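
For anyone who finds this thread while chasing the 2a questions above
(what a GFID is, which xattrs live where), the metadata can at least be
inspected directly on a brick. The sketch below is for the 3.3.x series as
I understand it; /export/brick1, "myvol" and the file path are placeholder
names, so adjust them for your own layout and double-check the meaning of
the attributes against your version before acting on them:

    # Run as root on a brick directory, not on the FUSE mount.
    # /export/brick1 and "myvol" are example names.
    getfattr -d -m . -e hex /export/brick1/path/to/file

    # Typical attributes in the output:
    #   trusted.gfid=0x...                  the file's cluster-wide identity
    #   trusted.afr.myvol-client-0=0x...    AFR (replication) change counters,
    #   trusted.afr.myvol-client-1=0x...    one per replica; non-zero values
    #                                       generally mean self-heal is pending
    #
    # The GFID also names a hard link under the brick's hidden .glusterfs
    # directory, bucketed by the leading bytes of the GFID:
    ls /export/brick1/.glusterfs/

    # Files listed by "gluster volume heal myvol info" are the ones whose
    # AFR counters have not yet been cleared.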
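
Likewise for 2b, the "replace a failed server" recipe that circulates on
this list for 3.3.x, reusing the dead node's peer UUID, looks roughly like
the following. Every path, command and hostname here is a sketch to check
against your own version and rehearse on a throwaway cluster first, not a
procedure anyone should take on trust:

    # Assumes the failed node is rebuilt with the SAME hostname and IP,
    # glusterfs is installed, and glusterd has not been started yet.
    # "dead-server", "surviving-server" and "myvol" are placeholders.

    # 1. On a surviving peer, find the UUID the cluster still expects for
    #    the dead node (the file name under peers/ is that UUID):
    grep -l dead-server /var/lib/glusterd/peers/*

    # 2. On the rebuilt node, hand glusterd that old identity, then start it:
    echo "UUID=<uuid-from-step-1>" > /var/lib/glusterd/glusterd.info
    service glusterd start

    # 3. Probe a surviving peer from the rebuilt node so peer and volume
    #    definitions are synced back, then check the cluster's view:
    gluster peer probe surviving-server
    gluster peer status

    # 4. For replicated volumes, let self-heal repopulate the empty bricks
    #    and watch its progress:
    gluster volume heal myvol full
    gluster volume heal myvol info

(The other 2b example, a stuck rebalance, at least has "gluster volume
rebalance myvol status" and "gluster volume rebalance myvol stop" to poke
at, but what to do after that is exactly the kind of thing a cookbook
should be spelling out.)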