On 03/05/2013 09:57 AM, Brian Candler wrote:
> On Tue, Mar 05, 2013 at 08:33:28AM -0800, Joe Julian wrote:
>> It comes up on this list from time to time that there's not
>> sufficient documentation on troubleshooting. I assume that's what
>> some people mean when they refer to disappointing documentation as
>> the current documentation is far more detailed and useful than it
>> was 3 years ago when I got started. I'm not really sure what's being
>> asked for here, nor am I sure how one would document how to
>> troubleshoot. In my mind, if there's a trouble that can be
>> documented with a clear path to resolution, then a bug report should
>> be filed and that should be fixed. Any other cases that cannot be
>> coded for require human intervention and are already documented.
> When people come to this list and say "I am seeing split brain errors" or
> "ls shows question marks for file attributes"

Article(s) on the official Q&A Site but that [censored] site can't find
it with a search. Grrr.

> or "I need to replace a failed server with a new one"

Article also on the official Q&A Site but again search isn't finding
them. I'll try to grab the contents of those and paste them into the
wiki somewhere (unless you do it first. It is a wiki after all).

> or "probing a server fails",

Agreed. This would be good. Does anyone actually know how to answer
this? Please write it up on the wiki. I know I even have trouble
sometimes figuring out why someone's probe fails.

> I don't think there's
> any official documentation to help them.
>
> "Documenting how to troubleshoot" would include what log messages you should
> look for and what they mean, what xattrs you should expect to see on the
> bricks and what they mean (for each case of distributed, replicated etc).
> Given a basic checklist of these things, it would be easy for users to
> report to the list "I checked A, B and C and the output from B was XXXX when
> the docs say it should be YYYY on a working system", which is at least a
> starting point.

This is where all open source seems to hit problems. Sure, there's
error messages (at least they're not "Error ##" like mysql does...) but
they seem to generally only make sense to whomever wrote the software.

There are 7216 log entries in the source. That's a lot of man-hours to
document all of those even without any degree of detail. Now, there are
only 136 critical errors but I'm not sure I've ever seen one of those.
2991 at the level of "error" so I'm really not sure how that could be
handled. Even if someone could volunteer 8 hours/day to spend 15 minutes
describing each error message, it would take them around 4 1/2 months.
That's longer than a production cycle (granted, once they were
documented the production cycle would be unlikely to produce nearly 3000
new error messages).

I'd be willing to make the list and document 1 or 2 a day. Anyone else?

> As far as I'm aware, the official admin guide is completely oblivious to
> internals like this.
>
> Users may be able to find suggestions by perusing mailing list archives, or
> by trying gluster 2.x wiki documentation (which may be stale), or some blog
> postings.

Thanks for pointing these out. Some I (obviously) wasn't even aware were
a problem.

By the way - if anyone wants to copy-paste stuff from my blog into the
wiki, feel free. I keep meaning to but have been behind schedule at work
and just haven't had enough free time lately.
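
For anyone who wants to check the "4 1/2 months" estimate for the 2991 error-level messages, here is a minimal Python sketch of the arithmetic. The message counts, the 15 minutes per message and the 8 hours/day are the figures quoted above; the 21 working days per month is an assumption that is not stated in the message, and the function name is only illustrative.

```python
# Figures quoted in the message above.
ERROR_LEVEL_MESSAGES = 2991
MINUTES_PER_MESSAGE = 15
HOURS_PER_DAY = 8

# Assumption (not from the message): roughly 21 working days per month.
WORKING_DAYS_PER_MONTH = 21


def documentation_effort(messages, minutes_each, hours_per_day, days_per_month):
    """Return (total_hours, working_days, months) needed to document `messages` entries."""
    total_hours = messages * minutes_each / 60
    working_days = total_hours / hours_per_day
    months = working_days / days_per_month
    return total_hours, working_days, months


hours, days, months = documentation_effort(
    ERROR_LEVEL_MESSAGES, MINUTES_PER_MESSAGE, HOURS_PER_DAY, WORKING_DAYS_PER_MONTH
)
print(f"{hours:.0f} hours, {days:.0f} working days, {months:.1f} months")
# prints "748 hours, 93 working days, 4.5 months"
```

Under the same assumptions, plugging in 136 (the critical-level count) gives 34 hours, or a little over four working days, which may make the critical errors a more realistic place to start.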