On Monday 03 January 2005 21:22, Peter T. Breuer wrote:
> maarten <maarten@xxxxxxxxxxxx> wrote:
> > The chance of a PSU blowing up or lightning striking is, reasonably, much
> > less than an isolated disk failure. If this simple fact is not true for
> > you
>
> Oh? We have about 20 a year. Maybe three of them are planned. But
> those are the worst ones! - the electrical department's method of
> "testing" the lines is to switch off the rails then pulse them up and
> down. Surge tests or something. When we can we switch everything off
> beforehand. But then we also get to deal with the amateur contributions
> from the city power people.

It goes on and on below, but this, your first paragraph, is already striking(!)
You actually say that the planned outages are worse than the others! OMG.
Who taught you how to plan? Isn't planning the act of anticipating things and
acting accordingly, so as to minimize the impact? Yet your planning is so bad
that the planned maintenance is actually worse than the impromptu outages.
I... I am speechless. Really. You take the cake.

But from the rest of your post it also seems you define a "total system
failure" as something entirely different than the rest of us do (presumably).
You count either planned or unplanned outages as failures, whereas most of us
would call that downtime, not system failure, let alone "total". If you have a
problematic UPS system, or mentally challenged UPS engineers, that does not
constitute a failure IN YOUR server. Same for a broken network. A total system
failure is where the single computer system we're focusing on goes down or is
unresponsive. You can't say "your server" is down when all that is happening
is that someone pulled the UTP from your remote console...!

> Yes, my PhD is in electrical engineering. Have I sent them sarcastic
> letters explaining how to test lines using a dummy load? Yes. Does the
> physics department also want to place them in a vat of slowly reheating
> liquid nitrogen? Yes. Does it make any difference? No.

I don't know what you're on about, nor do I really care. I repeat: your UPS or
power company failing does not constitute a _server_ failure. It is downtime.
Downtime != system failure (although the reverse obviously holds). We shall
forthwith define a system failure as a state where _repairs_ to the server are
necessary for it to start working again. Not just the reconnection of mains
plugs. Okay with that?

> > I don't understand your math. For one, percentage is measured from 0 to
> > 100,
>
> No, it's measured from 0 to infinity. Occasionally from negative
> infinity to positive infinity. Did I mention that I have two degrees in
> pure mathematics? We can discuss nonstandard interpretations of Peano's
> axioms then.

Sigh. Look up what "per cent" means (it's Latin). Also, since you seem to
pride yourself on your leet math skills, remember that your professor said a
chance lies between 0 (impossible) and 1 (certain). Two, or 12, cannot be the
outcome of any probability calculation.

> > But besides that, I'd wager that from your list number (3) has, by far,
> > the smallest chance of occurring.
>
> Except of course, that you would lose, since not only did I SAY that it
> had the highest chance, but I gave a numerical estimate for it that is
> 120 times as high as that I gave for (1).

Then your data center cannot seriously call itself that. Or your staff cannot
call themselves capable. Choose whatever suits you. 12 outages a year...
Bwaaah. Even a random home Windows box has fewer outages than that(!)
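Just to put some numbers on that, taking the 10% per-disk-per-year failure
expectation you quote yourself further down, and assuming (my assumption, not
yours) that the two halves of a mirror fail independently:

  P(a given disk fails within a year)         = 0.10
  P(at least one disk of a pair fails)        = 1 - (1 - 0.10)^2 = 0.19
  P(both halves of a pair fail within a year) = 0.10 * 0.10      = 0.01

By your own figure you should see disk trouble on a mirrored pair less than
once every five years, and lose both halves roughly once a century. Yet you
report a system outage every month. Draw your own conclusion.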
> > Choosing between (1) and (2) is more difficult,
>
> Well, I said it doesn't matter, because everything is swamped by (3).

Which I disagreed with. I stated that (3) is normally the _least_ likely.

> > my experiences with IDE disks are definitely that it will take the system
> > down, but that is very biased since I always used non-mirrored swap.
>
> It's the same principle. There exists a common mode for failure.
> Bayesian calculations then tell you that there is a strong likelihood of
> the whole system coming down in conjunction with the disk coming down.

Nope, there isn't. Bayesian or not, hotswap drives on hardware RAID cards
prove you wrong, day in, day out. So either you're talking about Linux with md
specifically, or you should wake up and smell the coffee.

> > > Not in my experience. See above. I'd say each disk has about a 10%
> > > failure expectation per year. Whereas I can guarantee that an
> > > unexpected system failure will occur about once a month, on every
> > > important system.
>
> There you are. I said it again.

You quote yourself and you agree with it. Now why doesn't that surprise me?

> Hey, I even took down my own home server by accident over new year!
> Spoiled its 222 day uptime.

Your own user error hardly counts as a total system failure, don't you think?

> > I would not be alone in thinking that figure is VERY high. My uptimes
>
> It isn't. A random look at servers tells me:
>
> bajo      up 77+00:23, 1 user,  load 0.28, 0.39, 0.48
> balafon   up 25+08:30, 0 users, load 0.47, 0.14, 0.05
> dino      up 77+01:15, 0 users, load 0.00, 0.00, 0.00
> guitarra  up 19+02:15, 0 users, load 0.20, 0.07, 0.04
> itserv    up 77+11:31, 0 users, load 0.01, 0.02, 0.01
> itserv2   up 20+00:40, 1 user,  load 0.05, 0.13, 0.16
> lmserv    up 77+11:32, 0 users, load 0.34, 0.13, 0.08
> lmserv2   up 20+00:49, 1 user,  load 0.14, 0.20, 0.23
> nbd       up 24+04:12, 0 users, load 0.08, 0.08, 0.02
> oboe      up 77+02:39, 3 users, load 0.00, 0.00, 0.00
> piano     up 77+11:55, 0 users, load 0.00, 0.00, 0.00
> trombon   up 24+08:14, 2 users, load 0.00, 0.00, 0.00
> violin    up 77+12:00, 4 users, load 0.00, 0.00, 0.00
> xilofon   up 73+01:08, 0 users, load 0.00, 0.00, 0.00
> xml       up 33+02:29, 5 users, load 0.60, 0.64, 0.67
>
> (one net). Looks like a major power outage 77 days ago, and a smaller
> event 24 and 20 days ago. The event at 20 days ago looks like
> sysadmins. Both Trombon and Nbd survived it and they're on separate
> (different) UPSs. The servers which are up 77 days are on a huge UPS
> that Lmserv2 and Itserv2 should also be on, as far as I know. So
> somebody took them off the UPS within ten minutes of each other. Looks
> like maintenance moving racks.

Okay, once again: your loss of power has nothing to do with a server failure.
You can't say that your engine died and needs repair just because you forgot
to fill the gas tank. You just add gas and away you go. No repair. No damage.
Just downtime. Inconvenient as it may be, it is not relevant here.

> Well, they have no chance to be here. There are several planned power
> outs a year for the electrical department to do their silly tricks
> with. When that happens they take the weekend over it.

First off, since that is planned, it is _your_ job to be there beforehand and
properly shut down all those systems prior to losing the power. Secondly,
reevaluate your UPS setup...!!! How is it even possible that we're discussing
such obvious measures? UPSes are there for a reason. If your upstream UPS
systems are unreliable, then add your own UPSes, one per server if need be.
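For the planned weekends it is even easier; one line, typed before you go
home, shuts a box down cleanly in time (the time and the wording below are
made up, obviously):

  # schedule a clean halt a quarter of an hour before the announced power cut
  shutdown -h 22:45 "Planned power work tonight; going down, back after the weekend"

Run that on every box (or loop over them with ssh) and the only thing left to
do on Monday is press the power buttons.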
It really isn't rocket science...

> > If you have building maintenance people and other random staff that can
> > access your server room unattended and unmonitored, you have far worse
> > problems than making decisions about RAID levels. IMNSHO.
>
> Oh, they most certainly can't access the server rooms. The techs would
> have done that on their own, but they would (obviously) have needed to
> move the machines for that, and turn them off. Ah. But yes, the guy
> with the insecticide has the key to everywhere, and is probably a
> gardener. I've seen him at it. He sprays all the corners of the
> corridors, along the edge of the wall and floor, then does the same
> inside the rooms.

Oh really. Nice. Do you even realize that since your gardener or whatever can
access everything, and will spray stuff around indiscriminately, he could very
well incinerate your server room (or the whole building, for that matter)?

It's really very simple. You tell him that he has two options:

A) He agrees to only enter the server rooms in case of immediate emergency and
   will refrain from entering the room without your supervision in all other
   cases. You let him sign a paper stating as much.

or

B) You change the lock on the server room, thus disallowing all access. You
   agree to personally carry out all 'maintenance' in that room.

> The point is that most foul-ups are created by the humans, whether
> technoid or gardenoid, or hole-diggeroid.

And that is exactly why you should make sure their access is limited!

> > By your description you could almost be the guy the joke with the
> > recurring 7 o'clock system crash is about (where the cleaning lady
> > unplugs the server every morning in order to plug in her vacuum cleaner)
> > ;-)
>
> Oh, the cleaning ladies do their share of damage. They are required BY
> LAW to clean the keyboards. They do so by picking them up in their left
> hand at the lower left corner, and rubbing a rag over them.

Whoa, what special country are you in? In my neck of the woods, I can disallow
any and all cleaning if I deem it hazardous to the cleaner and/or the
equipment. Next you'll be telling me that they clean your backup tapes and/or
enclosures with a rag and soap, and that you are required by law to grant them
that right...? Do you think they have cleaning crews in nuclear facilities? If
so, do you think they are allowed (by law, no less) to go anywhere near the
control panels that regulate the reactor process? (Nope, I didn't think you
did.)

> Their left hand is where the ctl and alt keys are.
>
> Solution is not to leave keyboard in the room. Use a whaddyamacallit
> switch and attach one keyboard to that whenever one needs to access
> anything. Also use thwapping great power cables one inch thick that
> they cannot move.

Oh my. Oh my. Oh my. I cannot believe you. Have you ever heard of locking the
console, perhaps?!? You know, the state where nothing other than typing your
password will do anything? You can do that _most_certainly_ with KVM switches,
in case your OS is too stubborn to disregard the various three-finger
combinations we all know.

> And I won't mention the learning episodes with the linux debugger monitor
> activated by pressing "pause".

man xlock. man vlock. Djeez... is this newbie time now?

> Once I watched the lady cleaning my office. She SPRAYED the back of the
> monitor! I YELPED! I tried to explain to her about voltages, and said
> that she wouldn't clean her tv at home that way - oh yes she did!

Exactly my point.
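And since it apparently does need spelling out: you don't hide the keyboards,
you lock the consoles. A minimal example (vlock for the text consoles, xlock
if the box runs X):

  vlock -a             # locks all virtual consoles until a password is entered
  xlock -mode blank    # blanks and locks an X display the same way

Leave the console locked whenever you walk away, and the cleaning lady can rub
the ctrl and alt keys all she wants.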
My suggestion to you (if simply explaining doesn't help): call the cleaner
over to an old, unused 14" CRT. Spray a lot of water-based, or better,
flammable stuff into and onto the back of it. Wait for the smoke or the sparks
to come flying...! Stand back and enjoy. ;-)

> You may not agree, but you would be rather wrong in persisting in that
> idea in face of evidence that you can easily accumulate yourself, like
> the figures I randomly checked above.

Nope. However, I will admit that, in light of everything you said, your
environment is very unsafe, very unreliable and frankly just unfit to house a
data center worth its name. I'm sure others will agree with me. You can't just
go around saying that 12 power outages per year are _normal_ and expected. You
can't pretend nothing is very, very wrong at your site. I've experienced 1
(count 'em: one) power outage in our last colo in over four years, and boy did
my management give them (the colo facility) hell over it!

> > Not only do disk failures occur more often than full system
> > failures,
>
> No they don't - by about 12 to 1.

Only in your world, yes.

> > disk failures are also much more time-consuming to recover from.
>
> No they aren't - we just put in another one, and copy the standard
> image over it (or in the case of a server, copy its twin, but then
> servers don't blow disks all that often, but when they do they blow
> ALL of them as well, as whatever blew one will blow the others in due
> course - likely heat).

If you had used a colo, dust would not have led to premature fan failure (in
my experience). There is no smoking in colo facilities, expressly for that
reason (and the fire hazard, obviously). But even then, you could remotely
monitor the fan health and/or the temperature. I still stand by my statement:
disk failures are more time-consuming to repair than other failures.
Motherboards don't need data restored to them, much less finding out how
complete the data backup was and verifying that everything works again as
expected.

> > Compare changing a system board or PSU with changing a drive and finding,
> > copying and verifying a backup (if you even have one that's 100% up to
> > date)
>
> We have. For one thing we have identical pairs of servers, absolutely
> equal, md5summed and checked. The identity-dependent scripts on them
> check who they are on and do the appropriate thing depending on who they
> find they are on.

Good for you. Well planned. It just amazes me now more than ever that the rest
of the setup seems so broken / unstable. On the other hand, with 12 power
outages yearly, you most definitely need two redundant servers.

> > The point here was, disk failures being more common than other
> > failures...
>
> But they aren't. If you have only 25% chance of failure per disk per
> year, then that makes system outages much more likely, since they
> happen at about one per month (here!).

With the emphasis on your word "(here!)", yes.

> If it isn't faulty scsi cables, it will be overheating cpus. Dust in
> the very dry air here kills all fan bearings within 6 months to one
> year.

Colo facilities have a strict no-smoking rule, and air filters to clean what
enters. I can guarantee you that a good fan in a good colo will live 4++
years. Excuse me, but dry air, my ***. Clean air is not dependent on dryness;
it is dependent on cleanliness.
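And remotely monitoring the fans and temperatures, which I mentioned above, is
a five-minute job. Something along these lines would do, assuming lm-sensors
is installed and configured (the ALARM matching and the address are my own
made-up example; adapt it to your setup):

  #!/bin/sh
  # run from cron every ten minutes or so: mail the sensor output whenever
  # a reading has tripped one of the configured ALARM limits
  sensors 2>&1 | grep -qi ALARM || exit 0
  sensors 2>&1 | mail -s "sensor ALARM on `hostname`" admin@example.com

A dying fan shows up as a sagging RPM reading long before the CPU cooks, so
you get to swap it at your leisure.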
> My defense against that is to heavily underclock all machines.

Um, yeah. Great thinking. Do you underclock the PSU also, and the disks?
Maybe you could run a 15000 rpm SCSI drive at 10000, see what that gives?
Sorry for getting overly sarcastic here, but there really is no end to the
stupidities, is there?

> > > No way! I hate tapes. I backup to other disks.
> >
> > Then for your sake, I hope they're kept offline, in a safe.
>
> No, they're kept online. Why? What would be the point of having them in
> a safe? Then they'd be unavailable!

I'll give you a few pointers then. If your disks are online instead of in a
safe, they are vulnerable to:

* Intrusions / viruses
* User / admin error (you yourself stated how often this happens!)
* Fire
* Lightning strike
* Theft

> > Change admins.
>
> Can't. They're as good as they get. Hey, *I* even do the sabotage
> sometimes. I'm probably only about 99% accurate, and I can certainly
> write a hundred commands in a day.

Every admin makes mistakes. But most see it before it has dire consequences.

> > I could understand an admin making typing errors and such, but then again
> > that would not usually lead to a total system failure.
>
> Of course it would. You try working remotely to upgrade the sshd and
> finally killing off the old one, only to discover that you kill the
> wrong one and lock yourself out, while the deadman script on the server

Yes, been there, done that...

> tries fruitlessly to restart a misconfigured server, and then finally
> decides after an hour to give up and reboot as a last resort, then
> can't bring itself back up because of something else you did that you
> were intending to finish but didn't get the opportunity to.

This will happen only once (if you're good), maybe twice (if you're adequate),
but if it happens to you three times or more, then you need to find a
different line of work, or start drinking less and paying more attention at
work. I'm not kidding. The good admin is not he who never makes mistakes, but
he who (quickly) learns from them.

> > Some daemon not working,
> > sure. Good admins review or test their changes,
>
> And sometimes miss the problem.

Yes, but apache not restarting due to a typo hardly constitutes a system
failure. Come on now!

> > for one thing, and in most
> > cases any such mistake is rectified much simpler and faster than a failed
> > disk anyway. Except maybe for lilo errors with no boot media available.
> > ;-\
>
> Well, you can go out to the site in the middle of the night to reboot!
> Changes are made out of working hours so as not to disturb the users.

Sometimes, depending on the SLA the client has. In any case, I do tend to
schedule complex, error-prone work for when I am at the console. Look, any way
you want to turn it, messing with reconfiguring boot managers when you are not
at the console is asking for trouble. If you have no other recourse, test it
first on a local machine with the exact same setup. For instance, I learned
from my sshd error to always start a second sshd on port 2222 prior to killing
off the main one. You could also have a 'screen' session running with a sleep
600 followed by some rescue command. Be creative. Be cautious (or paranoid).
Learn.
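To make that concrete, here is roughly what I mean (the port, the timeout and
the init script name are just examples; adjust them to your own setup):

  # keep a spare door open on another port before touching the main sshd
  /usr/sbin/sshd -p 2222

  # and/or leave a dead man's switch behind that undoes the damage in case
  # you lock yourself out and cannot get back in to cancel it
  screen -dmS rescue sh -c 'sleep 600; /etc/init.d/ssh restart'

If the upgrade goes fine, you log back in, kill the spare sshd and the screen
session, and nobody ever knows. If it goes wrong, you wait ten minutes.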
> > That is not the whole truth. To be fair, the mechanism works like this:
> > With raid, you have a 50% chance the wrong, corrupted, data is used.
> > Without raid, thus only having a single disk, the chance of using the
> > corrupted data is 100% (obviously, since there is only one source)
>
> That is one particular spin on it.

It is _the_ spin on it.

> > Ergo: the same as with a single disk. No change.
>
> Except that it is not the case. With a single disk you are CERTAIN to
> detect the problem (if it is detectable) when you run the fsck at
> reboot. With a RAID1 mirror you are only 50% likely to detect the
> detectable problem, because you may choose to read the "wrong" (correct
> :) disk at the crucial point in the fsck. Then you have to hope that
> the right disk fails next, when it fails, or else you will be left holding
> the detectably wrong, unchecked data.

First off, fsck doesn't usually run at reboot. Just the journal is replayed;
only when there are severe errors is a full fsck forced. You're not telling me
that you fsck your 600-gigabyte arrays upon each reboot, are you? It will give
you multiple hours of added downtime if you do.

Secondly, if you _are_ so paranoid about it that you indeed do an fsck, what
is keeping you from breaking the mirror, fscking the underlying physical
devices, and reassembling if all is okay? Added benefit: if all is not well,
you get to choose which half of the mirror you decide to keep. Problem solved.

And third, I am not convinced the error detection is able to detect all
errors. For starters, if a crash occurred while disk one was completely
written but disk two had not yet begun, both checksums would be correct, so no
fsck would notice. Secondly, I doubt that the checksum mechanism is that good.
It's just a trivial checksum; it's bound to overlook some errors.

And finally: if you would indeed end up with the "detectably wrong, unchecked
data", you can still run an fsck on it, just as with the single disk. The fsck
will repair it (or not), just as it would on the single disk you would have
had. In any case, seeing as you do 12 reboots a year :-P the chances are very,
very slim that you hit the wrong ("right") half of the mirror at every one of
those 12 reboots: with a 50/50 pick each time that is (1/2)^12, roughly 0.02%.
So you'll surely notice the corruption at some point.

Note that despite all this I am all for an enhancement to mdadm providing a
method to check the mirror (or the parity) for consistency. But that is beside
the point.

> > No, you have a zero chance of detection, since there is nothing to
> > compare TO.
>
> That is not the case. You have every chance in the world of detecting
> it - you know what fsck does.

Well, when did you last fsck a terabyte-sized array without an immediate need
for it? I know I haven't; my time is too valuable to wait half a day, or more,
for that fsck to finish.

Maarten
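P.S. For completeness, the break-check-reassemble routine I mentioned above is
only a handful of commands. A rough sketch, assuming a two-disk RAID1 /dev/md0
with old-style (0.90) superblocks at the end of the members, so that each half
is directly readable as a filesystem; the device names are made up:

  # drop one half out of the mirror (do this while the filesystem is unmounted
  # or quiescent, or the copy you check will simply look dirty)
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

  # read-only check of the detached half
  fsck -n /dev/sdb1

  # happy? add it back and let md resync it; if not, you still have the other
  # half untouched and can decide which one to keep
  mdadm /dev/md0 --add /dev/sdb1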