>>> Christian Balzer <chibi@xxxxxxx> wrote on Thursday, 14 April 2016 at 17:00:

> Hello,
>
> [reduced to ceph-users]
>
> On Thu, 14 Apr 2016 11:43:07 +0200 Steffen Weißgerber wrote:
>
>> >>> Christian Balzer <chibi@xxxxxxx> wrote on Tuesday, 12 April 2016
>> >>> at 01:39:
>>
>> > Hello,
>> >
>>
>> Hi,
>>
>> > I'm officially only allowed to do (preventative) maintenance during
>> > weekend nights on our main production cluster.
>> > That would mean 13 ruined weekends at the realistic rate of 1 OSD per
>> > night, so you can see where my lack of enthusiasm for OSD recreation
>> > comes from.
>> >
>>
>> I'm wondering a lot about that. We introduced Ceph for VMs on RBD precisely
>> so we would not have to move maintenance into the night shift.
>>
> This is Japan.
> It makes the most anal-retentive people/rules in "der alten Heimat" (the old
> homeland) look like a bunch of hippies on drugs.
>
> Note the "preventative", and I should have put "officially" in quotes, like
> that.
>
> I can do whatever I feel comfortable with on our other production cluster,
> since there aren't hundreds of customers with very, VERY tight SLAs on it.
>
> So if I were to tell my boss that I want to renew all OSDs he'd say "Sure,
> but at a time when, if anything goes wrong, it will not impact any customer
> unexpectedly", meaning the official maintenance windows...
>

For "all OSDs" (at the same time) I would agree. But when we talk about
changing them one by one, what is the effect on a cluster of x OSDs spread
over y nodes ... Hmm.

>> My understanding of Ceph is that it was also made to be reliable storage
>> in case of hardware failure.
>>
> Reliable, yes. With certain limitations, see below.
>
>> So what's the difference, in effect for the end user, between maintaining
>> an OSD and its failure? In both cases it should be none.
>>
> Ideally, yes.
> Note that an OSD failure can result in slow I/O (to the point of what
> would be considered a service interruption) depending on the failure mode
> and the various timeout settings.
>
> So planned and properly executed maintenance has less impact.
> None (or at least nothing noticeable) IF your cluster has enough resources
> and/or all the tuning has been done correctly.
>
>> Maintaining OSDs should be routine, so that you're confident your
>> application stays safe while hardware fails within the unused reserve
>> one has configured.
>>
> IO is a very fickle beast; it may perform splendidly at 2000 ops/s just to
> go totally down the drain at 2100.
> Knowing your capacity and reserve isn't straightforward, especially not in
> a live environment as compared to synthetic tests.
>
> In short, could that cluster (now, after upgrades and adding a cache tier)
> handle OSD renewals at any given time?
> Absolutely.
> Will I get an official blessing to do so?
> No effing way.
>

Understood. A setup with cache tiering is more complex than plain OSDs with
journals on SSD.

But that reminds me of a keynote Kris Köhntopp gave at the FFG of the GUUG
in 2015, where he talked about restarting a huge MySQL DB that is part of
the backend of booking.com. He had the choice of either restarting the DB
the regular way, which took 10-15 minutes or so, or killing the DB process,
after which the crash recovery took only 1-2 minutes.

Having this knowledge, he said, is one thing; but the self-confidence to do
it with a good feeling only comes from the experience of having done it
routinely.

Please don't get me wrong, I'm not trying to push you into being reckless.
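For what it's worth, for such planned work on a single OSD we normally just
set the noout flag first, so that nothing gets re-balanced while the disk is
down. Roughly like this (a minimal sketch only, assuming a systemd-based
release; osd.12 is just a stand-in ID, not one from this thread):

  ceph osd set noout            # a down OSD won't be marked out, so no re-balancing starts
  systemctl stop ceph-osd@12    # on pre-systemd releases e.g.: service ceph stop osd.12
  # ... replace / re-create the disk or OSD here ...
  systemctl start ceph-osd@12
  ceph -s                       # wait until all PGs are active+clean again
  ceph osd unset noout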
Another interesting fact Kris explained was that the IT department was
equipped with a budget for loss of business due to IT unavailability, and
management only intervened when that budget was exhausted. That is also a
kind of reserve an IT administrator can work with. But having such a budget
surely depends on a corresponding management mentality.

>> In the end, what happens to your cluster when a complete node fails?
>>
> Nothing much, in fact LESS than when a single OSD fails, since it won't
> trigger re-balancing (mon_osd_down_out_subtree_limit = host).
>

Yes, but can a single OSD change trigger this in your configuration, and is
the amount of data large enough for a relevant recovery load? And you have
the same problem when you extend your cluster, don't you?

For me, a level of operation with such worries would be changing
crushmap-related things (e.g. our tunables are already on the bobtail
profile), but mainly because I have never done it. (A rough sketch for
checking these settings follows at the end of this mail.)

> Regards,
>
> Christian

Regards

Steffen

> --
> Christian Balzer           Network/Systems Engineer
> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
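PS: To check the two settings mentioned above on one's own cluster, something
like the following should do (a minimal sketch; "mon.a" is only a stand-in
for the local monitor id, and the option name should be verified against the
documentation of your release):

  ceph daemon mon.a config get mon_osd_down_out_subtree_limit  # "host": a whole down node is not marked out
  ceph osd crush show-tunables                                 # shows the active crush tunables / profile
  # switching profiles, e.g. "ceph osd crush tunables firefly", can trigger substantial data movement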