>>> Christian Balzer <chibi@xxxxxxx> wrote on Thursday, 14 April 2016 at 17:00:

> Hello,
>
> [reduced to ceph-users]
>
> On Thu, 14 Apr 2016 11:43:07 +0200 Steffen Weißgerber wrote:
>
>> >>> Christian Balzer <chibi@xxxxxxx> wrote on Tuesday, 12 April 2016
>> >>> at 01:39:
>>
>> > Hello,
>> >
>>
>> Hi,
>>
>> > I'm officially only allowed to do (preventative) maintenance during
>> > weekend nights on our main production cluster.
>> > That would mean 13 ruined weekends at the realistic rate of 1 OSD per
>> > night, so you can see where my lack of enthusiasm for OSD recreation
>> > comes from.
>> >
>>
>> I'm wondering a lot about that. We introduced Ceph for VMs on RBD precisely
>> so we would not have to move maintenance into the night shift.
>>
> This is Japan.
> It makes the most anal-retentive people/rules in "der alten Heimat" (the old
> homeland) look like a bunch of hippies on drugs.
>
> Note the "preventative", and I should have put "officially" in quotes, like
> that.
>
> I can do whatever I feel comfortable with on our other production cluster,
> since there aren't hundreds of customers with very, VERY tight SLAs on it.
>
> So if I were to tell my boss that I want to renew all OSDs he'd say "Sure,
> but at a time when, if anything goes wrong, it will not impact any customer
> unexpectedly", meaning the official maintenance windows...
>

For "all OSDs" (at the same time) I would agree. But when we talk about
changing them one by one, what is the effect on a cluster of x OSDs spread
over y nodes ... Hmm.

>> My understanding of Ceph is that it was also made to be reliable storage
>> in case of hardware failure.
>>
> Reliable, yes. With certain limitations, see below.
>
>> So what's the difference, in effect for the end user, between maintaining
>> an OSD and its failure? In both cases it should be none.
>>
> Ideally, yes.
> Note that an OSD failure can result in slow I/O (to the point of what
> would be considered a service interruption) depending on the failure mode
> and the various timeout settings.
>
> So planned and properly executed maintenance has less impact.
> None (or at least nothing noticeable) IF your cluster has enough resources
> and/or all the tuning has been done correctly.
>
>> Maintaining OSDs should be routine, so that you're confident your
>> application stays safe while hardware fails within the unused reserve
>> one has configured.
>>
> IO is a very fickle beast; it may perform splendidly at 2000 ops/s just to
> go totally down the drain at 2100.
> Knowing your capacity and reserve isn't straightforward, especially not in
> a live environment as compared to synthetic tests.
>
> In short, could that cluster (now, after upgrades and adding a cache tier)
> handle OSD renewals at any given time?
> Absolutely.
> Will I get an official blessing to do so?
> No effing way.
>

Understood. A setup with cache tiering is more complex than plain OSDs with
journals on SSD.

But that reminds me of a keynote Kris Köhntopp gave at the FFG of the GUUG
in 2015, where he talked about restarting a huge MySQL DB that is part of
the backend of booking.com. He had the choice of either restarting the DB
the regular way, which took 10-15 minutes or so, or killing the DB process,
after which the crash recovery took only 1-2 minutes.

Having this knowledge, he said, is one thing; but the self-confidence to do
it with a good feeling only comes from the experience of having done it
routinely.

Please don't get me wrong, I'm not trying to push you into being reckless.
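For what it's worth, for such planned work on a single OSD we normally just
set the noout flag first, so that nothing gets re-balanced while the disk is
down. Roughly like this (a minimal sketch only, assuming a systemd-based
release; osd.12 is just a stand-in ID, not one from this thread):

  ceph osd set noout            # a down OSD won't be marked out, so no re-balancing starts
  systemctl stop ceph-osd@12    # on pre-systemd releases e.g.: service ceph stop osd.12
  # ... replace / re-create the disk or OSD here ...
  systemctl start ceph-osd@12
  ceph -s                       # wait until all PGs are active+clean again
  ceph osd unset noout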
Another interesting fact Kris explained was that the IT department was
equipped with a budget for loss of business due to IT unavailability, and
management only intervened when that budget was exhausted. That is also a
kind of reserve an IT administrator can work with. But having such a budget
surely depends on a corresponding management mentality.

>> In the end, what happens to your cluster when a complete node fails?
>>
> Nothing much, in fact LESS than when a single OSD fails, since it won't
> trigger re-balancing (mon_osd_down_out_subtree_limit = host).
>

Yes, but can a single OSD change trigger this in your configuration, and is
the amount of data large enough for a relevant recovery load? And you have
the same problem when you extend your cluster, don't you?

For me, a level of operation with such worries would be changing
crushmap-related things (e.g. our tunables are already on the bobtail
profile), but mainly because I have never done it. (A rough sketch for
checking these settings follows at the end of this mail.)

> Regards,
>
> Christian

Regards

Steffen

> --
> Christian Balzer           Network/Systems Engineer
> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
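PS: To check the two settings mentioned above on one's own cluster, something
like the following should do (a minimal sketch; "mon.a" is only a stand-in
for the local monitor id, and the option name should be verified against the
documentation of your release):

  ceph daemon mon.a config get mon_osd_down_out_subtree_limit  # "host": a whole down node is not marked out
  ceph osd crush show-tunables                                 # shows the active crush tunables / profile
  # switching profiles, e.g. "ceph osd crush tunables firefly", can trigger substantial data movement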