Re: mark out vs crush weight 0

Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> · Thu, 19 May 2016 13:26:33 +0200

Hi,

A spare disk is a nice idea.

But I think that's something you can also do with a shell script:
check whether an OSD is down or out, and if so, bring your spare disk
into service.

Maybe the programming resources should not be spent on something most
of us can do with a simple shell script that checks the situation every
5 seconds.
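
A minimal sketch of such a check, written in Python rather than shell so
the logic is easy to test. It assumes that `ceph osd dump --format json`
returns an "osds" list whose entries carry "osd", "up" and "in" fields;
please verify that against your release. The on_failure callback is a
hypothetical hook where the spare disk would be activated.

```python
import json
import subprocess
import time


def find_failed_osds(osd_dump):
    """Return the IDs of OSDs that are down or out in a parsed OSD dump."""
    return sorted(
        osd["osd"]
        for osd in osd_dump.get("osds", [])
        if osd.get("up") == 0 or osd.get("in") == 0
    )


def read_osd_dump():
    """Fetch the current OSD map as JSON (assumed field names, see above)."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)


def watch(on_failure, interval=5.0):
    """Poll every `interval` seconds and pass failed OSD IDs to a callback."""
    while True:
        failed = find_failed_osds(read_osd_dump())
        if failed:
            on_failure(failed)  # hypothetical hook: activate the spare here
        time.sleep(interval)
```

Calling watch(print) would simply print the list of down or out OSD IDs
every 5 seconds for as long as any exist.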

----

Maybe a better idea (in my humble opinion) is to address this by
optimizing the code for recovery situations.

Currently we have things like

client-op-priority,
recovery-op-priority,
max-backfills,
recovery-max-active and so on

to limit the performance impact in a recovery situation.
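
For reference, a small sketch of how these knobs are typically changed at
runtime, assuming the option names osd-max-backfills,
osd-recovery-max-active, osd-recovery-op-priority and
osd-client-op-priority and the `ceph tell ... injectargs` command. The
values are only illustrative, not a recommendation.

```python
import shlex

# Conservative values, chosen only for illustration.
THROTTLE = {
    "osd-max-backfills": 1,
    "osd-recovery-max-active": 1,
    "osd-recovery-op-priority": 1,
    "osd-client-op-priority": 63,
}


def injectargs_command(settings, target="osd.*"):
    """Build an argv list for `ceph tell <target> injectargs '<options>'`."""
    args = " ".join(f"--{name} {value}" for name, value in settings.items())
    return ["ceph", "tell", target, "injectargs", args]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(injectargs_command(THROTTLE)))
```

The function only builds the command; passing the returned list to
subprocess.run would apply it to all OSDs.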

And still, during recovery, performance goes downhill (a lot) when all
OSDs start to refill the to_be_recovered OSD.

In my case, I was removing old HDDs from a cluster.

If I mark them down/out (6 TB drives, 40-50% full), the cluster's
performance drops dramatically. So I had to reduce the weight in steps
of 0.1 to ease the pain, but could not eliminate it completely.
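
The stepwise reduction can be sketched as follows. This assumes the
weight in question is the 0..1 override weight set with
`ceph osd reweight <id> <weight>`; the helper names are made up for this
example.

```python
def reweight_steps(start=1.0, stop=0.0, step=0.1):
    """Yield weights from one step below `start` down to `stop`."""
    # Count in whole steps to avoid accumulating floating-point error.
    n_start = round(start / step)
    n_stop = round(stop / step)
    for n in range(n_start - 1, n_stop - 1, -1):
        yield round(n * step, 4)


def reweight_commands(osd_id, **kwargs):
    """Return one `ceph osd reweight` argv list per step for this OSD."""
    return [
        ["ceph", "osd", "reweight", str(osd_id), f"{w:g}"]
        for w in reweight_steps(**kwargs)
    ]
```

In practice one would run a single command, wait until the cluster has
finished moving data, and only then run the next one.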

So I think the tools / code that protect the cluster's performance (even
in a recovery situation) can be improved.

Of course, on the one hand, we want the configured number of replicas,
and with it data safety, restored as soon as possible.

But on the other hand, it does not help much if the recovery procedure
degrades the cluster's performance to the point where it is barely
usable.

So maybe introduce another config option to control this ratio?

One that controls more directly how much IOPS/bandwidth recovery may use
(maybe straight in numbers, in the form of an I/O rate limit), so that
administrators can configure the "perfect" settings for their individual
use case and hardware environment.

Because right now, when I reduce the weight of a 6 TB HDD from 1.0 to
0.9 in a cluster of ~30 OSDs, around 3-5% of the data is moved around
the cluster (replication 2).

While it's moving, there is a real performance hit on the virtual servers.

So if this could be solved with an IOPS/HDD bandwidth rate limit, so
that I can simply tell the cluster to use at most 10 IOPS and/or 10 MB/s
for recovery, I think it would be a great help for any use case and
administrator.
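
To illustrate what I mean, here is a sketch of a token-bucket limiter,
one common way to implement such a cap. This is not existing Ceph code;
the class, the function and the limit names are hypothetical, and the 10
IOPS / 10 MB/s figures are just the ones from above.

```python
import time


class TokenBucket:
    """Allow at most `rate` units per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, amount=1.0):
        """Take `amount` tokens if available; False means the caller must wait."""
        self._refill()
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False


def make_recovery_limits(clock=time.monotonic):
    """Hypothetical per-OSD recovery limits: 10 IOPS and 10 MB/s."""
    mb = 1024 * 1024
    return {
        "iops": TokenBucket(rate=10, capacity=10, clock=clock),
        "bytes": TokenBucket(rate=10 * mb, capacity=10 * mb, clock=clock),
    }
```

A recovery operation would go ahead only if both buckets grant it, and
would otherwise be delayed, which leaves the remaining disk time to
client I/O.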

Thanks !

-- 
Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx


On 19.05.2016 at 04:57, Christian Balzer wrote:
> 
> Hello Sage,
> 
> On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote:
> 
>> Currently, after an OSD has been down for 5 minutes, we mark the OSD 
>> "out", which redistributes the data to other OSDs in the cluster.  If the 
>> OSD comes back up, it marks the OSD back in (with the same reweight
>> value, usually 1.0).
>>
>> The good thing about marking OSDs out is that exactly the amount of data 
>> on the OSD moves.  (Well, pretty close.)  It is uniformly distributed 
>> across all other devices.
>>
> Others have commented already on how to improve your initial suggestion
> (retaining CRUSH weights) etc.
> Let me butt in here with an even more invasive but impact reducing
> suggestion.
> 
> Your "good thing" up there is good as far as total data movement goes, but
> it still can unduly impact client performance when one OSD becomes both
> the target and source of data movement at the same time during
> backfill/recovery. 
> 
> So how about upping the ante with the (of course optional) concept of a
> "spare OSD" per node?
> People are already used to the concept, it also makes a full cluster
> situation massively more unlikely. 
> 
> So expanding on the concept below, let's say we have one spare OSD per node
> by default. 
> It's on a disk of the same size or larger than all the other OSDs in the
> node, it is fully prepared but has no ID yet. 
> 
> So we're experiencing an OSD failure and it's about to be set out by the
> MON, let's consider this sequence (OSD X is the dead one, S the spare):
> 
> 1. Set nobackfill/norecover
> 2. OSD X gets weighted 0
> 3. OSD X gets set out
> 4. OSD S gets activated with the original weight of X and its ID.
> 5. Unset nobackfill/norecover
> 
> Now data will flow only to the new OSD, other OSDs will not be subject to
> simultaneous reads and writes by backfills. 
> 
> Of course in case there is no spare available (not replaced yet or
> multiple OSD failures), Ceph can go ahead and do its usual thing,
> hopefully enhanced by the logic below.
> 
> Alternatively, instead of just limiting the number of backfills per OSD,
> make them directionally aware, that is, don't allow concurrent read and
> write backfills on the same OSD.
> 
> Regards,
> 
> Christian
>> The bad thing is that if the OSD really is dead, and you remove it from 
>> the cluster, or replace it and recreate the new OSD with a new OSD id, 
>> there is a second data migration that sucks data out of the part of the 
>> crush tree where the removed OSD was.  This move is non-optimal: if the 
>> drive is size X, some data "moves" from the dead OSD to other N OSDs on 
>> the host (X/N to each), and the same amount of data (X) moves off the
>> host (uniformly coming from all N+1 drives it used to live on).  The
>> same thing happens at the layer up: some data will move from the host to
>> peer hosts in the rack, and the same amount will move out of the rack.
>> This is a byproduct of CRUSH's hierarchical placement.
>>
>> If the lifecycle is to let drives fail, mark them out, and leave them 
>> there forever in the 'out' state, then the current behavior is fine, 
>> although over time you'll have lots of dead+out osds that slow things
>> down marginally.
>>
>> If the procedure is to replace dead OSDs and re-use the same OSD id,
>> then this also works fine.  Unfortunately the tools don't make this easy
>> (that I know of).
>>
>> But if the procedure is to remove dead OSDs, or to remove dead OSDs and 
>> recreate new OSDs in their place, probably with a fresh OSD id, then you 
>> get this extra movement.  In that case, I'm wondering if we should allow 
>> the mons to *instead* set the crush weight to 0 after the osd is down for 
>> too long.  For that to work we need to set a flag so that if the OSD
>> comes back up it'll restore the old crush weight (or more likely make
>> the normal osd startup crush location update do so with the OSD's
>> advertised capacity).  Is it sensible?
>>
>> And/or, anybody have a good idea how the tools can/should be changed to 
>> make the osd replacement re-use the osd id?
>>
>> sage
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 