Re: mark out vs crush weight 0

>> On 16-05-18 14:23, Sage Weil wrote:
>> Currently, after an OSD has been down for 5 minutes, we mark the OSD
>> "out", whic redistributes the data to other OSDs in the cluster.  If the
>> OSD comes back up, it marks the OSD back in (with the same reweight value,
>> usually 1.0).
>>
>> The good thing about marking OSDs out is that exactly the amount of data
>> on the OSD moves.  (Well, pretty close.)  It is uniformly distributed
>> across all other devices.
>>
>> The bad thing is that if the OSD really is dead, and you remove it from
>> the cluster, or replace it and recreate the new OSD with a new OSD id,
>> there is a second data migration that sucks data out of the part of the
>> crush tree where the removed OSD was.  This move is non-optimal: if the
>> drive is size X, some data "moves" from the dead OSD to other N OSDs on
>> the host (X/N to each), and the same amount of data (X) moves off the host
>> (uniformly coming from all N+1 drives it used to live on).  The same thing
>> happens at the layer up: some data will move from the host to peer hosts
>> in the rack, and the same amount will move out of the rack.  This is a
>> byproduct of CRUSH's hierarchical placement.
>>
>> If the lifecycle is to let drives fail, mark them out, and leave them
>> there forever in the 'out' state, then the current behavior is fine,
>> although over time you'll have lots of dead+out osds that slow things down
>> marginally.
>>
>> If the procedure is to replace dead OSDs and re-use the same OSD id, then
>> this also works fine.  Unfortunately the tools don't make this easy (that
>> I know of).
>>
>> But if the procedure is to remove dead OSDs, or to remove dead OSDs and
>> recreate new OSDs in their place, probably with a fresh OSD id, then you
>> get this extra movement.  In that case, I'm wondering if we should allow
>> the mons to *instead* set the crush weight to 0 after the osd is down for
>> too long.  For that to work we need to set a flag so that if the OSD comes
>> back up it'll restore the old crush weight (or more likely make the
>> normal osd startup crush location update do so with the OSD's advertised
>> capacity).  Is it sensible?
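
To put rough, made-up numbers on the extra movement described above: say the dead drive held X = 4 TB on a host with N = 5 surviving osds. Removing it from the crush map "moves" about X/N = 0.8 TB to each of those 5 osds (4 TB total within the host) and moves another ~4 TB off the host entirely, spread roughly evenly across the drives it used to live on; that's roughly 8 TB of movement where you might expect 4 TB, and the same doubling shows up again at the rack level.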

I love the idea of automatically weighting an osd to 0.0 when it is automatically marked out.

Setting the weight back to the default, i.e. its advertised capacity, would cause data movement in any cluster where people have modified their crush map to balance data placement.  This would impact us fairly negatively: we have a lot of PGs with 0% of the cluster data in them, but we have weighted our maps to account for this and maintain a proper weighting for our clusters.

I like your recommendation of reweighting the osd back to what it was before it was set to 0.0.  I suspect you leaned toward the default weight on the way back because it would be much simpler.  A good way to handle restoring the original weight (and to keep it scalable when multiple osds go down) would be to have the mons keep the osd map from right before they reweight an osd to 0.0.  You can generate a crush map from that osd map and use it to find the osd's old weight.
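
For what it's worth, that lookup is already possible by hand today if you have the old osd map saved to a file; a rough sketch (the osd id 17 and the weight shown are made-up examples):

    # pull the crush map out of the saved osd map and decompile it to text
    osdmaptool osdmap.old --export-crush crush.old
    crushtool -d crush.old -o crush.old.txt
    # the old weight is in the bucket entries, e.g. "item osd.17 weight 3.640"
    grep 'item osd.17 ' crush.old.txt
    # and can be restored once the osd is back up
    ceph osd crush reweight osd.17 3.640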

I don't think this will ever bloat things too much, because it only means holding on to one old map for every automatically-outed osd that hasn't been removed from the crush map yet.  The old maps would be deleted when you remove the osd from the crush map or it checks back in.

The osd map already records the epoch in which an osd was marked out, so you can easily check the current osd map to know which archived osd map to look in for the osd's weight before bringing it back in.
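
And the mons will already hand back an old full map by epoch, as long as they still have it, which covers the "which archived map" part; for example (the epoch and osd id are made up):

    # the per-osd line in the current map shows the epochs recorded for it
    ceph osd dump | grep '^osd.17 '
    # fetch the full map at an old epoch, if the mons still have it
    ceph osd getmap 12300 -o osdmap.12300
    # then run osdmap.12300 through the osdmaptool/crushtool steps above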

>>
>> And/or, anybody have a good idea how the tools can/should be changed to
>> make the osd replacement re-use the osd id?
>>
>> sage
> maybe something like "ceph-disk prepare /dev/sdX --replace=<old-osd>",
> which would remove the old osd and set up a new one in its place. I am just
> not sure whether the bootstrap-osd permissions would be enough for that.
> ceph-deploy could have something similar.

What I would really be interested in is some more attention to crushtool and osdmaptool, as we use those to reweight our maps offline, without moving any data around, until we have a new map that we are ready to use.  Right now we have only found a way to use these tools effectively on a cluster that already has all PGs in their proper place, and they can't be used in conjunction with adding or removing osds, so that the cluster would end up fully balanced once it finishes backfilling onto the new osds or after we shrink a cluster.
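
Roughly the kind of offline loop I mean, with standard tooling (the pool id and replica count below are just examples):

    # snapshot the current osd map
    ceph osd getmap -o osdmap.bin
    # pull out the crush map and decompile it for editing
    osdmaptool osdmap.bin --export-crush crush.bin
    crushtool -d crush.bin -o crush.txt
    # ... edit the weights in crush.txt by hand or with a script ...
    crushtool -c crush.txt -o crush.new
    # sanity-check the utilization the edited map would give for 3 replicas
    crushtool -i crush.new --test --show-utilization --num-rep 3
    # check pg placement against an offline copy of the osd map
    osdmaptool osdmap.bin --import-crush crush.new
    osdmaptool osdmap.bin --test-map-pgs --pool 1
    # only when happy, push it to the live cluster
    ceph osd setcrushmap -i crush.new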


