Re: Best practice with 0.48.2 to take a node into maintenance

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 3 Dec 2012 12:22:06 -0800

On Mon, Dec 3, 2012 at 12:13 PM, Oliver Francke <Oliver.Francke@xxxxxxxx> wrote:
> With all the BIOS, POST, and RAID-controller checks, the Linux boot, openvswitch STP setup and so on, one can imagine that a reboot takes a "couple of minutes". Normally, with our setup, the cluster detects the outage after 30 seconds and starts to do its work.
> Everything's fine, but perhaps we could avoid the big load on the cluster from remapping and re-remapping (theme: "slow requests"). So I have to ask, in terms of QoS, for a "better way" ;)
> All that stuff had a big customer impact in the past… Time to ask.

If you know you're going to be doing maintenance that might take a
while, and are going to be closely monitoring your cluster for issues,
it might be appropriate to do:
    ceph osd set noout

Which will prevent any OSDs from being marked "out", and thus prevent
any migrations or backfills. You can turn it off again with
    ceph osd unset noout

Of course, this means Ceph won't do any re-replication, so
you'll need to step up your manual monitoring to compensate!
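
The two commands above can be wrapped into a small script so the flag is
not forgotten after the reboot. This is only a sketch under assumptions:
the helper names (run, begin_maintenance, end_maintenance) and the
DRY_RUN switch are hypothetical and not part of Ceph; only
"ceph osd set noout" and "ceph osd unset noout" come from this thread.

```shell
#!/bin/sh
# Sketch of a maintenance window around a node reboot.
# DRY_RUN is a hypothetical switch (not a Ceph feature): when set to 1,
# commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

begin_maintenance() {
    # Keep OSDs from being marked "out" while the node is down,
    # so no migrations or backfills are triggered.
    run ceph osd set noout
}

end_maintenance() {
    # Restore normal behaviour once the node's OSDs have rejoined.
    run ceph osd unset noout
}
```

One possible usage: call begin_maintenance, reboot the node, watch the
cluster state by hand (for example with "ceph health") until the OSDs
are back, then call end_maintenance. While the flag is set nothing is
re-replicated, so the window should be kept as short as possible.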
-Greg