Replace OSD drive without remove/re-add OSD

Hi Indra,

On 04/05/2014 06:11, Indra Pramana wrote:
> Would like to share that after trying this yesterday, it doesn't work:
>
> > - ceph osd set noout
> > - sudo stop ceph-osd id=12
> > - Replace the drive, and once done:
> > - sudo start ceph-osd id=12
You say a few lines further down that a new OSD number is assigned, so
there is a typo here: you would use "sudo start ceph-osd id=new_id" (13,
for example). I don't see how you could start the new OSD with the
number of the old one (unless you removed it beforehand, please clarify).
> > - ceph osd unset noout
>
> > Once the drive is replaced, we need to ceph-deploy zap and prepare the
> > new drive, and a new OSD number will be assigned. Remapping will start
> > immediately after the new OSD is in the cluster.
I am not a guru, but I think this is expected behaviour: noout only
prevents OSDs from being marked out; here a new one is added to the
cluster, so a remapping begins.
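
For reference, the zap-and-prepare step you describe would be something
like the following (hostname and device are only examples, adjust to
your setup):

    # wipe the replacement drive and prepare it as a new OSD
    ceph-deploy disk zap node1:/dev/sdb
    ceph-deploy osd prepare node1:/dev/sdb

ceph-deploy then creates the OSD with the next free id, which is why you
end up with a new OSD number.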

One thing that comes to mind is that, at the moment of the remapping,
you still have the old OSD marked IN+DOWN, so the remapping is maybe
computed with all the OSDs marked IN, and a new remapping is computed
right after you remove the old OSD.
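
You can check that state before removing anything, for example (osd.12
being the old OSD id from your example):

    ceph osd tree | grep osd.12   # should show the old OSD as down
    ceph osd dump | grep osd.12   # but it is still listed as in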

Maybe the noin and nobackfill flags could play well here, in order to
*freeze* all actions related to OSD topology changes until everything
has been done, so I would try the following (commands sketched after
the list):

- set noout + noin and/or nobackfill (maybe noin would suffice; note
that the cluster is running with reduced redundancy at this time ...)
- stop old OSD
- remove old OSD
- add new OSD
- unset flags
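
A rough sketch of what that could look like (the OSD id, hostname and
device are only examples):

    ceph osd set noout
    ceph osd set noin              # and/or: ceph osd set nobackfill
    sudo stop ceph-osd id=12       # stop the old OSD
    ceph osd crush remove osd.12   # remove the old OSD from the cluster
    ceph auth del osd.12
    ceph osd rm 12
    ceph-deploy disk zap node1:/dev/sdb       # replacement drive
    ceph-deploy osd prepare node1:/dev/sdb    # new OSD gets a new id
    ceph osd unset noin            # and/or: ceph osd unset nobackfill
    ceph osd unset noout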

Cheers

> Then we can safely remove the old OSD and unset noout, and wait for
> recovery to complete.
>
> However, setting noout in the first place indeed helped to prevent
> remapping from taking place when we stopped the OSD and replaced the
> drive. So it's advisable to use this feature when replacing a drive --
> unless the drive has already failed and the OSD is already down in the
> first place.
>
> Thank you.
>
>
>
> On Sat, May 3, 2014 at 5:51 PM, Andrey Korolyov <andrey at xdel.ru> wrote:
>
>     On Sat, May 3, 2014 at 4:01 AM, Indra Pramana <indra at sg.or.id> wrote:
>     > Sorry forgot to cc the list.
>     >
>     > On 3 May 2014 08:00, "Indra Pramana" <indra at sg.or.id> wrote:
>     >>
>     >> Hi Andrey,
>     >>
>     >> I actually wanted to try this (instead of removing and re-adding the
>     >> OSD) to avoid remapping of PGs to other OSDs and the unnecessary I/O
>     >> load.
>     >>
>     >> Are you saying that doing this will also trigger remapping? I thought
>     >> it would just do recovery to replace the missing PGs resulting from
>     >> the drive replacement?
>     >>
>     >> Thank you.
>     >>
>
>     Yes, remapping will take place, though it is a bit counterintuitive,
>     and I suspect the roots are the same as with the double data placement
>     recalculation in the out + rm procedure. Actually, Inktank people may
>     answer the question with more details, I suppose. I also think that
>     preserving the collections may eliminate the remap during such a
>     refill, though it is not a trivial thing to do and I have not
>     experimented with this.
>
>     >> On 2 May 2014 21:02, "Andrey Korolyov" <andrey at xdel.ru> wrote:
>     >>>
>     >>> On 05/02/2014 03:27 PM, Indra Pramana wrote:
>     >>> > Hi,
>     >>> >
>     >>> > May I know if it's possible to replace an OSD drive without
>     >>> > removing / re-adding back the OSD? I want to avoid the time and
>     >>> > the excessive I/O load which will happen during the recovery
>     >>> > process at the time when:
>     >>> >
>     >>> > - the OSD is removed; and
>     >>> > - the OSD is being put back into the cluster.
>     >>> >
>     >>> > I read David Zafman's comment on this thread, that we can set
>     >>> > "noout", take the OSD "down", replace the drive, and then bring
>     >>> > the OSD back "up" and unset "noout".
>     >>> >
>     >>> > http://www.spinics.net/lists/ceph-users/msg05959.html
>     >>> >
>     >>> > May I know if it's possible to do this?
>     >>> >
>     >>> > - ceph osd set noout
>     >>> > - sudo stop ceph-osd id=12
>     >>> > - Replace the drive, and once done:
>     >>> > - sudo start ceph-osd id=12
>     >>> > - ceph osd unset noout
>     >>> >
>     >>> > The cluster was built using ceph-deploy; can we just replace a
>     >>> > drive like that without zapping and preparing the disk using
>     >>> > ceph-deploy?
>     >>> >
>     >>>
>     >>> There will be absolutely no quirks except continuous remapping with
>     >>> peering along the entire recovery process. If your cluster can
>     >>> handle this well, there is absolutely no problem going through this
>     >>> flow. Otherwise, with the longer out+in flow, there are only two
>     >>> short intensive recalculations which can be done at a scheduled
>     >>> time, compared with the peering during the remap, which can
>     >>> introduce unnecessary I/O spikes.
>     >>>
>     >>> > Looking forward to your reply, thank you.
>     >>> >
>     >>> > Cheers.
>     >>> >
>     >>> >
>     >>>
>     >
>
>
>
>

-- 
Cédric


