Re: new OSD re-using old OSD id fails to boot

On Wed, 9 Dec 2015, David Zafman wrote:
> On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:
> > Hi Loic,
> > 
> > I tried to reproduce this problem on my CentOS 7 box, but I could not
> > hit the same issue.
> > This is my version:
> > ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
> > Could you describe it in more detail?
> > 
> > 
> > Hi David, Sage,
> > 
> > Most of the time, by the time we notice an OSD failure, the OSD is
> > already in the `out` state.
> > We cannot avoid the redundant data movement unless we can set noout on
> > the osd before the failure.
> > Is that right? (That is, once an OSD goes into the `out` state, it
> > triggers some redundant data movement.)
> Yes.  One case would be setting the noout flag during the 5 minute down
> window after an OSD disk failure, if a spare disk is available.  Another
> scenario would be a bad SMART status or EIO errors from a disk prompting a
> replacement.  So if a spare disk is already installed or you have
> hot-swappable drives, it would be nice to replace the drive and let recovery
> copy back all the data that should be there.  Using noout would be critical
> to this effort.
> 
> I don't understand why Sage suggests below that a down+out phase would be
> required during the replacement.

Hmm, I wasn't thinking about a hot spare scenario.  We've always assumed
that there is no point to hot spares--you may as well have them
participating in the cluster, doing useful work, and let the rebalancing
after a failure be distributed across all disks (and not hammer the
replacement).

sage
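
A minimal shell sketch of the noout-based swap described above (these
commands exist today; the re-provisioning step and the timing are only
placeholders):

   # Prevent down OSDs from being marked out while the drive is swapped,
   # so the cluster does not start redundant data movement.
   ceph osd set noout

   # ... physically replace the failed drive and re-provision the OSD on
   # the new device with the usual tooling ...

   # Once the replacement OSD is back up and in, allow out-marking again.
   ceph osd unset noout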


> > 
> > Could we support the traditional hot-spare behavior? (Keep some disks
> > as standby and automatically replace the broken device.)
> > 
> > That would let us replace the failed OSD before it goes into the `out`
> > state.  Or should we always set noout on the OSD?
> > 
> > In fact, I think David and Loic are describing two different problems.
> > (The two problems are equally important. :p)
> > 
> > If you have any problems, feel free to let me know.
> > 
> > thanks!!
> > vicente
> > 
> > 
> > 2015-12-09 10:50 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> > > On Tue, 8 Dec 2015, David Zafman wrote:
> > > > Remember I really think we want a disk replacement feature that would
> > > > retain
> > > > the OSD id so that it avoids unnecessary data movement.  See tracker
> > > > http://tracker.ceph.com/issues/13732
> > > Yeah, I totally agree.  We just need to form an opinion on how... probably
> > > starting with the user experience.  Ideally we'd go from up + in to down +
> > > in to down + out, then pull the drive and replace, and then initialize a
> Here ^^^^^^^^^^^^
> > > new OSD with the same id... and journal partition.  Something like
> > > 
> > >    ceph-disk recreate id=N uuid=U <osd device path>
> > > 
> > > I.e., it could use the uuid (which the cluster has in the OSDMap) to find
> > > (and re-use) the journal device.
> > > 
> > > For a journal failure it'd probably be different... but maybe not?
> > > 
> > > Any other ideas?
> > > 
> > > sage
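
A rough sketch of the uuid lookup mentioned above (osd.5 and /dev/sdX are
placeholders, and "ceph-disk recreate" is the proposed command, not an
existing one):

   # The OSDMap records each OSD's uuid; it is shown at the end of the
   # per-OSD lines in "ceph osd dump".
   ceph osd dump | grep '^osd.5 '

   # Proposed (not implemented yet): re-initialize the replacement drive
   # with the same id and uuid so the journal partition can be found and
   # re-used.
   #   ceph-disk recreate id=5 uuid=<uuid from above> /dev/sdX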


