Re: Wish list : automatic rebuild with hot swap osd ?


On Wed, Oct 18, 2017 at 12:25 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 18 Oct 2017, alan somers wrote:
>> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
>> >> Hello,
>> >>
>> >> I wonder if it's possible to add to Ceph the ability to automatically rebuild an OSD on a new disk freshly added to a slot in replacement of a
>> >> failed OSD disk.
>> >>
>> >> I imagine something like adding a flag on disks (identified by-path, for example, or any other deterministic way to address the device) so they get
>> >> auto-reconfigured if they are detected with no data after the same slot was marked as failed. Only disks identified "by-path", with the auto-rebuild flag
>> >> activated, and previously marked as failed, would run the auto-reconfiguration.
>> >>
>> >> I know it will take some time to find the best way to implement that feature without zapping a disk that still holds data, but it would be a great
>> >> improvement to maintainability.
>> >
>> > The way that we approached this before with ceph-disk was that you would
>> > prelabel replacement disks as "blank ceph" or similar.  That way if we saw
>> > a Ceph-labeled disk that hadn't been used yet, we would know it was fair
>> > game.  This is a lot more work for the admin (you have to attach each of
>> > your replacement disks to mark them and then put them in the replacement
>> > pile) but it is safer.
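
A minimal sketch of what such a prelabel check could look like, assuming a
hypothetical "ceph-blank" GPT partition label and an lsblk-based probe (neither
is an existing ceph-disk convention):

    import json
    import subprocess

    BLANK_LABEL = "ceph-blank"   # hypothetical marker the admin writes when prelabeling

    def is_prelabeled_blank(dev):
        """True if every partition on `dev` carries the blank marker label."""
        out = subprocess.check_output(["lsblk", "-J", "-o", "NAME,PARTLABEL", dev])
        parts = json.loads(out)["blockdevices"][0].get("children", [])
        if not parts:
            return False   # no partitions at all: we cannot tell, stay safe
        return all(p.get("partlabel") == BLANK_LABEL for p in parts)
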
>> >
>> > An alternative model would be to have a reasonably reliable way to
>> > identify a fresh disk from the factory.  I'm honestly not sure what new
>> > disks look like these days (zero? empty NTFS partition?) but we could try
>> > to recognize "blank" and go from there.
>> >
>> > Unfortunately we can't assume blank if we see garbage because the disk
>> > might be encrypted.
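
For the "recognize blank" alternative, a conservative heuristic might be to
require both an absent partition-table signature and an all-zero leading
region, and to treat anything else (including random-looking data, which could
well be an encrypted volume) as not blank. A rough, purely illustrative sketch:

    def looks_blank(dev, probe_bytes=1024 * 1024):
        """Heuristic: no MBR/GPT signature and the first MiB is all zeros."""
        with open(dev, "rb") as f:
            head = f.read(probe_bytes)
        has_mbr = head[510:512] == b"\x55\xaa"   # MBR boot signature
        has_gpt = head[512:520] == b"EFI PART"   # GPT header, assuming 512-byte sectors
        if has_mbr or has_gpt:
            return False
        # Non-zero data with no known signature might be encrypted: refuse.
        return head.count(0) == len(head)
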
>> >
>> > Anyway, assuming that part was sorted out, I think a complete solution
>> > would query the mons to see what OSD id used to live in that particular
>> > by-path location/slot and try to re-use that OSD ID.  We have still kept a
>> > manual step here of marking the failed OSD "destroyed", since reusing the ID
>> > means the cluster may assume the PG copies on that OSD have been lost.
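
Assuming the blank detection is sorted out, that flow could be wired together
roughly as below. 'ceph osd metadata', 'ceph osd safe-to-destroy' and
'ceph osd destroy' exist in Luminous; the metadata field used for the by-path
match and the final ceph-volume invocation are assumptions, not settled
interfaces:

    import json
    import subprocess

    def ceph(*args):
        return subprocess.check_output(("ceph", "--format", "json") + args)

    def osd_id_for_slot(by_path):
        """Find the OSD whose recorded devices lived at this by-path slot."""
        for osd in json.loads(ceph("osd", "metadata")):
            if by_path in osd.get("device_paths", ""):   # field name is an assumption
                return int(osd["id"])
        return None

    def replace(by_path, new_dev):
        osd_id = osd_id_for_slot(by_path)
        if osd_id is None:
            return
        # Raises if the cluster still needs the old OSD's PG copies.
        subprocess.check_call(["ceph", "osd", "safe-to-destroy", str(osd_id)])
        subprocess.check_call(["ceph", "osd", "destroy", str(osd_id),
                               "--yes-i-really-mean-it"])
        # Redeploy on the new disk, reusing the same ID (illustrative invocation).
        subprocess.check_call(["ceph-volume", "lvm", "create",
                               "--osd-id", str(osd_id), "--data", new_dev])
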
>> >
>> > sage
>>
>> ZFS handles this with the "autoreplace" pool flag.  If that flag is
>> set, then a blank drive inserted into a slot formerly occupied by a
>> member of the pool will automatically be used to replace that former
>> member.  The nice thing about this scheme is that it requires no
>> per-drive effort on the part of the sysadmin, and it doesn't touch
>> drives inserted into other slots.  The less nice thing is that it only
>> works with SES expanders that provide slot information.  I don't like
>> Sage's second suggestion because it basically takes over the entire
>> server.  If newly inserted blank drives are instantly gobbled up by
>> Ceph, then they can't be used by anything else.  IMHO that kind of
>> greedy functionality shouldn't be built into something as general as
>> Ceph (though perhaps it could go in a separately installed daemon).
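
For reference, the slot information Alan mentions is exposed on Linux through
the SES enclosure class in sysfs, so a per-slot (rather than per-device-name)
mapping could be derived along these lines; the exact layout varies by
enclosure, so treat it as a sketch:

    import glob

    def slot_map():
        """Map (enclosure, slot) -> kernel block device, via SES data in sysfs."""
        mapping = {}
        for path in glob.glob("/sys/class/enclosure/*/*/device/block/*"):
            parts = path.split("/")
            enclosure, slot, dev = parts[4], parts[5], parts[-1]
            mapping[(enclosure, slot)] = dev
        return mapping
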
>
> I think in our case it would be a separate daemon either way that is
> monitoring slots and reprovisioning OSDs when appropriate.  I like this
> approach!  I think it would involve:
>
> - A cluster option, not a pool one, since Ceph pools are about
> logical data collections, not hardware.
>
> - A clearer mapping between OSDs and the block device(s) they consume, and
> some additional metadata on those devices (in this case, the
> /dev/disk/by-path string ought to suffice).

I am hesitant to rely on by-path here; those strings can change if devices
change ports. While testing ceph-volume to make it resilient to device name
changes, we weren't able to rely on by-path. This is easy to reproduce on a
VM by changing the port number where the disk is plugged in.

Unless we are describing a scenario where the ports never change, the
system never reboots, and the workflow is always a strict
bad-one-out / good-one-in swap?
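
To illustrate the concern: the same physical disk shows up under several
/dev/disk/by-* namespaces, and only some of them survive a port change. A quick
sketch to see what a given device currently resolves to (assuming the standard
udev-managed symlinks):

    import glob
    import os

    def identities(dev):
        """List every /dev/disk/by-* symlink pointing at `dev`, grouped by namespace."""
        real = os.path.realpath(dev)
        links = {}
        for link in glob.glob("/dev/disk/by-*/*"):
            if os.path.realpath(link) == real:
                links.setdefault(link.split("/")[3], []).append(link)
        return links

    # e.g. identities("/dev/sdb") -> {"by-id": [...], "by-path": [...], ...};
    # the by-path entries change with the port, by-id generally does not.
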

> I know John has done some
> work here but I think things are still a bit ad hoc.  For example, the osd
> metadata reporting for bluestore devices is pretty unstructured.  (We also
> want a clearer device list and properties for devices for the SMART
> data reporting.)

If these properties also include device information (vendor, size,
solid/rotational, etc.) it could help to better detect and map an OSD
replacement, since clusters tend to have fairly homogeneous hardware:
if $brand, and $size, and $rotational, etc., match the old device.
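
A sketch of what collecting those properties from sysfs could look like, so a
candidate disk can be matched against the profile of the one it replaces (the
sysfs attributes are standard; the matching policy is just an example):

    def device_profile(name):
        """Read vendor/model/size/rotational for a block device like 'sdb'."""
        def attr(path):
            try:
                with open("/sys/block/%s/%s" % (name, path)) as f:
                    return f.read().strip()
            except OSError:
                return None
        return {
            "vendor": attr("device/vendor"),
            "model": attr("device/model"),
            "rotational": attr("queue/rotational") == "1",
            "size_bytes": int(attr("size") or 0) * 512,   # 'size' is in 512-byte sectors
        }

    def matches(old, new, size_slack=0.01):
        """Example policy: same vendor/model/rotational and size within 1%."""
        return ((old["vendor"], old["model"], old["rotational"]) ==
                (new["vendor"], new["model"], new["rotational"]) and
                abs(old["size_bytes"] - new["size_bytes"]) <= old["size_bytes"] * size_slack)
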

>
> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
> tool that is triggered by udev.  It would check for new, empty devices
> appearing in the locations (as defined by the by-path string) previously
> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
> safe-to-destroy' to verify whether it is safe to automatically rebuild
> that OSD.  (If not, it might want to raise a health alert, since it's
> possible the drive that was physically pulled should be preserved until
> the cluster is sure it doesn't need it.)
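
Putting the pieces together, the core of such a daemon (or udev/systemd-triggered
tool) might look something like this; 'ceph osd safe-to-destroy' is real, while
osd_id_for_slot, looks_blank and replace refer to the illustrative sketches above:

    import logging
    import subprocess

    def handle_new_device(by_path, dev):
        """Called when an empty disk appears in a slot whose OSD is down."""
        osd_id = osd_id_for_slot(by_path)            # sketched earlier
        if osd_id is None or not looks_blank(dev):   # sketched earlier
            return
        if subprocess.call(["ceph", "osd", "safe-to-destroy", str(osd_id)]) != 0:
            # Not safe yet: the pulled drive may still hold needed PG copies.
            # A health alert would fit here; the sketch just logs and retries later.
            logging.warning("osd.%d not yet safe to destroy; leaving %s untouched",
                            osd_id, dev)
            return
        replace(by_path, dev)                        # destroy + redeploy, sketched earlier
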

systemd has some support for devices, so we might not even need a daemon,
just a unit that can depend on events already handled by systemd (which
would save us from dealing with udev directly).

>
> ?
>
> sage
>


