Re: Wish list : automatic rebuild with hot swap osd ?

On Wed, Oct 18, 2017 at 5:25 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 18 Oct 2017, alan somers wrote:
>> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
>> >> Hello,
>> >>
>> >> I wonder if it would be possible to add to Ceph the ability to
>> >> automatically rebuild onto a new disk that has been freshly added to a
>> >> slot in place of a failed OSD disk.
>> >>
>> >> I imagine something like adding a flag on disks (identified by-path,
>> >> for example, or via some other deterministic way of addressing the
>> >> device) so that a disk detected with no data, in a slot whose previous
>> >> occupant was marked as failed, is auto-reconfigured.  Only disks
>> >> identified "by-path", with the auto-rebuild flag activated and whose
>> >> slot was previously marked as failed, would trigger the
>> >> auto-reconfiguration.
>> >>
>> >> I know it may take some time to find the best way to implement this
>> >> feature without risking zapping a disk that still holds data, but it
>> >> would be a great improvement for maintainability.
>> >
>> > The way that we approached this before with ceph-disk was that you would
>> > prelabel replacement disks as "blank ceph" or similar.  That way, if we
>> > saw a ceph-labeled disk that hadn't been used yet, we would know it was
>> > fair game.  This is a lot more work for the admin (you have to attach
>> > each of your replacement disks to mark them and then put them in the
>> > replacement pile), but it is safer.
>> >
>> > An alternative model would be to have a reasonably reliable way to
>> > identify a fresh disk from the factory.  I'm honestly not sure what new
>> > disks look like these days (zero? empty NTFS partition?), but we could
>> > try to recognize "blank" and go from there.
>> >
>> > Unfortunately we can't assume blank if we see garbage because the disk
>> > might be encrypted.
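A rough sketch of what such a check might look like, purely as an
illustration (the "blank ceph" GPT partition label and the lsblk probing
are assumptions made for the sketch, not anything ceph-disk actually does):

import json
import subprocess

def classify_disk(dev):
    """Classify a whole disk as 'prelabeled', 'blank', or 'in-use'.

    Illustrative only: assumes the admin pre-creates a GPT partition
    labeled "blank ceph" on replacement disks; a disk with no partitions
    and no filesystem signature is treated as blank.  Anything else,
    including unreadable garbage (which might be an encrypted disk),
    is treated as in-use and left alone.
    """
    out = subprocess.check_output(
        ["lsblk", "--json", "-o", "NAME,FSTYPE,PARTLABEL", dev])
    disk = json.loads(out)["blockdevices"][0]
    parts = disk.get("children") or []

    if any(p.get("partlabel") == "blank ceph" for p in parts):
        return "prelabeled"
    if not parts and not disk.get("fstype"):
        return "blank"
    return "in-use"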
>> >
>> > Anyway, assuming that part was sorted out, I think a complete solution
>> > would query the mons to see which OSD id used to live in that particular
>> > by-path location/slot and try to re-use that OSD ID.  We still have a
>> > manual task built in here of marking the failed OSD "destroyed", since
>> > reusing the ID means the cluster may assume the PG copies on that OSD
>> > are lost.
>> >
>> > sage
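For the ID-reuse step Sage describes, the flow might look roughly like the
following (a sketch only: it assumes the by-path-to-OSD-id mapping is
already known, and that losing the old OSD's data has been confirmed to be
acceptable):

import subprocess

def rebuild_with_same_id(osd_id, dev):
    """Sketch of reprovisioning a replacement disk under the old OSD id.

    'ceph osd destroy' is the manual step mentioned above: it declares
    the data on the failed OSD gone, so it must only be run once the
    cluster no longer needs those PG copies.
    """
    subprocess.run(
        ["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"],
        check=True)
    # ceph-volume can reuse an existing (destroyed) OSD id via --osd-id.
    subprocess.run(
        ["ceph-volume", "lvm", "create", "--osd-id", str(osd_id),
         "--data", dev],
        check=True)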
>>
>> ZFS handles this with the "autoreplace" pool flag.  If that flag is
>> set, then a blank drive inserted into a slot formerly occupied by a
>> member of the pool will automatically be used to replace that former
>> member.  The nice thing about this scheme is that it requires no
>> per-drive effort on the part of the sysadmin, and it doesn't touch
>> drives inserted into other slots.  The less nice thing is that it only
>> works with SES expanders that provide slot information.  I don't like
>> Sage's second suggestion because it basically takes over the entire
>> server.  If newly inserted blank drives are instantly gobbled up by
>> Ceph, then they can't be used by anything else.  IMHO that kind of
>> greedy functionality shouldn't be built into something as general as
>> Ceph (though perhaps it could go in a separately installed daemon).
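For comparison, the ZFS side of this is a single pool property; a minimal
illustration (the pool name "tank" is just an example):

import subprocess

# With autoreplace=on, ZFS will automatically use a blank disk inserted
# into a slot that previously held a pool member as the replacement
# (given SES/enclosure slot information, as noted above).
pool = "tank"  # example pool name
subprocess.run(["zpool", "set", "autoreplace=on", pool], check=True)
subprocess.run(["zpool", "get", "autoreplace", pool], check=True)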
>
> I think in our case it would be a separate daemon either way, one that
> monitors slots and reprovisions OSDs when appropriate.  I like this
> approach!  I think it would involve:
>
> - A cluster option, not a pool one, since Ceph pools are about
> logical data collections, not hardware.
>
> - A clearer mapping between OSDs and the block device(s) they consume, and
> some additional metadata on those devices (in this case, the
> /dev/disk/by-path string ought to suffice).  I know John has done some
> work here but I think things are still a bit ad hoc.  For example, the osd
> metadata reporting for bluestore devices is pretty unstructured.  (We also
> want a clearer list of devices, and properties for each device, for the
> SMART data reporting.)
>
> - A daemon (e.g., ceph-osd-autoreplace) that runs on each machine or a
> tool that is triggered by udev.  It would check for new, empty devices
> appearing in the locations (as defined by the by-path string) previously
> occupied by OSDs that are down.  If that happens, it can use 'ceph osd
> safe-to-destroy' to verify whether it is safe to automatically rebuild
> that OSD.  (If not, it might want to raise a health alert, since it's
> possible the drive that was physically pulled should be preserved until
> the cluster is sure it doesn't need it.)
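A rough sketch of that check, with the open questions left as placeholders
(how the slot-to-OSD mapping is recorded, what counts as "empty", and the
rebuild itself are all assumptions of the sketch):

import json
import subprocess

def osd_is_down(osd_id):
    """Ask the cluster whether the given OSD is down ('ceph osd tree')."""
    out = subprocess.check_output(["ceph", "osd", "tree", "--format", "json"])
    for node in json.loads(out)["nodes"]:
        if node.get("type") == "osd" and node["id"] == osd_id:
            return node.get("status") == "down"
    return False

def handle_new_device(by_path, dev, slot_to_osd):
    """Called (e.g. from a udev hook) when a device shows up at a by-path slot.

    slot_to_osd is an assumed mapping from by-path strings to the OSD id
    that previously lived there; maintaining that mapping is exactly the
    metadata gap described above.
    """
    osd_id = slot_to_osd.get(by_path)
    if osd_id is None or not osd_is_down(osd_id):
        return  # unknown slot, or the old OSD is still up: leave it alone

    # Only proceed if the cluster no longer needs the old OSD's data.
    safe = subprocess.run(["ceph", "osd", "safe-to-destroy", str(osd_id)])
    if safe.returncode != 0:
        # Better to raise a health alert here: the pulled drive may still
        # hold PG copies the cluster needs.
        return

    print("would rebuild osd.%d onto %s" % (osd_id, dev))  # placeholder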

It's a neat idea... I'm trying to get over my instinctive discomfort
with tools that format drives without administrator intervention!

At the point that we have a daemon that is detecting new devices, it
might be better to have it report those devices to something central,
so that we can prompt the user with an "I see a new drive that looks
like a replacement, shall we go for it?" and the user can either say
yes, or flag that the drive should be ignored by Ceph.

Building such a daemon feels like quite a significant step: once it's
there it would be awfully tempting to use it for other things.  It
depends on whether we want to own that piece, or whether we would rather
hold out for container environments that can report drives to us and
thereby avoid the need for our own drive-detecting daemon.

John

>
> ?
>
> sage
>