On Wed, 18 Oct 2017, alan somers wrote:
> On Tue, Oct 17, 2017 at 3:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Tue, 17 Oct 2017, Yoann Moulin wrote:
> >> Hello,
> >>
> >> I wonder if it's possible to add to Ceph the ability to automatically
> >> rebuild an OSD onto a new disk freshly added to a slot, replacing a
> >> failed OSD disk.
> >>
> >> I imagine something like adding a flag on disks (identified by-path,
> >> for example, or some other way to address the device deterministically)
> >> so that they are auto-reconfigured if they are detected with no data
> >> after the same slot was marked as failed. Only disks identified
> >> "by-path" with an auto-rebuild flag activated, and whose slot was
> >> previously marked as failed, would run the auto-reconfiguration.
> >>
> >> I know it may take some time to find the best way to implement that
> >> feature without risking zapping a disk that still has data on it, but
> >> it would be a great maintainability improvement.
> >
> > The way that we approached this before with ceph-disk was that you
> > would prelabel replacement disks as "blank ceph" or similar. That way,
> > if we saw a ceph-labeled disk that hadn't been used yet, we would know
> > it was fair game. This is a lot more work for the admin (you have to
> > attach each of your replacement disks to mark them and then put them
> > in the replacement pile) but it is safer.
> >
> > An alternative model would be to have a reasonably reliable way to
> > identify a fresh disk from the factory. I'm honestly not sure what new
> > disks look like these days (all zeros? an empty NTFS partition?) but
> > we could try to recognize "blank" and go from there.
> >
> > Unfortunately we can't assume blank if we see garbage, because the
> > disk might be encrypted.
> >
> > Anyway, assuming that part was sorted out, I think a complete solution
> > would query the mons to see what OSD id used to live in that
> > particular by-path location/slot and try to re-use that OSD ID. We
> > have still built in a manual step here of marking the failed OSD
> > "destroyed", since reusing the ID means the cluster may make
> > assumptions about PG copies on the OSD being lost.
> >
> > sage
>
> ZFS handles this with the "autoreplace" pool flag. If that flag is set,
> then a blank drive inserted into a slot formerly occupied by a member of
> the pool will automatically be used to replace that former member. The
> nice thing about this scheme is that it requires no per-drive effort on
> the part of the sysadmin, and it doesn't touch drives inserted into
> other slots. The less nice thing is that it only works with SES
> expanders that provide slot information. I don't like Sage's second
> suggestion because it basically takes over the entire server. If newly
> inserted blank drives are instantly gobbled up by Ceph, then they can't
> be used by anything else. IMHO that kind of greedy functionality
> shouldn't be built into something as general as Ceph (though perhaps it
> could go in a separately installed daemon).

I think in our case it would be a separate daemon either way that is
monitoring slots and reprovisioning OSDs when appropriate. I like this
approach! I think it would involve:

- A cluster option, not a pool one, since Ceph pools are about logical
  data collections, not hardware.

- A clearer mapping between OSDs and the block device(s) they consume,
  and some additional metadata on those devices (in this case, the
  /dev/disk/by-path string ought to suffice). I know John has done some
  work here, but I think things are still a bit ad hoc. For example, the
  osd metadata reporting for bluestore devices is pretty unstructured.
  (We also want a clearer device list and device properties for the
  SMART data reporting.)

- A daemon (e.g., ceph-osd-autoreplace) that runs on each machine, or a
  tool that is triggered by udev. It would check for new, empty devices
  appearing in the locations (as defined by the by-path string)
  previously occupied by OSDs that are down. If that happens, it can use
  'ceph osd safe-to-destroy' to verify whether it is safe to
  automatically rebuild that OSD. (If not, it might want to raise a
  health alert, since it's possible the drive that was physically pulled
  should be preserved until the cluster is sure it doesn't need it.) A
  rough sketch of that check follows below.
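Very roughly, and as an illustrative sketch only (the CLI calls exist
today, but the blank-device heuristic is deliberately crude and the
slot-to-OSD mapping passed in below is state the daemon would have to
maintain itself; nothing in Ceph keeps it yet):

# ceph-osd-autoreplace check (sketch): a device has just appeared at a
# by-path location that used to hold an OSD which is now down.  Decide
# whether it is safe to rebuild that OSD automatically.
import os
import subprocess
import sys

def is_blank(dev_node):
    # Crude heuristic: all zeros in the first 4 MB.  An encrypted disk
    # looks like random data, so it reads as "not blank", which is the
    # safe direction: we leave it alone.
    with open(dev_node, 'rb') as f:
        return f.read(4 * 1024 * 1024).strip(b'\x00') == b''

def handle_new_device(by_path, previous_osd_for_slot):
    osd_id = previous_osd_for_slot.get(by_path)
    if osd_id is None:
        return                        # not a slot we manage; ignore it
    dev_node = os.path.realpath(by_path)  # e.g. /dev/sdb
    if not is_blank(dev_node):
        return                        # has data (or is encrypted); hands off
    # Only proceed if the cluster no longer needs the old OSD's data;
    # a non-zero exit status is treated as "not safe".
    if subprocess.call(['ceph', 'osd', 'safe-to-destroy', str(osd_id)]) != 0:
        # This is where the health alert would be raised; a log line is
        # only a stand-in for that.
        sys.stderr.write('osd.%d is not safe to destroy; keep the old '
                         'drive around!\n' % osd_id)
        return
    # Mark the old OSD destroyed so its id can be reused.
    subprocess.check_call(['ceph', 'osd', 'destroy', str(osd_id),
                           '--yes-i-really-mean-it'])

The reprovisioning that would follow (recreating the OSD with the same
id via ceph-disk or ceph-volume) is deliberately left out of the sketch;
whether this runs as a long-lived daemon or a udev-triggered tool is
mostly packaging, the check itself is the same either way.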
?
sage
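P.S. For completeness, deriving the by-path -> OSD id mapping from what
we report today would look something like the sketch below, which also
shows why I call the current metadata ad hoc: there is no single
well-defined field to match on, so this just scans the reported values
and hopes. Treat the field handling as an assumption, not a reference.

# Sketch only: best-effort lookup of which OSD was consuming the device
# at a given /dev/disk/by-path location, by scanning 'ceph osd metadata'.
import json
import os
import subprocess

def osd_id_for_by_path(by_path):
    dev_node = os.path.realpath(by_path)   # e.g. /dev/sdb
    out = subprocess.check_output(
        ['ceph', 'osd', 'metadata', '--format', 'json'])
    for md in json.loads(out.decode('utf-8')):
        # The data device is often reported as a partition (e.g.
        # /dev/sdb2), so match any value that starts with the parent
        # device node.  Field names vary by backend and release.
        if any(str(v).startswith(dev_node) for v in md.values()):
            return md['id']
    return None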