On Tue, 20 Jan 2015, Gregory Meno wrote:
> >> [...]
> >> You are right about not really addressing calamari. The thing I need
> >> to solve is how to make ceph-crush-location script smart about
> >> coexisting with changes to the crush map.
> >
> > Yep, let's solve that problem first. :)
>
> So I see solving this problem with Calamari is a precursor to
> improving the way this is handled in Ceph.
>
> How does this sound:
>
> When Calamari makes a change to the CRUSH map where an OSD gets
> reparented to a different CRUSH tree it stores a set of key-value
> pairs and physical host in ceph config-key e.g.
>
> rootA -> hostA -> OSD1, OSD2
>
> becomes
>
> rootA -> hostA -> OSD1
>
> rootB -> hostB -> OSD2
>
> and
>
> ceph config-key get 'calamari:1:osd_crush_location:osd.2' = {'paths':
> [[root=rootB, host=hostB]], 'physical_host': hostA}
>
> When the OSD starts up a calamari-specific script sends a mon command
> to get the data we persisted in the config-key, if none exists we
> return the default crush_path, otherwise if match the physical_host to
> the node where this OSD is starting then we return the stored path. If
> the host match fails we return the default crush_path so that
> hot-plugging continues to work.
>
> and Calamari sets "osd crush location hook" on all OSDs it manages

Hmm, with that logic, I think what we have now will actually work
unmodified? If the *actual* crush location is, say,

  root=a rack=b host=c

and the hook says

  root=foo rack=bb host=c

it will make no change. It looks for the innermost (by crush type id)
field and if it matches it's a no-op. OTOH if the hook says

  root=foo rack=bb host=cc

then it will move it to a new location. Again, though, we start with the
innermost fields and stop once there is a match. So if rack=bb exists but
under root=bar, we will end up with

  root=bar rack=bb host=cc

because we stop at the first item that is already present (rack=bb).
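
To make the innermost-first walk described above concrete, here is a rough
Python sketch. It is not the actual mon code: the function names, the
TYPE_ORDER list, and the flat child -> parent dict standing in for the
crush map are all made up for illustration.

```python
# Assumed type ordering, innermost first (mirrors ascending crush type id).
TYPE_ORDER = ["host", "rack", "root"]


def apply_hook_location(parents, osd, loc):
    """Place `osd` per `loc` (type -> bucket name); return True if the map changed.

    `parents` maps each item or bucket name to its parent bucket name
    (None for a root). It is modified in place.
    """
    wanted = [loc[t] for t in TYPE_ORDER if t in loc]
    if not wanted:
        return False
    # Only the innermost field is compared: same bucket means no-op.
    if parents.get(osd) == wanted[0]:
        return False
    child = osd
    for bucket in wanted:
        existed = bucket in parents
        parents[child] = bucket
        if existed:
            # First bucket already in the map: keep its current ancestry.
            break
        parents[bucket] = None  # new bucket, may be linked further up next
        child = bucket
    return True


def location_of(parents, item):
    """Return the chain of ancestors of `item`, innermost first."""
    chain = []
    while parents.get(item) is not None:
        item = parents[item]
        chain.append(item)
    return chain


# Hypothetical map: osd.2 under root=a rack=b host=c, plus rack=bb under root=bar.
crush = {"osd.2": "c", "c": "b", "b": "a", "a": None, "bb": "bar", "bar": None}

# host=c already matches, so nothing moves even though rack and root differ.
apply_hook_location(crush, "osd.2", {"root": "foo", "rack": "bb", "host": "c"})

# host=cc is new; rack=bb already exists, so we stop there and keep root=bar.
apply_hook_location(crush, "osd.2", {"root": "foo", "rack": "bb", "host": "cc"})
print(location_of(crush, "osd.2"))  # ['cc', 'bb', 'bar']
```

Note that root=foo is never created in the second call, since the walk
stops at rack=bb.
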
Mainly this means that if we move a host to a new rack the OSDs won't
move themselves around... the admin needs to adjust the crush map
explicitly.

Anyway, does that look right?

...

If that *doesn't* work, it brings up a couple of questions, though...

1) Should this be a 'calamari' override or a generic ceph one? It could
go straight into the default hook. That would simplify things.

2) I have some doubts about whether the crush location update via the
init script is a good idea. I have a half-finished patch that moves this
step into the OSD itself so that the init script doesn't block when the
mons are down; instead, ceph-osd will start (and maybe fork) as usual and
then retry until the mons become available, do the crush update, and then
do the rest of its boot sequence. We also avoid duplicating the
implementation in the sysvinit script and upstart/systemd helper (which
IIRC is somewhat awkward to trigger, the original motivation for this
patch).

sage