On Fri, Jan 16, 2015 at 12:39 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> [adding ceph-devel, ceph-calamari]
>
> On Fri, 16 Jan 2015, John Spray wrote:
> > Ideally we would have a solution that preserved the OSD hot-plugging
> > ability (it's a neat feature).
> >
> > Perhaps the crush location logic should be:
> > * If nobody ever overrode me, default behaviour
> > * If someone (calamari) set an explicit location, preserve that
> > * UNLESS I am on a different hostname than I was when the explicit
> >   location was set, in which case kick in the hotplug behaviour
>
> This would be nice...

I agree this sounds fine, and easy enough to explain.

> The hotplug path might just be to reset my location in the existing
> way, or if calamari was really clever it could define how to handle a
> hostname change within a different root (typically the 'ssd' root
> people create) such that if I unplugged ssd_root->myhost_ssd and
> plugged it into foohost, then it would reset its crush location to
> ssd_root->foohost_ssd instead of root->foohost.
>
> > We might want to consider adding a flag into the crush map itself so
> > that nodes can be "locked" to indicate that their location was set by
> > human intent rather than the crush-location script.
>
> Perhaps a per-osd flag in the OSDMap?  We have a field for this right now,
> although none of the fields are user-modifiable (they are things like up
> and exists).

I think that makes the most sense.

So if I understand this correctly, we are talking about adding data to the
CRUSH map for the crush-location script to read. The script does not appear
to talk to the cluster at present:

ubuntu@vpm148:~$ strace ceph-crush-location --cluster ceph --id 0 --type osd 2>&1 | grep ^open
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/bin/ceph-crush-location", O_RDONLY) = 3

> We may also be able to avoid the pain in some cases if we bite the bullet
> and standardize how to handle parallel hdd vs ssd vs whatever trees.  Two
> approaches come to mind:

I have always thought it strange to have multiple trees that duplicate things
with one physical presence, e.g. foohost-spinning-disk and foohost-SSD. It
seems to me that if we are going to change things we should tag the OSDs as
SSDs, but I don't think we should restrict the tagging to OSDs. It would be a
good option for the future if each node could be tagged in some way; there
are things on a host you might care to tag too, like networking capability
(10Gbit vs 1Gbit links). This suggestion would introduce a need to migrate
existing crush maps and rules, perhaps more than we want. Maybe we take steps
in this direction now so we can get there by the Ceph K series.

> 1) Make a tree like
>
> root ssd
>     host host1:ssd
>         osd.0
>         osd.1
>     host host2:ssd
>         osd.2
>         osd.3
> root sata
>     host host1:sata
>         osd.4
>         osd.5
>     host host2:sata
>         osd.6
>         osd.7
>
> where we 'standardize' (by convention) on ':' as a separator between name
> and device type.  Then we could modify the crush location process to take
> a 'host=host1' location and a current host of host1:ssd as a match and
> make no change.
>
> 2) Make the per-type tree generation programmatic.  So you would build a
> single tree like this:
>
> root default
>     host host1
>         devicetype ssd
>             osd.0
>             osd.1
>         devicetype hdd
>             osd.4
>             osd.5
>     host host2
>         devicetype ssd
>             osd.2
>             osd.3
>         devicetype hdd
>             osd.6
>             osd.7
>
> and then on any map change a function in the monitor would programmatically
> create a set of per-type trees in the same map:
>
> root default
>     host host1
>         devicetype ssd
>             osd.0
>             osd.1
>         devicetype hdd
>             osd.4
>             osd.5
>     host host2
>         devicetype ssd
>             osd.2
>             osd.3
>         devicetype hdd
>             osd.6
>             osd.7
> root default-devicetype:ssd
>     host host1-devicetype:ssd
>         osd.0
>         osd.1
>     host host2-devicetype:ssd
>         osd.2
>         osd.3
> root default-devicetype:hdd
>     host host1-devicetype:hdd
>         osd.4
>         osd.5
>     host host2-devicetype:hdd
>         osd.6
>         osd.7
>
> The nice thing about this is the crush location script goes on specifying
> the same thing it does now, like host=host1 rack=rack1 etc.  The only
> thing we add is a devicetype=ssd or hdd, perhaps based on what we glean
> from /sys/block/* (e.g., there is a 'rotating' flag in there to help
> identify SSDs).  Rules that use 'default' will see no change.  But if this
> feature is enabled and we start generating trees based on the 'devicetype'
> crush type we'll get a new set of automagic roots that rules can use
> instead.

Regardless of the representation, more smarts about how we discover these
capabilities sounds awesome to me.
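
For what it's worth, the flag Sage mentions is exposed at
/sys/block/<dev>/queue/rotational, so the discovery part is small. Here is a
rough, untested sketch of how a location hook could use it to pick a
devicetype; 'sda' is only a placeholder, and mapping an OSD back to its data
disk is the part I am hand-waving over:

#!/usr/bin/env python
# Rough sketch, untested: classify an OSD's data disk from the kernel's
# rotational flag so a location hook could emit devicetype=ssd or
# devicetype=hdd.  'sda' below is just a placeholder device name.

def devicetype(block_dev):
    # /sys/block/<dev>/queue/rotational is '0' for non-rotating devices (SSDs)
    try:
        with open('/sys/block/%s/queue/rotational' % block_dev) as f:
            return 'ssd' if f.read().strip() == '0' else 'hdd'
    except IOError:
        return 'hdd'  # conservative guess if the flag is missing

if __name__ == '__main__':
    print('devicetype=%s' % devicetype('sda'))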
> This doesn't really address the Calamari problem, though... but it would
> solve one of the main use-cases for customizing the map, I think?

You are right that it does not really address the Calamari problem. The thing
I need to solve is how to make the ceph-crush-location script smart about
coexisting with changes to the CRUSH map.

Gregory

> sage
>
> > John
> >
> > On Fri, Jan 16, 2015 at 2:14 PM, Gregory Meno <gmeno@xxxxxxxxxx> wrote:
> > > The problem I am trying to solve is:
> > > Calamari now has the ability to manage the CRUSH map, and for that to
> > > be useful I need to prevent the default behavior of OSDs updating
> > > their location on start.
> > >
> > > The config surrounding crush_location seems complicated enough that I
> > > want some help deciding on the best approach.
> > >
> > > http://tracker.ceph.com/issues/8667 contains the background info
> > >
> > > options:
> > >
> > > - Calamari sets "osd update on start" to false on all OSDs it manages.
> > >
> > > - Calamari sets "osd crush location hook" on all OSDs it manages.
> > >
> > > criteria:
> > >
> > > - don't piss off admins with existing clusters and configs
> > >
> > > - the solution still applies later in the life-cycle, when new OSDs
> > >   are added
> > >
> > > - ??? am I missing more
> > >
> > > comparison:
> > > TBD
> > >
> > > recommendation:
> > >
> > > After talking to Dan, the solution that seems best is:
> > >
> > > Have Calamari set "osd crush location hook" to a script that asks
> > > either Calamari or the cluster for the OSD's last known location in
> > > the CRUSH map; if this is a new OSD, fall back to a sensible default,
> > > e.g. the behavior as if "osd update on start" were true.
> > >
> > > The thing I like most about this approach is that we edit the config
> > > file only once.
> > >
> > > regards,
> > > Gregory
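
P.S. To make the recommendation quoted above a bit more concrete, here is a
rough, untested sketch of the kind of hook Calamari could install. Everything
in it is a placeholder rather than a design: it assumes the crush location
reported by 'ceph osd find' is good enough to reuse, it folds in John's
"unless I moved hosts" rule by comparing hostnames, and the fallback just
mimics the current update-on-start placement. The hook would be invoked the
way the default one is (see the strace output above) and, as far as I can
tell, is expected to print space-separated key=value pairs on stdout.

#!/usr/bin/env python
#
# Rough sketch, untested.  A crush location hook that asks the cluster for
# the OSD's last known location and only falls back to a default placement
# for new OSDs, or when the recorded location points at a different host
# (the hot-plug case).
import json
import socket
import subprocess
import sys

def existing_location(osd_id):
    # 'ceph osd find' dumps the OSD's current crush location as JSON;
    # return None if the OSD is not in the map yet.
    try:
        out = subprocess.check_output(
            ['ceph', 'osd', 'find', str(osd_id), '--format=json'])
        return json.loads(out).get('crush_location') or None
    except (subprocess.CalledProcessError, ValueError):
        return None

def default_location():
    # roughly what the current update-on-start behaviour would pick
    return {'host': socket.gethostname().split('.')[0], 'root': 'default'}

def main():
    # the hook is called with --cluster <name> --id <n> --type osd
    osd_id = sys.argv[sys.argv.index('--id') + 1]
    loc = existing_location(osd_id)
    if loc and loc.get('host') != socket.gethostname().split('.')[0]:
        # recorded location is on another host: let the hot-plug
        # (default) behaviour kick in again
        loc = None
    loc = loc or default_location()
    print(' '.join('%s=%s' % (k, v) for k, v in sorted(loc.items())))

if __name__ == '__main__':
    main()

If we end up with a "locked" flag in the OSDMap instead, the same script
could key off that flag rather than comparing hostnames.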