On Fri, Feb 8, 2013 at 4:45 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> Hi Marcus-
>
> On Fri, 8 Feb 2013, Marcus Sorensen wrote:
>> I know people have been discussing on and off about providing a
>> "preferred OSD" for things like multi-datacenter setups, or, even
>> within a single datacenter, choosing an OSD that would avoid
>> traversing uplinks. Has there been any discussion on how to do this?
>> I seem to remember people saying things like 'the crush map doesn't
>> work that way at the moment'. Presumably, when a client needs to
>> access an object, it looks up where the object should be stored via
>> the crush map, which returns all OSDs that could be read from.
>
> Exactly.
>
>> I was thinking this morning that you could potentially leave the
>> crush map out of it, by setting a location for each OSD in ceph.conf
>> and an /etc/ceph/location file for the client, then using the
>> absolute value of the difference to determine the preferred OSD. So,
>> if OSD0 were at location=1, OSD1 at location=3, and client 1 at
>> location=2, the client would do the normal thing, but if client 1
>> were at location=1.3, it would prefer OSD0 for reads. Perhaps that's
>> overly simplistic and wouldn't scale to meet everyone's requirements,
>> but you could define multiple locations and sprinkle clients in
>> between them in various ways. Or perhaps the location is a matrix, so
>> you could literally map it out on a grid with a set of coordinates.
>> What ideas are being discussed around how to implement this?
>
> We can do something like this for reads today, where we pick a read
> replica based on the closest IP or some other metric/mask. We generally
> don't enable this because it leads to non-optimal cache behavior, but it
> could in principle be enabled via a config option for certain clusters
> (and in fact some of that code is already in place).

Just to be specific: there are currently flags that let the client read
from the local host if it can figure that out. They aren't heavily
tested, but they do work when we turn them on. Other metrics of
"closeness" aren't implemented yet, though.

In general, CRUSH locations seem like a good measure of closeness that
the client could rely on, rather than a separate "location" value,
although that becomes less useful if you've configured multiple CRUSH
root nodes. I think it would need to support a tree of some kind,
though, rather than just a linear value; a rough sketch of what I have
in mind is below.
-Greg
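
To make the tree idea concrete, here is a rough, untested sketch (plain
Python, with made-up names and data structures; none of this is actual
Ceph code) of a client-side chooser that scores each replica in the
acting set by how many CRUSH bucket levels it shares with the client,
instead of by a single linear distance:

```python
# Hypothetical sketch only: score read replicas by how much of their CRUSH
# path (root -> datacenter -> rack -> host) they share with the client.
# Names and structures below are invented for illustration.

def shared_levels(client_loc, osd_loc):
    """Count the CRUSH bucket levels two locations have in common,
    starting from the root of the hierarchy."""
    n = 0
    for a, b in zip(client_loc, osd_loc):
        if a != b:
            break
        n += 1
    return n

def pick_read_osd(client_loc, acting_set):
    """acting_set: list of (osd_id, crush_location) tuples, primary first.
    Prefer the replica whose CRUSH path overlaps most with the client's;
    ties fall back to the primary because max() keeps the first maximum."""
    return max(acting_set,
               key=lambda entry: shared_levels(client_loc, entry[1]))[0]

# Example: a client in dc1/rack2 prefers osd.0 over the primary (osd.1 in dc2).
client = ("default", "dc1", "rack2")
acting = [
    (1, ("default", "dc2", "rack5", "host-a")),  # primary
    (0, ("default", "dc1", "rack2", "host-b")),
    (7, ("default", "dc1", "rack3", "host-c")),
]
print(pick_read_osd(client, acting))  # -> 0
```

The linear scheme would just be the degenerate one-level case of this,
and multiple CRUSH roots simply show up as replicas that share zero
levels with the client, which matches the caveat above.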