On Mon, Apr 18, 2016 at 11:57 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 18 Apr 2016, Adam C. Emerson wrote:
>> > I think that in those cases, we let them use a wacky object -> PG
>> > mapping, and then have a linear/flat PG -> device mapping so they
>> > can still get some uniformity.
>> >
>> > This isn't completely general, but I'd want to see an example of
>> > something it can't express. Maybe those entanglement erasure codes
>> > that Veronica was talking about at FAST?
>> >
>> > Or maybe the key step is to not assume the PG is a hash range, but
>> > instead consider it an arbitrary bucket in the pool.
>>
>> Would the idea basically be in that case that we go from OID =>
>> WHATEVER => DEVICE LIST without putting too many constraints on what
>> 'WHATEVER' is?
>
> Yeah, probably with the additional/implicit restriction that the OID ->
> WHATEVER mapping is fixed. That way, when a policy or topology change
> happens, we do O(num WHATEVERs) remapping work. It's the O(num objects)
> part that is really problematic.
>
> (I would say fundamental, but you *could* imagine something where you
> know the old and new mapping, and try both, or incrementally walk
> through them... but, man, it would be ugly. At that point you're
> probably better off just migrating objects from pool A to pool B.)

Actually, exactly that is a thing I've blue-skied with...somebody...as a
more graceful way of dealing with cluster expansions. You could have
CRUSH forks/epochs/whatever, and when making drastic changes to your
cluster, create a new fork. (E.g., we just tripled our storage and want
all new data to be on the new servers, but to keep it in the same pool
and not go through a data migration right now.) Then any object access
addresses every fork (or maybe you have metadata which knows which fork
it was written in) looking for it. Then you could do things like
incrementally merging forks (e.g., fork 1 for all objects > hash x;
otherwise it has moved to fork 2) to bring them into unison.

I'm not 100% certain this is better than just getting our PG replication
to work well, but it has some nice properties (it makes moving a single
PG within the cluster easy and gives the admin a lot more control) and,
unlike more manual movement (rbd snapshot+copyup), it doesn't require
intelligent clients, just an intelligent Objecter. On the downside,
object fetches suddenly turn into 2 seek IOs instead of 1.
-Greg
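
A minimal sketch of that multi-fork lookup, in C++, for concreteness.
Everything here is hypothetical (Fork, lookup, and exists_on are made-up
names, not real Ceph or Objecter APIs); it only illustrates the
try-every-fork-newest-first idea, assuming each fork's placement is a
frozen pure function of the object id:

  // All names here are hypothetical; this is a sketch, not Ceph code.
  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  // One "fork": a frozen placement function from object id to device list.
  struct Fork {
    uint32_t epoch;  // fork creation epoch; newer forks have higher epochs
    std::function<std::vector<int>(const std::string&)> map_oid;
  };

  // Try forks newest-first and return the placement of the first fork
  // that actually holds the object. 'exists_on' stands in for the
  // stat/read the Objecter would issue; each miss costs another round
  // trip, which is the "2 seek IOs instead of 1" downside above.
  std::optional<std::vector<int>> lookup(
      const std::string& oid, const std::vector<Fork>& forks,
      const std::function<bool(const std::string&,
                               const std::vector<int>&)>& exists_on) {
    for (const auto& f : forks) {  // forks ordered newest-first
      auto devs = f.map_oid(oid);
      if (exists_on(oid, devs))
        return devs;
    }
    return std::nullopt;           // not present under any fork
  }

  int main() {
    std::hash<std::string> h;
    // Fork 2: mapping after tripling the cluster; fork 1: the old mapping.
    std::vector<Fork> forks = {
      {2, [&](const std::string& o) {
            return std::vector<int>{int(h(o) % 30)}; }},
      {1, [&](const std::string& o) {
            return std::vector<int>{int(h(o) % 10)}; }},
    };

    // Toy backend: this object was written before the expansion, so it
    // lives wherever fork 1 placed it.
    std::map<std::string, int> stored;
    stored["rbd_data.1234"] = forks[1].map_oid("rbd_data.1234")[0];
    auto exists_on = [&](const std::string& o, const std::vector<int>& devs) {
      auto it = stored.find(o);
      return it != stored.end() && it->second == devs[0];
    };

    if (auto devs = lookup("rbd_data.1234", forks, exists_on))
      std::cout << "found on device " << (*devs)[0] << "\n";
  }

Trying forks newest-first means fresh writes (which land under the
newest fork) pay a single probe, and only older objects fall through to
extra probes. Incremental merging would then just shrink the set of oids
for which the fork-1 probe can still hit.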