On 19 April 2016 at 05:07, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Apr 18, 2016 at 11:57 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Mon, 18 Apr 2016, Adam C. Emerson wrote:
>>> > I think that in those cases, we let them use a wacky object -> PG
>>> > mapping, and then have a linear/flat PG -> device mapping so they
>>> > can still get some uniformity.
>>> >
>>> > This isn't completely general, but I'd want to see an example of
>>> > something it can't express. Maybe those entanglement erasure codes
>>> > that Veronica was talking about at FAST?
>>> >
>>> > Or maybe the key step is to not assume the PG is a hash range, but
>>> > instead consider it an arbitrary bucket in the pool.
>>>
>>> Would the idea basically be, in that case, that we go from OID =>
>>> WHATEVER => DEVICE LIST without putting too many constraints on what
>>> 'WHATEVER' is?
>>
>> Yeah, probably with the additional/implicit restriction that the OID ->
>> WHATEVER mapping is fixed. That way, when a policy or topology change
>> happens, we do O(num WHATEVERs) remapping work. It's the O(num objects)
>> part that is really problematic.
>>
>> (I would say fundamental, but you *could* imagine something where you know
>> the old and new mapping, and try both, or incrementally walk through them..
>> but, man, it would be ugly. At that point you're probably better off just
>> migrating objects from pool A to pool B.)
>
> Actually, exactly that is a thing I've blue-skied with...somebody...as
> a more graceful way of dealing with cluster expansions. You could have
> CRUSH forks/epochs/whatever, and when making drastic changes to your
> cluster, create a new fork. (eg, we just tripled our storage and want
> all new data to be on the new servers, but to keep it in the same
> pool, and not go through a data migration right now.) Then any object
> access addresses every fork (or maybe you have metadata which knows
> which fork it was written in) looking for it.
> Then you could do things like incrementally merging forks (eg, fork 1
> for all objects > hash x, otherwise it has moved to fork 2) to bring
> them into unison.
>
> I'm not 100% certain this is better than just getting our pg
> replication easier to work well with, but it has some nice properties
> (it makes moving a single PG within the cluster easy and gives the
> admin a lot more control) and, unlike more manual movement (rbd
> snapshot+copyup), doesn't require intelligent clients, just an
> intelligent Objecter. On the downside, object fetches suddenly turn
> into 2 seek IOs instead of 1.

Interesting thread. Just wanted to say, as an institution with a couple
of fairly large clusters and a few years of operational experience under
our belts now, this particular comment of Greg's resonates. CRUSH is
awesome and elegant, but one of our biggest operational pain points is
managing map changes and their impacts. Our reality is that we impose
more map changes on the cluster through standard maintenance and
expansion/refresh activity than failures do (even at the ~1000-drive
scale), and we'd really like to have more flexibility around managing
those activities.

--
Cheers,
~Blairo
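
A minimal sketch of the OID => WHATEVER => DEVICE LIST split Sage describes
above. The names (oid_to_bucket, build_bucket_map, NUM_BUCKETS) are made up
for illustration and are not Ceph's real CRUSH interfaces; the point is only
that because the object -> bucket step is a fixed hash, a topology or policy
change rebuilds just the bucket -> devices table, O(num buckets) work rather
than O(num objects).

```python
import hashlib

NUM_BUCKETS = 128  # illustrative stand-in for a pool's pg_num


def oid_to_bucket(oid: str) -> int:
    """Fixed object -> bucket step: never changes when the cluster changes."""
    h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "big")
    return h % NUM_BUCKETS


def build_bucket_map(devices: list[str], replicas: int = 3) -> dict[int, list[str]]:
    """Policy-driven bucket -> device-list step (CRUSH's job in real life).

    Recomputed on every topology or policy change, but only over
    NUM_BUCKETS entries -- never per object."""
    return {
        b: [devices[(b + i) % len(devices)] for i in range(replicas)]
        for b in range(NUM_BUCKETS)
    }


def locate(oid: str, bucket_map: dict[int, list[str]]) -> list[str]:
    return bucket_map[oid_to_bucket(oid)]


# Tripling the cluster rebuilds only the 128-entry bucket map; which bucket
# an object hashes to is untouched.
old_map = build_bucket_map([f"osd.{i}" for i in range(10)])
new_map = build_bucket_map([f"osd.{i}" for i in range(30)])
print(locate("rbd_data.1234", old_map), "->", locate("rbd_data.1234", new_map))
```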
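And a rough sketch of Greg's blue-sky fork idea, again with hypothetical names
(Fork, ForkedPool, advance_merge) -- nothing like this exists in the Objecter
today. Writes target the newest fork, reads fall back through older forks
(the "2 seek IOs instead of 1" downside), and a hash boundary lets an old fork
be merged forward incrementally, as in "fork 1 for all objects > hash x,
otherwise it has moved to fork 2".

```python
import hashlib


def oid_hash(oid: str) -> int:
    # Illustrative stable hash; stands in for whatever the pool actually uses.
    return int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "big")


class Fork:
    """One frozen placement epoch: its own bucket -> device-list table."""

    def __init__(self, fork_id: int, bucket_map: dict[int, list[str]]):
        self.fork_id = fork_id
        self.bucket_map = bucket_map

    def locate(self, oid: str) -> list[str]:
        return self.bucket_map[oid_hash(oid) % len(self.bucket_map)]


class ForkedPool:
    """Hypothetical client-side (Objecter) view of a pool with placement forks."""

    def __init__(self, forks: list[Fork]):
        self.forks = forks            # oldest first; new data goes to the newest
        self.merge_boundary = 0       # objects with hash < boundary are merged forward

    def write_target(self, oid: str) -> list[str]:
        # New writes always land under the newest fork's placement.
        return self.forks[-1].locate(oid)

    def read_candidates(self, oid: str) -> list[list[str]]:
        # Below the merge boundary the object is guaranteed to be in the newest
        # fork; otherwise every fork must be consulted, newest first -- the
        # extra lookup Greg mentions.
        if oid_hash(oid) < self.merge_boundary:
            return [self.forks[-1].locate(oid)]
        return [f.locate(oid) for f in reversed(self.forks)]

    def advance_merge(self, new_boundary: int) -> None:
        # Incremental merge: once objects with hash in [old boundary,
        # new_boundary) have been moved into the newest fork, bump the
        # boundary so readers stop checking the old forks for them.
        assert new_boundary >= self.merge_boundary
        self.merge_boundary = new_boundary


# e.g. after tripling the cluster: keep the old fork, add a new one for new data
old = Fork(1, {b: [f"osd.{(b + i) % 10}" for i in range(3)] for b in range(64)})
new = Fork(2, {b: [f"osd.{10 + (b + i) % 20}" for i in range(3)] for b in range(64)})
pool = ForkedPool([old, new])
print(pool.write_target("rbd_data.abcd"))
print(pool.read_candidates("rbd_data.abcd"))
```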