I think the preferred osd placement groups need to go.  They were
originally added to get hadoop-style data placement, but it's not clear
that that is actually desirable, we don't recommend or test them, and
looking at them now there are some serious problems with how they
currently work.

- The idea was that there would be a small set of pgs that are bound to
  osds, but we aren't creating them when osds are added and removed.
  Nor can we, because data is stored in the pg and doesn't move between
  pgs: if we created new localized pgs when osds are added, previously
  stored data would now be in the wrong pg.  I think the only way it
  would work (given the current approach) is if you decided at cluster
  creation time what the max number of osds would be, and bound the
  localized pgs to osds the same way the stable_mod() function is used
  for the object to pg mapping (see the sketch at the end of this
  mail).

- It muddies the current abstractions.  The whole point of Ceph is that
  OSDs can come and go, and as soon as you can say "this data is stored
  on that disk" the whole thing gets messy.

- The forcefeed bits in CRUSH are hugely ugly.  It would be incredibly
  satisfying to rip them out.

- Using localized PGs can very easily screw up the distribution of data
  in the system.  Users won't have any concept of how much disk space
  is available, which means Ceph would have to monitor the localized
  data on each node and move other PGs away as appropriate.  Even if it
  did that, the localized data could still outgrow the node.  There are
  a few tricks that could be played to delay ENOSPC, but nothing
  particularly compelling.

- Nobody uses them.

I guess the question is: is there a compelling use case, and if there
is, is there a better approach?

sage
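
For the curious, the object-to-pg mapping I'm referring to above is the
stable_mod() trick: map an object's hash into b pgs so that when b
changes, as few objects as possible land in a different pg.  A minimal
standalone sketch follows; the function body follows ceph_stable_mod()
in the tree, while the main() harness is just for illustration:

  #include <stdio.h>

  /*
   * Map hash x into one of b buckets.  bmask is the next power of two
   * minus 1 that is >= b.  When b grows from 6 to 7, only objects with
   * (x & bmask) == 6 move (they were folded down into bucket 2 before);
   * everything else stays put.
   */
  static inline int stable_mod(int x, int b, int bmask)
  {
          if ((x & bmask) < b)
                  return x & bmask;
          else
                  return x & (bmask >> 1);
  }

  int main(void)
  {
          int x;

          /* grow from 6 pgs to 7 and watch which hashes remap */
          for (x = 0; x < 16; x++)
                  printf("x=%2d  b=6 -> pg %d   b=7 -> pg %d\n",
                         x, stable_mod(x, 6, 7), stable_mod(x, 7, 7));
          return 0;
  }

Binding localized pgs to osd ids with the same shape of function would
work, but only if the hash space (and hence the max osd count) were
fixed at cluster creation time, which is exactly the problem.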