There are a few reasons.

At the most basic level, grouping objects into PGs (you might think of
them as "shards" of a pool) limits the metadata and tracking the system
has to do. Calculating the mapping is cheap but not computationally
free, so grouping helps there too.

A better answer has to do with reliability and the probability of data
loss. If we _independently_ calculate a placement for each object, it
doesn't take long before you can pick any two (or even three) nodes in
the system and find some object stored on exactly those nodes. That
means _any_ double failure guarantees you will lose some data. But
generally speaking we do want replication to be declustered, because it
allows massively parallel recovery and is (in general) a reliability
win(*). As a practical matter, we want to balance those two things and
mitigate the risk where possible by imposing some order (e.g.,
separating replicas across failure domains). PGs are the tool to do
that.

They are also nice when you have, say, a small pool with a small number
of objects: you can break it into a small number of PGs and talk to a
small number of OSDs (instead of talking to, say, 10 OSDs to read/write
10 objects).

From a more practical standpoint, PGs are a simple abstraction on which
to implement all the peering and synchronization between OSDs. Map
update processing only has to recalculate mappings for the PGs
currently stored locally, not for every single object stored locally.
And sync is in terms of the PG version for the PGs shared between a
pair of OSDs, not the versions of every object they share. The peering
protocols are a delicate balance between ease of synchronization,
simplicity, and minimization of centralized metadata.

sage

(* If you do the math, it's actually a wash for 2x: the probability of
losing any data at all is the same, although the amount you lose
differs. Once you factor in effects at the margins, declustered
replication is a slight win. If you look at the expected _amount_ of
data lost, declustering is always a win. See Qin Xin's paper on the
publications page for more info.)

On Tue, 15 Feb 2011, Tommi Virtanen wrote:
> Hi. I'm reading the thesis, and wondering what the thinking is behind
> how Ceph uses the placement groups (PGs).
>
> It seems that CRUSH is used for a deterministic, pseudorandom mapping,
> object_id --> pg --> osds. I'm wondering why the extra level of PGs
> was felt desirable, why that isn't just object_id --> osds.
>
> Colin explained on IRC that the primary OSD for a PG is responsible
> for some managerial duties, but that just tells me *how* PGs are used,
> not *why*. Surely you could organize these responsibilities
> differently, e.g. manage replication of an object on an
> object-by-object basis, by the primary OSD for that object.
>
> --
> :(){ :|:&};:
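
A rough illustration of the reliability point above, as a minimal
sketch rather than actual Ceph/CRUSH code: it compares independent
per-object placement with PG-based placement by counting how many
distinct OSD pairs end up sharing data. The names and parameters
(NUM_OSDS, PG_NUM, the hash-based place() helper) are illustrative
assumptions only.

    # Sketch: with per-object placement, nearly every OSD pair shares
    # some object, so any double failure loses data; with PGs, only
    # ~pg_num pairs are at risk. Not Ceph code; place() stands in for CRUSH.
    import hashlib
    import random

    NUM_OSDS = 100
    NUM_OBJECTS = 100_000
    PG_NUM = 128
    REPLICAS = 2

    def place(key, num_osds, replicas):
        """Pseudorandomly pick `replicas` distinct OSDs for `key`."""
        seed = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        return frozenset(rng.sample(range(num_osds), replicas))

    def pg_of(key, pg_num):
        """Hash an object name onto a PG id."""
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big") % pg_num

    # Independent per-object placement: object_id -> osds
    per_object_pairs = {place(f"obj{i}", NUM_OSDS, REPLICAS)
                        for i in range(NUM_OBJECTS)}

    # PG-based placement: object_id -> pg -> osds
    pg_pairs = {place(f"pg{pg_of(f'obj{i}', PG_NUM)}", NUM_OSDS, REPLICAS)
                for i in range(NUM_OBJECTS)}

    total_pairs = NUM_OSDS * (NUM_OSDS - 1) // 2
    print(f"OSD pairs holding shared data, per-object: {len(per_object_pairs)} / {total_pairs}")
    print(f"OSD pairs holding shared data, per-PG:     {len(pg_pairs)} / {total_pairs}")

With 100 OSDs there are 4950 possible pairs; 100k independently placed
objects cover essentially all of them, while the PG-based run touches
at most PG_NUM pairs, which is the trade-off described above.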