Re: crush rule definitions

On 05/04/2011 08:21 PM, Sage Weil wrote:

>> does "min_size 2, max_size 2" mean that I want "2 copies of the data on each
>> host" or "2 copies of the data in total in the entire cluster"?

> Neither, actually.  It means that this rule will be used when we ask crush 
> for ruleset 0 and 2 replicas.  If you change a pg to have 3x replication, 
> ceph will ask for ruleset 0 and 3 replicas, and this rule won't be used.

In other words, the total number of replicas in the cluster is determined at
the PG level? But then how do I control which PGs are physically stored where?
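
(My working assumption, and please correct me if the syntax is off, is that the
replication factor is a per-pool setting, changed with something along the
lines of

 ceph osd pool set data size 3

where "data" and "3" are of course just an example.)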

> You probably want min_size 1 and max_size 10.

Taking what you just wrote together with a re-reading of the wiki, I must admit
that I still don't quite grasp it. The wiki says

  That is, when placing object replicas, we start at the root hierarchy, and
  choose N items of type 'device'. ('0' means to grab however many replicas.
  The rules are written to be general for some range of N, 1-10 in this case.)

What I make out of all this is that

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 0 type device
        step emit
}

means that IF the PGs are set to create anything between 1 and 10 replicas, then
the replicas should be placed directly on devices, with no limit on how many
devices may be used.

Is that correct?

My real problem is how to configure ceph to put exactly 1 replica of the data
(and metadata) on each instance of some kind of target. For example, if I have
10 racks, I want exactly 1 copy of the data in each rack, no more, no less (and
I don't care which host in that rack the data lands on). If I have 10 hosts,
I want exactly 1 copy of the data on each host (and I don't care which OSD on
that host the data lands on). If I only have 10 OSDs, I want exactly 1 copy of
the data on each and every OSD.

Assuming that the number of targets is fixed and known, what is the way to do
this?
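
My naive guess, modelled purely on the pattern of the rule above, would be
something along these lines for the rack case (I'm not at all sure the nested
choose steps are right, or that "rack" is the bucket type name in my map, and
the ruleset number 3 is made up):

rule data-per-rack {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 0 type rack
        step choose firstn 1 type device
        step emit
}

The idea being: first pick as many racks as there are requested replicas, then
pick exactly 1 device inside each of those racks. Is that even close?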


And going back to PGs, if "ceph osd dump -o -|grep pg_size" says

 pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 lpg_num 2 lpgp_num 2 last_change 66 owner 0)

and "ceph -w" says

 pg v319405: 528 pgs: 528 active+clean; 22702 MB data, 77093 MB used, 346 GB / 446 GB avail

how do the 128 PGs of "ceph osd dump" relate to the 528 PGs of "ceph -w"?
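
My current guess is that the 528 counts the PGs of all pools (plus the
localized lpg ones), not just those of 'data', and that listing every pool
line, e.g. with

 ceph osd dump -o - | grep pg_pool

would make the numbers add up; but I'd like to be sure I'm reading it right.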

*

As an aside, I think that, to a certain extent, improving the documentation could
do more for the project than improving the actual code. You guys spend a lot of
time answering the kind of questions I've been posing (and thank you for doing
so), while missing out on the debugging help you could be getting if your user
base could move past its trivial problems. If I were your scrum master, I'd
dedicate an entire sprint to the wiki alone.

Z


