Re: Verification of Crush Rules

On Tue, Jul 5, 2011 at 07:34, Mark Nigh <mnigh@xxxxxxxxxxxxxxx> wrote:
> I have created a cluster with 2 nodes and 6 osds each.
>
> I would like to verify if my pgs are being placed on the correct nodes based on my crushmap. I would like to make sure that my replication (x2, default) is not placed on the same host. Osd.0 through 5 is on host0 and osd.6 through osd.11 is on host1.
...
> The end of "ceph pg dump -o -" shows the following which doesn't look correct.
>
> osdstat kbused  kbavail kb      hb in   hb out
> 0       0       0       0       []      []
> 1       393884  2927750484      2930265540      [0,2,3,4,5]     [0,2,3,4,5]
> 2       304580  2927838884      2930265540      [0,1,3,4,5]     [0,1,3,4,5]
> 3       0       0       0       []      []
> 4       0       0       0       []      []
> 5       158908  2927983588      2930265540      [0,1,2,3,4]     [0,1,2,3,4]
> 6       496     2892788952      2894914980      []      []
> 7       504     2928139544      2930265540      []      []
> 8       504     2928139544      2930265540      []      []
> 9       504     2928139544      2930265540      []      []
> 10      504     2928139544      2930265540      []      []
> 11      504     2928139544      2930265540      []      []
>  sum    860388  26317059628     26337039300

Yes, that looks like a bunch of your osds are not doing much. Some of
them seem to be failing (6-11, see "hb in" later), and some seem to
just be getting no objects assigned to them (0, 3, 4) -- I'm not sure
whether kbavail==0 is a symptom of an actual problem, or just a
consequence of no pgs being assigned to those osds by your crushmap.

You can test a crush config by simulating the data placement:

ceph osd getcrushmap -o crushmap
crushtool -i crushmap --test

Here's example output from my two-osd test cluster with perfect balancing:

devices weights (hex): [10000,10000]
rule 0 (data), x = 0..9999
 device 0:	10000
 device 1:	10000
 num results 2:	10000
rule 1 (metadata), x = 0..9999
 device 0:	10000
 device 1:	10000
 num results 2:	10000
rule 2 (rbd), x = 0..9999
 device 0:	10000
 device 1:	10000
 num results 2:	10000

Note that this doesn't consider any of the operational aspects of the
cluster, such as nodes being unreachable, full or overloaded.
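
If the simulated placement doesn't match what you expect, you can
decompile the map, edit it, recompile, and re-run the test before
injecting it back into the cluster (the file names here are just
examples):

crushtool -d crushmap -o crushmap.txt
# edit crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test
ceph osd setcrushmap -i crushmap.new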

> Which brings me to a couple of questions.
>
> 1. what is "hb in" and "hb out"?

It's the status of the heartbeat, i.e. whether the OSDs see each other
(both incoming and outgoing, as it's not necessarily symmetric). In
your output, OSDs 0-5 see each other; nobody sees OSDs 6-11, and they
don't report seeing any other OSDs.

You said that OSDs 6-11 are on the second machine; it seems that
machine is not able to talk to the monitors.
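
If that's the case, a few quick checks run from host1 can help narrow
it down (the log path below assumes the default location; adjust for
your ceph.conf):

ceph -s              # can this host reach the monitors at all?
ceph osd dump -o -   # are osds 6-11 marked down/out?
ls /var/log/ceph/    # do the osd logs show connection or auth errors?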

> 2. The original crush map examples show "type device". The new versions of ceph show the first type as osd? I changed mine back to device, but how do you define an osd, or is this done automatically for you? By "define" I mean the section in the crush map that lists all the devices, e.g. "device 0 device0".

As far as I know, the bucket type strings are pretty much arbitrary,
and you use whatever makes sense for your deployment. For example, if
you had only one osd per host, then having both of the bucket types
host and osd would be unnecessary. The hierarchy just describes your
physical deployment, e.g. row/rack/machine/osd, so that the crush rules
can control placement, e.g. across racks.
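
For example, a two-host layout like yours could be described roughly
like this in the decompiled map (the ids, weights and names below are
illustrative, not taken from your cluster, and the leading devices
section is omitted for brevity):

# type names are arbitrary labels; only their ordering matters
type 0 osd
type 1 host
type 2 root

host host0 {
	id -2
	alg straw
	hash 0
	item osd.0 weight 1.000
	# ... osd.1 through osd.5 ...
}

host host1 {
	id -3
	alg straw
	hash 0
	item osd.6 weight 1.000
	# ... osd.7 through osd.11 ...
}

root default {
	id -1
	alg straw
	hash 0
	item host0 weight 6.000
	item host1 weight 6.000
}

rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type host
	step choose firstn 1 type osd
	step emit
}

The rule is the part that matters for your question: choosing host
buckets first, and then one osd inside each, is what keeps the two
replicas on different hosts.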

> 3. In the new default crushmap, there is a domain bucket. What is that intended for? Host?

The default crushmap, being automatically generated, has no knowledge
of the physical layout. For example, a lot of test clusters live
entirely on a single host. Hence, it uses the vague word "domain" as
the type of the upper-level bucket. If you customize the map, you know
better, and can use a more descriptive word.
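
One thing to watch out for if you do rename the type: the rules refer
to it by name, so a step like this (illustrative fragment)

 step choose firstn 0 type domain

would have to become

 step choose firstn 0 type host

or whatever name you picked, otherwise the map will likely fail to
recompile.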