Hi Paul,

What version of Ceph are you running? Your issue could be related to the choose_local_tries parameter used in earlier versions of the CRUSH mapper code.
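As a first step (a rough sketch, assuming a reasonably recent crushtool build; the exact option names can differ between releases, so check crushtool --help), you could pull the map off the cluster and simulate placements for your backup rule offline, to see whether the skew reproduces straight from the map:

# grab the compiled map and decompile it for inspection (paths are just examples)
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

# simulate mappings for ruleset 3 ("backup", 1 replica) and report how evenly
# the sample inputs land on the OSDs
crushtool --test -i /tmp/crushmap.bin --rule 3 --num-rep 1 --show-utilization

If the simulated utilization is already lopsided, the problem is in the map or the mapper rather than anything on the client side. Newer maps also expose the tunables directly in the decompiled text (e.g. a "tunable choose_local_tries 0" line near the top), which is worth checking if your build supports them.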
caleb

On Mon, Aug 6, 2012 at 3:40 PM, Paul Pettigrew <Paul.Pettigrew@xxxxxxxxxxx> wrote:
> Hi Caleb
>
> Crushmap below, thanks!
>
> Paul
>
>
> root@dsanb1-coy:~# cat crushfile.txt
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 zone
>
> # buckets
> host dsanb1-coy {
>         id -2           # do not change unnecessarily
>         # weight 11.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 2.000
>         item osd.1 weight 2.000
>         item osd.10 weight 2.000
>         item osd.2 weight 2.000
>         item osd.3 weight 2.000
>         item osd.4 weight 2.000
>         item osd.5 weight 2.000
>         item osd.6 weight 2.000
>         item osd.7 weight 2.000
>         item osd.8 weight 2.000
>         item osd.9 weight 2.000
> }
> host dsanb2-coy {
>         id -4           # do not change unnecessarily
>         # weight 6.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.11 weight 1.000
>         item osd.12 weight 1.000
>         item osd.13 weight 1.000
>         item osd.14 weight 1.000
>         item osd.15 weight 1.000
>         item osd.16 weight 1.000
> }
> host dsanb3-coy {
>         id -5           # do not change unnecessarily
>         # weight 6.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.17 weight 1.000
>         item osd.18 weight 1.000
>         item osd.19 weight 1.000
>         item osd.20 weight 1.000
>         item osd.21 weight 1.000
>         item osd.22 weight 1.000
> }
> rack 2nrack {
>         id -3           # do not change unnecessarily
>         # weight 23.000
>         alg straw
>         hash 0  # rjenkins1
>         item dsanb1-coy weight 11.000
>         item dsanb2-coy weight 6.000
>         item dsanb3-coy weight 6.000
> }
> zone default {
>         id -1           # do not change unnecessarily
>         # weight 23.000
>         alg straw
>         hash 0  # rjenkins1
>         item 2nrack weight 23.000
> }
> rack 1nrack {
>         id -6           # do not change unnecessarily
>         # weight 11.000
>         alg straw
>         hash 0  # rjenkins1
>         item dsanb1-coy weight 11.000
> }
> zone bak {
>         id -7           # do not change unnecessarily
>         # weight 23.000
>         alg straw
>         hash 0  # rjenkins1
>         item 1nrack weight 23.000
> }
>
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule backup {
>         ruleset 3
>         type replicated
>         min_size 1
>         max_size 10
>         step take bak
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> # end crush map
>
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Caleb Miles
> Sent: Tuesday, 7 August 2012 6:09 AM
> To: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Crush not delivering data uniformly -> HEALTH_ERR full osd
>
> Hello Paul,
>
> Could you post your CRUSH map (crushtool -d <CRUSH_MAP>)?
>
> caleb
>
> On Mon, Aug 6, 2012 at 1:01 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
>>
>> ---------- Forwarded message ----------
>> From: Paul Pettigrew <Paul.Pettigrew@xxxxxxxxxxx>
>> Date: Sun, Aug 5, 2012 at 8:08 PM
>> Subject: RE: Crush not delivering data uniformly -> HEALTH_ERR full osd
>> To: Yehuda Sadeh <yehuda@xxxxxxxxxxx>
>> Cc: "ceph-devel@xxxxxxxxxxxxxxx" <ceph-devel@xxxxxxxxxxxxxxx>
>>
>>
>> Hi Yehuda, we have:
>>
>> root@dsanb1-coy:/mnt/ceph# ceph osd dump | grep ^pool
>> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0 crash_replay_interval 45
>> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
>> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
>> pool 3 'backup' rep size 1 crush_ruleset 3 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
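A quick back-of-envelope check on those numbers, for reference: every pool has 1472 PGs, and the backup rule (ruleset 3) only draws from the 11 OSDs under dsanb1-coy, so with even placement each of those OSDs would hold roughly 1472 / 11 ≈ 134 PGs of pool 3, i.e. about 7.6 TB / 11 ≈ 690 GB each (roughly a third of each 1.9 TB disk), rather than the 1% / 48% / 96% split reported below.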
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Yehuda Sadeh
>> Sent: Monday, 6 August 2012 11:16 AM
>> To: Paul Pettigrew
>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>> Subject: Re: Crush not delivering data uniformly -> HEALTH_ERR full osd
>>
>> On Sun, Aug 5, 2012 at 5:16 PM, Paul Pettigrew <Paul.Pettigrew@xxxxxxxxxxx> wrote:
>> >
>> > Hi Ceph community
>> >
>> > We are at the stage of performance and capacity testing, where significant amounts of backup data are being written to Ceph.
>> >
>> > The issue we have is that the underlying HDDs are not being populated (roughly) uniformly, and our Ceph system hits a brick wall: after a couple of days our 30TB storage system is no longer able to operate, having stored only ~7TB.
>> >
>> > Basically, despite the HDDs (1:1 ratio between OSD and HDD) all being the same size and having the same weighting in the crushmap, each disk is either:
>> > a) using 1% of its space;
>> > b) using 48%; or
>> > c) using 96%.
>> > Too precise a split to be an accident. See below for more detail (osd11-22 are not expected to get data, per our crushmap):
>> >
>> >
>> > ceph pg dump
>> > <snip>
>> > pool 0   2442     0  0  0  10240000000    7302520   7302520
>> > pool 1   57       0  0  0  127824767      5603518   5603518
>> > pool 2   0        0  0  0  0              0         0
>> > pool 3   1808757  0  0  0  7584377697985  1104048   1104048
>> > sum      1811256  0  0  0  7594745522752  14010086  14010086
>> > osdstat  kbused      kbavail      kb           hb in  hb out
>> > 0        930606904   1021178408   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 1        1874428     1949525164   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 2        928811428   1022963676   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 3        929733676   1022051996   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 4        1719124     1949678844   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 5        1853452     1949545892   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 6        930979476   1020807132   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 7        1808968     1949590496   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 8        934035924   1017759100   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 9        1855955384  94927432     1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 10       933572004   1018232340   1953514584   [11,12,13,14,15,16,17,18,19,20,21,22]  []
>> > 11       2057096     953060760    957230808    [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
>> > 12       2053512     953064656    957230808    [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
>> > 13       2148732     972501316    976762584    [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
>> > 14       2064640     972585104    976762584    [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
>> > 15       1945388     972703468    976762584    [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21]  []
>> > 16       2051708     972599412    976762584    [0,1,2,3,4,6,7,8,9,10,17,18,19,20,21]  []
>> > 17       2137632     952980216    957230808    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > 18       2000124     953117508    957230808    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > 19       2095124     972554492    976762584    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > 20       1986800     972662640    976762584    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > 21       2035204     972615332    976762584    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > 22       1961412     972687788    976762584    [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
>> > sum      7475488140  25609393172  33131684328
>> >
>> > 2012-08-06 10:03:58.964716 7f06783bb700  0 -- 10.32.0.10:0/15147 send_keepalive con 0x223f690, no pipe.
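To make an osdstat block like the one above easier to read, the kbused/kb columns can be turned into a rough percent-used figure with something along these lines (a sketch only; the column layout of "ceph pg dump" varies between releases, so sanity-check it against your own output first):

# list OSD ids with their approximate utilisation, fullest first
ceph pg dump | awk '$1 ~ /^[0-9]+$/ && NF >= 4 { printf "osd.%s  %.1f%% used\n", $1, 100 * $2 / $4 }' | sort -k2 -rn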
>> >
>> >
>> > root@dsanb1-coy:~# df -h
>> > Filesystem                               Size  Used  Avail  Use%  Mounted on
>> > /dev/md0                                 462G   12G   446G    3%  /
>> > udev                                      12G  4.0K    12G    1%  /dev
>> > tmpfs                                    4.8G  448K   4.8G    1%  /run
>> > none                                     5.0M     0   5.0M    0%  /run/lock
>> > none                                      12G     0    12G    0%  /run/shm
>> > /dev/sdc                                 1.9T  888G   974G   48%  /ceph-data/osd.0
>> > /dev/sdd                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.1
>> > /dev/sdp                                 1.9T  891G   972G   48%  /ceph-data/osd.10
>> > /dev/sde                                 1.9T  886G   976G   48%  /ceph-data/osd.2
>> > /dev/sdf                                 1.9T  887G   975G   48%  /ceph-data/osd.3
>> > /dev/sdg                                 1.9T  1.7G   1.9T    1%  /ceph-data/osd.4
>> > /dev/sdh                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.5
>> > /dev/sdi                                 1.9T  888G   974G   48%  /ceph-data/osd.6
>> > /dev/sdm                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.7
>> > /dev/sdn                                 1.9T  891G   971G   48%  /ceph-data/osd.8
>> > /dev/sdo                                 1.9T  1.8T    91G   96%  /ceph-data/osd.9
>> > 10.32.0.10,10.32.0.25,10.32.0.11:6789:/   31T  7.1T    24T   23%  /mnt/ceph
>> >
>> >
>> > We are writing via fstab-based cephfs mounts, and the data above is going to pool 3, a "backup" pool where we are testing a replication level of 1x only. That should not have any effect, though? The output below illustrates the layout we are using (per our testing design, the data-writing issue above only involves the first node):
>> >
>> > root@dsanb1-coy:~# ceph osd tree
>> > dumped osdmap tree epoch 136
>> > # id    weight  type name              up/down  reweight
>> > -7      23      zone bak
>> > -6      23          rack 1nrack
>> > -2      11              host dsanb1-coy
>> > 0       2                   osd.0      up       1
>> > 1       2                   osd.1      up       1
>> > 10      2                   osd.10     up       1
>> > 2       2                   osd.2      up       1
>> > 3       2                   osd.3      up       1
>> > 4       2                   osd.4      up       1
>> > 5       2                   osd.5      up       1
>> > 6       2                   osd.6      up       1
>> > 7       2                   osd.7      up       1
>> > 8       2                   osd.8      up       1
>> > 9       2                   osd.9      up       1
>> > -1      23      zone default
>> > -3      23          rack 2nrack
>> > -2      11              host dsanb1-coy
>> > 0       2                   osd.0      up       1
>> > 1       2                   osd.1      up       1
>> > 10      2                   osd.10     up       1
>> > 2       2                   osd.2      up       1
>> > 3       2                   osd.3      up       1
>> > 4       2                   osd.4      up       1
>> > 5       2                   osd.5      up       1
>> > 6       2                   osd.6      up       1
>> > 7       2                   osd.7      up       1
>> > 8       2                   osd.8      up       1
>> > 9       2                   osd.9      up       1
>> > -4      6               host dsanb2-coy
>> > 11      1                   osd.11     up       1
>> > 12      1                   osd.12     up       1
>> > 13      1                   osd.13     up       1
>> > 14      1                   osd.14     up       1
>> > 15      1                   osd.15     up       1
>> > 16      1                   osd.16     up       1
>> > -5      6               host dsanb3-coy
>> > 17      1                   osd.17     up       1
>> > 18      1                   osd.18     up       1
>> > 19      1                   osd.19     up       1
>> > 20      1                   osd.20     up       1
>> > 21      1                   osd.21     up       1
>> > 22      1                   osd.22     up       1
>> >
>> >
>> > Has anybody got any suggestions?
>> >
>>
>> How many pgs per pool do you have? Specifically:
>> $ ceph osd dump | grep ^pool
>>
>> Thanks,
>> Yehuda
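One stop-gap worth considering while the underlying distribution problem is chased down (not something discussed above, just a sketch of a common workaround): temporarily lowering the override weight on the nearly full osd.9 should push some of its PGs onto the emptier OSDs in the same host and buy some headroom.

# the override weight is a 0.0-1.0 factor applied on top of the CRUSH weight in the map
ceph osd reweight 9 0.8

# watch the resulting data movement and PG states
ceph -w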