Hello Paul,

Could you post your CRUSH map?

crushtool -d <CRUSH_MAP>
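
If it is easier, the compiled map can be pulled straight off the cluster
and decompiled to text in one go, e.g. (the file names here are just
placeholders):

$ ceph osd getcrushmap -o /tmp/crushmap.bin
$ crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

The interesting part will be the rules section, in particular whichever
rule crush_ruleset 3 (your 'backup' pool) points at.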

caleb

On Mon, Aug 6, 2012 at 1:01 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
>
> ---------- Forwarded message ----------
> From: Paul Pettigrew <Paul.Pettigrew@xxxxxxxxxxx>
> Date: Sun, Aug 5, 2012 at 8:08 PM
> Subject: RE: Crush not delivering data uniformly -> HEALTH_ERR full osd
> To: Yehuda Sadeh <yehuda@xxxxxxxxxxx>
> Cc: "ceph-devel@xxxxxxxxxxxxxxx" <ceph-devel@xxxxxxxxxxxxxxx>
>
>
> Hi Yehuda, we have:
>
> root@dsanb1-coy:/mnt/ceph# ceph osd dump | grep ^pool
> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
> pool 3 'backup' rep size 1 crush_ruleset 3 object_hash rjenkins pg_num 1472 pgp_num 1472 last_change 1 owner 0
>
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Yehuda Sadeh
> Sent: Monday, 6 August 2012 11:16 AM
> To: Paul Pettigrew
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Crush not delivering data uniformly -> HEALTH_ERR full osd
>
> On Sun, Aug 5, 2012 at 5:16 PM, Paul Pettigrew
> <Paul.Pettigrew@xxxxxxxxxxx> wrote:
> >
> > Hi Ceph community
> >
> > We are at the stage of performance capacity testing, where significant
> > amounts of backup data are being written to Ceph.
> >
> > The issue we have is that the underlying HDDs are not being populated
> > (roughly) uniformly, and our Ceph system hits a brick wall: after a
> > couple of days our 30TB storage system is no longer able to operate,
> > having stored only ~7TB.
> >
> > Basically, despite the HDDs (1:1 ratio between OSD and HDD) all being
> > the same storage size and weighting in the crushmap, we have disks
> > either:
> > a) using 1% space;
> > b) using 48%; or
> > c) using 96%
> >
> > Too precise a split to be an accident. See below for more detail
> > (osd11-22 are not expected to get data, per our crushmap):
> >
> >
> > ceph pg dump
> > <snip>
> > pool 0     2442  0  0  0    10240000000   7302520   7302520
> > pool 1       57  0  0  0      127824767   5603518   5603518
> > pool 2        0  0  0  0              0         0         0
> > pool 3  1808757  0  0  0  7584377697985   1104048   1104048
> > sum     1811256  0  0  0  7594745522752  14010086  14010086
> > osdstat  kbused      kbavail      kb          hb in  hb out
> > 0        930606904   1021178408   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 1        1874428     1949525164   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 2        928811428   1022963676   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 3        929733676   1022051996   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 4        1719124     1949678844   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 5        1853452     1949545892   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 6        930979476   1020807132   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 7        1808968     1949590496   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 8        934035924   1017759100   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 9        1855955384  94927432     1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 10       933572004   1018232340   1953514584  [11,12,13,14,15,16,17,18,19,20,21,22]  []
> > 11       2057096     953060760    957230808   [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
> > 12       2053512     953064656    957230808   [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
> > 13       2148732     972501316    976762584   [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
> > 14       2064640     972585104    976762584   [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21,22]  []
> > 15       1945388     972703468    976762584   [0,1,2,3,4,5,6,7,8,9,10,17,18,19,20,21]  []
> > 16       2051708     972599412    976762584   [0,1,2,3,4,6,7,8,9,10,17,18,19,20,21]  []
> > 17       2137632     952980216    957230808   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > 18       2000124     953117508    957230808   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > 19       2095124     972554492    976762584   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > 20       1986800     972662640    976762584   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > 21       2035204     972615332    976762584   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > 22       1961412     972687788    976762584   [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]  []
> > sum      7475488140  25609393172  33131684328
> >
> > 2012-08-06 10:03:58.964716 7f06783bb700  0 -- 10.32.0.10:0/15147
> > send_keepalive con 0x223f690, no pipe.
> >
> >
> > root@dsanb1-coy:~# df -h
> > Filesystem                               Size  Used  Avail  Use%  Mounted on
> > /dev/md0                                 462G   12G   446G    3%  /
> > udev                                      12G  4.0K    12G    1%  /dev
> > tmpfs                                    4.8G  448K   4.8G    1%  /run
> > none                                     5.0M     0   5.0M    0%  /run/lock
> > none                                      12G     0    12G    0%  /run/shm
> > /dev/sdc                                 1.9T  888G   974G   48%  /ceph-data/osd.0
> > /dev/sdd                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.1
> > /dev/sdp                                 1.9T  891G   972G   48%  /ceph-data/osd.10
> > /dev/sde                                 1.9T  886G   976G   48%  /ceph-data/osd.2
> > /dev/sdf                                 1.9T  887G   975G   48%  /ceph-data/osd.3
> > /dev/sdg                                 1.9T  1.7G   1.9T    1%  /ceph-data/osd.4
> > /dev/sdh                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.5
> > /dev/sdi                                 1.9T  888G   974G   48%  /ceph-data/osd.6
> > /dev/sdm                                 1.9T  1.8G   1.9T    1%  /ceph-data/osd.7
> > /dev/sdn                                 1.9T  891G   971G   48%  /ceph-data/osd.8
> > /dev/sdo                                 1.9T  1.8T    91G   96%  /ceph-data/osd.9
> > 10.32.0.10,10.32.0.25,10.32.0.11:6789:/   31T  7.1T    24T   23%  /mnt/ceph
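
A quick cross-check on the numbers above: counting how often each OSD shows
up in the pg mappings for pool 3 should show the same skew as df. This is
only a rough sketch (it grabs every bracketed OSD list on the pool 3 lines,
so each pg is counted once for its "up" set and once for its "acting" set,
which should be identical with rep size 1, and the column layout of pg dump
can differ between versions):

$ ceph pg dump | grep '^3\.' | grep -o '\[[0-9,]*\]' |
      tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn

OSDs that barely appear in that output, yet still sit under the root the
backup rule takes, are the ones to look at once the map is posted.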

> >
> >
> > We are writing via fstab-based cephfs mounts, and the above is going
> > to pool 3, which is a "backup" pool where we are testing a replication
> > level of 1x only. This should not have any effect though? Below will
> > illustrate the layout we are using (the data-writing issue above only
> > involves the first node, per our testing design):
> >
> > root@dsanb1-coy:~# ceph osd tree
> > dumped osdmap tree epoch 136
> > # id    weight  type name               up/down reweight
> > -7      23      zone bak
> > -6      23              rack 1nrack
> > -2      11                      host dsanb1-coy
> > 0       2                              osd.0    up      1
> > 1       2                              osd.1    up      1
> > 10      2                              osd.10   up      1
> > 2       2                              osd.2    up      1
> > 3       2                              osd.3    up      1
> > 4       2                              osd.4    up      1
> > 5       2                              osd.5    up      1
> > 6       2                              osd.6    up      1
> > 7       2                              osd.7    up      1
> > 8       2                              osd.8    up      1
> > 9       2                              osd.9    up      1
> > -1      23      zone default
> > -3      23              rack 2nrack
> > -2      11                      host dsanb1-coy
> > 0       2                              osd.0    up      1
> > 1       2                              osd.1    up      1
> > 10      2                              osd.10   up      1
> > 2       2                              osd.2    up      1
> > 3       2                              osd.3    up      1
> > 4       2                              osd.4    up      1
> > 5       2                              osd.5    up      1
> > 6       2                              osd.6    up      1
> > 7       2                              osd.7    up      1
> > 8       2                              osd.8    up      1
> > 9       2                              osd.9    up      1
> > -4      6                       host dsanb2-coy
> > 11      1                              osd.11   up      1
> > 12      1                              osd.12   up      1
> > 13      1                              osd.13   up      1
> > 14      1                              osd.14   up      1
> > 15      1                              osd.15   up      1
> > 16      1                              osd.16   up      1
> > -5      6                       host dsanb3-coy
> > 17      1                              osd.17   up      1
> > 18      1                              osd.18   up      1
> > 19      1                              osd.19   up      1
> > 20      1                              osd.20   up      1
> > 21      1                              osd.21   up      1
> > 22      1                              osd.22   up      1
> >
> >
> > Has anybody got any suggestions?
> >
>
> How many pgs per pool do you have? Specifically:
> $ ceph osd dump | grep ^pool
>
> Thanks,
> Yehuda
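
P.S. For reference, a decompiled rule looks roughly like the block below.
This one is made up purely for illustration (it is not taken from your
cluster); the things to check against your osd tree are which zone/root the
"step take" line starts from, and whether the final step chooses across
hosts or across individual osds:

# hypothetical example for illustration; not the actual map
rule backup {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take bak
        step chooseleaf firstn 0 type host
        step emit
}

Also worth a look: in the osd tree you posted, the same host dsanb1-coy
(osd.0-10, weight 11) appears under both zone bak and zone default, so
whichever root the backup rule takes, it is worth confirming that the
weights under that root match where you expect the data to land.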