Re: cephfs full, 2/3 Raw capacity used

Paul Emmerich <paul.emmerich@xxxxxxxx> · Mon, 26 Aug 2019 12:38:39 +0200

Unfortunately, the balancer does not work well when you have erasure
coding profiles with a large k+m and relatively few servers. Some
manual balancing will be required.

Paul

-- 
Paul Emmerich

On Mon, Aug 26, 2019 at 12:33 PM Simon Oosthoek
<s.oosthoek@xxxxxxxxxxxxx> wrote:
>
> On 26-08-19 12:00, EDH - Manuel Rios Fernandez wrote:
> > The balancer only balances while the cluster is healthy.
> >
> > The problem is that data is not balanced when it is first written,
> > which leaves it unevenly distributed across the OSDs.
>
> I suppose the crush algorithm doesn't take the fullness of the osds into
> account when placing objects...
>
> >
> > This problem only happens in Ceph. We see the same thing with 14.2.2 and
> > have to change the weights manually, because the balancer is a passive
> > element of the cluster.
> >
> > I hope the next version brings a more aggressive balancer, like enterprise
> > storage systems that can be filled up to 95% of raw capacity.
>
> I'm thinking a cronjob with a script to parse the output of `ceph osd df
> tree` and reweight according to the percentage used would be relatively
> easy to write. But I'll concentrate on monitoring before I start
> tweaking there ;-)
>
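A rough Python sketch of the script described above. It is not an existing tool: the function name, the 1.2x threshold, the 0.05 step and the 0.5 floor are made up for illustration, and the JSON field names (`nodes`, `utilization`, `reweight`, `summary.average_utilization`) are assumed to match what `ceph osd df --format json` prints, so verify them on your own cluster first. It only builds the commands and does not run them.

```python
def plan_reweights(osd_df, threshold=1.2, step=0.05, floor=0.5):
    """Build `ceph osd reweight` commands for over-full OSDs.

    osd_df is the parsed JSON of `ceph osd df --format json`
    (field names are an assumption, check your own output).
    An OSD is treated as over-full when its utilization exceeds
    threshold * cluster average; its reweight is lowered by one step.
    """
    average = osd_df["summary"]["average_utilization"]
    commands = []
    for node in osd_df["nodes"]:
        # `ceph osd df tree` also lists hosts and roots; skip those.
        if node.get("type", "osd") != "osd":
            continue
        if node["utilization"] > threshold * average:
            new_weight = max(floor, round(node["reweight"] - step, 2))
            commands.append(
                f"ceph osd reweight osd.{node['id']} {new_weight:.2f}"
            )
    return commands


# Two OSDs taken from the listing in this thread; the average is the
# %RAW USED value from `ceph df`.
sample = {
    "summary": {"average_utilization": 66.71},
    "nodes": [
        {"id": 81, "type": "osd", "reweight": 1.0, "utilization": 48.40},
        {"id": 10, "type": "osd", "reweight": 1.0, "utilization": 95.01},
    ],
}

for command in plan_reweights(sample):
    print(command)
```

Small steps matter here, since every reweight triggers data movement. Ceph also ships `ceph osd test-reweight-by-utilization` as a built-in dry run of the same idea, which is probably worth trying before a home-grown cronjob.
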
> Cheers
>
> /Simon
>
> >
> > Regards
> >
> >
> > -----Original Message-----
> > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On behalf of Simon
> > Oosthoek
> > Sent: Monday, 26 August 2019 11:52
> > To: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > CC: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: cephfs full, 2/3 Raw capacity used
> >
> > On 26-08-19 11:37, Dan van der Ster wrote:
> >> Thanks. The version and balancer config look good.
> >>
> >> So you can try `ceph osd reweight osd.10 0.8` to see if it helps to
> >> get you out of this.
> >
> > I've done this and the next fullest 3 osds. This will take some time to
> > recover, I'll let you know when it's done.
> >
> > Thanks,
> >
> > /simon
> >
> >>
> >> -- dan
> >>
> >> On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek
> >> <s.oosthoek@xxxxxxxxxxxxx> wrote:
> >>>
> >>> On 26-08-19 11:16, Dan van der Ster wrote:
> >>>> Hi,
> >>>>
> >>>> Which version of ceph are you using? Which balancer mode?
> >>>
> >>> Nautilus (14.2.2), balancer is in upmap mode.
> >>>
> >>>> The balancer score isn't a percent-error or anything humanly usable.
> >>>> `ceph osd df tree` can better show you exactly which osds are
> >>>> over/under utilized and by how much.
> >>>>
> >>>
> >>> Aha, I ran this and sorted on the %full column:
> >>>
> >>>  81   hdd   10.81149  1.00000  11 TiB 5.2 TiB 5.1 TiB   4 KiB  14 GiB 5.6 TiB 48.40 0.73  96     up   osd.81
> >>>  48   hdd   10.81149  1.00000  11 TiB 5.3 TiB 5.2 TiB  15 KiB  14 GiB 5.5 TiB 49.08 0.74  95     up   osd.48
> >>> 154   hdd   10.81149  1.00000  11 TiB 5.5 TiB 5.4 TiB 2.6 GiB  15 GiB 5.3 TiB 50.95 0.76  96     up   osd.154
> >>> 129   hdd   10.81149  1.00000  11 TiB 5.5 TiB 5.4 TiB 5.1 GiB  16 GiB 5.3 TiB 51.33 0.77  96     up   osd.129
> >>>  42   hdd   10.81149  1.00000  11 TiB 5.6 TiB 5.5 TiB 2.6 GiB  14 GiB 5.2 TiB 51.81 0.78  96     up   osd.42
> >>> 122   hdd   10.81149  1.00000  11 TiB 5.7 TiB 5.6 TiB  16 KiB  14 GiB 5.1 TiB 52.47 0.79  96     up   osd.122
> >>> 120   hdd   10.81149  1.00000  11 TiB 5.7 TiB 5.6 TiB 2.6 GiB  15 GiB 5.1 TiB 52.92 0.79  95     up   osd.120
> >>>  96   hdd   10.81149  1.00000  11 TiB 5.8 TiB 5.7 TiB 2.6 GiB  15 GiB 5.0 TiB 53.58 0.80  96     up   osd.96
> >>>  26   hdd   10.81149  1.00000  11 TiB 5.8 TiB 5.7 TiB  20 KiB  15 GiB 5.0 TiB 53.68 0.80  97     up   osd.26
> >>> ...
> >>>   6   hdd   10.81149  1.00000  11 TiB 8.3 TiB 8.2 TiB  88 KiB  18 GiB 2.5 TiB 77.14 1.16  96     up   osd.6
> >>>  16   hdd   10.81149  1.00000  11 TiB 8.4 TiB 8.3 TiB  28 KiB  18 GiB 2.4 TiB 77.56 1.16  95     up   osd.16
> >>>   0   hdd   10.81149  1.00000  11 TiB 8.6 TiB 8.4 TiB  48 KiB  17 GiB 2.2 TiB 79.24 1.19  96     up   osd.0
> >>> 144   hdd   10.81149  1.00000  11 TiB 8.6 TiB 8.5 TiB 2.6 GiB  18 GiB 2.2 TiB 79.57 1.19  95     up   osd.144
> >>> 136   hdd   10.81149  1.00000  11 TiB 8.6 TiB 8.5 TiB  48 KiB  17 GiB 2.2 TiB 79.60 1.19  95     up   osd.136
> >>>  63   hdd   10.81149  1.00000  11 TiB 8.6 TiB 8.5 TiB 2.6 GiB  17 GiB 2.2 TiB 79.60 1.19  95     up   osd.63
> >>> 155   hdd   10.81149  1.00000  11 TiB 8.6 TiB 8.5 TiB   8 KiB  19 GiB 2.2 TiB 79.85 1.20  95     up   osd.155
> >>>  89   hdd   10.81149  1.00000  11 TiB 8.7 TiB 8.5 TiB  12 KiB  20 GiB 2.2 TiB 80.04 1.20  96     up   osd.89
> >>> 106   hdd   10.81149  1.00000  11 TiB 8.8 TiB 8.7 TiB  64 KiB  19 GiB 2.0 TiB 81.38 1.22  96     up   osd.106
> >>>  94   hdd   10.81149  1.00000  11 TiB 9.0 TiB 8.9 TiB     0 B  19 GiB 1.8 TiB 83.53 1.25  96     up   osd.94
> >>>  33   hdd   10.81149  1.00000  11 TiB 9.1 TiB 9.0 TiB  44 KiB  19 GiB 1.7 TiB 84.40 1.27  96     up   osd.33
> >>>  15   hdd   10.81149  1.00000  11 TiB  10 TiB 9.8 TiB  16 KiB  20 GiB 877 GiB 92.08 1.38  96     up   osd.15
> >>>  53   hdd   10.81149  1.00000  11 TiB  10 TiB  10 TiB 2.6 GiB  20 GiB 676 GiB 93.90 1.41  96     up   osd.53
> >>>  51   hdd   10.81149  1.00000  11 TiB  10 TiB  10 TiB 2.6 GiB  20 GiB 666 GiB 93.98 1.41  96     up   osd.51
> >>>  10   hdd   10.81149  1.00000  11 TiB  10 TiB  10 TiB  40 KiB  22 GiB 552 GiB 95.01 1.42  97     up   osd.10
> >>>
> >>> So the fullest one is at 95.01%, the emptiest one at 48.4%, so
> >>> there's some balancing to be done.
> >>>
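A quick check on the numbers above, as a small Python sketch. The VAR column in `ceph osd df` is each OSD's utilization divided by the cluster average, so dividing %USE by VAR should give back roughly the same average for every row. The three rows are copied from the listing; the function name is only illustrative.

```python
# (%USE, VAR) pairs copied from the `ceph osd df tree` listing in this thread.
rows = {
    "osd.81": (48.40, 0.73),
    "osd.33": (84.40, 1.27),
    "osd.10": (95.01, 1.42),
}


def implied_average(pct_use, var):
    """Cluster-average utilization implied by one row (VAR = %USE / average)."""
    return pct_use / var


for name, (pct_use, var) in rows.items():
    print(f"{name}: implied cluster average {implied_average(pct_use, var):.1f}%")
```

Every row lands within about one percentage point of the 66.71 %RAW USED that `ceph df` reports. That suggests the "2/3" figure is not a cephfs parameter. It is simply the cluster average at the moment the fullest OSD, which sits at 1.42 times that average, reached 95%.
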
> >>>> You might be able to manually fix things by using `ceph osd reweight
> >>>> ...` on the most full osds to move data elsewhere.
> >>>
> >>> I'll look into this, but I was hoping that the balancer module would
> >>> take care of this...
> >>>
> >>>>
> >>>> Otherwise, in general, it's good to set up monitoring so you notice
> >>>> and take action well before the osds fill up.
> >>>
> >>> Yes, I'm still working on this, I want to add some checks to our
> >>> check_mk+icinga setup using native plugins, but my python skills are
> >>> not quite up to the task, at least, not yet ;-)
> >>>
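For the monitoring side, a minimal sketch of what such a check could look like in Python. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL). Everything else is an assumption for illustration: the function name, the 80/90 thresholds, and the idea of feeding it a name-to-%USE mapping taken from `ceph osd df`.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios/Icinga plugin exit codes


def check_osd_fullness(utilization, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) based on the fullest OSD.

    utilization maps OSD names to their %USE value. Thresholds are
    illustrative and should sit below the cluster's own full ratios.
    """
    worst_name, worst_pct = max(utilization.items(), key=lambda kv: kv[1])
    if worst_pct >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif worst_pct >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    line = (
        f"{label} - fullest OSD {worst_name} at {worst_pct:.2f}% "
        f"(warn {warn:g}%, crit {crit:g}%)"
    )
    return state, line


# Values taken from the listing in this thread.
state, line = check_osd_fullness({"osd.81": 48.40, "osd.89": 80.04, "osd.10": 95.01})
print(state, line)
```

A real plugin would print the status line and pass the code to `sys.exit()`. Alerting on the fullest OSD rather than on the cluster average is the point here, since the average was only 67% when the pools went full.
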
> >>> Cheers
> >>>
> >>> /Simon
> >>>
> >>>>
> >>>> Cheers, Dan
> >>>>
> >>>> On Mon, Aug 26, 2019 at 11:09 AM Simon Oosthoek
> >>>> <s.oosthoek@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> we're building up our experience with our ceph cluster before we
> >>>>> take it into production. I've now tried to fill up the cluster with
> >>>>> cephfs, which we plan to use for about 95% of all data on the cluster.
> >>>>>
> >>>>> The cephfs pools are full when the cluster reports 67% raw capacity
> >>>>> used. There are 4 pools we use for cephfs data, 3-copy, 4-copy, EC
> >>>>> 8+3 and EC 5+7. The balancer module is turned on and `ceph balancer
> >>>>> eval` gives `current cluster score 0.013255 (lower is better)`, so
> >>>>> well within the default 5% margin. Is there a setting we can tweak
> >>>>> to increase the usable RAW capacity to say 85% or 90%, or is this
> >>>>> the most we can expect to store on the cluster?
> >>>>>
> >>>>> [root@cephmon1 ~]# ceph df
> >>>>> RAW STORAGE:
> >>>>>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
> >>>>>     hdd       1.8 PiB     605 TiB     1.2 PiB      1.2 PiB         66.71
> >>>>>     TOTAL     1.8 PiB     605 TiB     1.2 PiB      1.2 PiB         66.71
> >>>>>
> >>>>> POOLS:
> >>>>>     POOL                    ID     STORED      OBJECTS     USED        %USED      MAX AVAIL
> >>>>>     cephfs_data              1     111 MiB      79.26M     1.2 GiB     100.00           0 B
> >>>>>     cephfs_metadata          2      52 GiB       4.91M      52 GiB     100.00           0 B
> >>>>>     cephfs_data_4copy        3     106 TiB      46.36M     428 TiB     100.00           0 B
> >>>>>     cephfs_data_3copy        8      93 TiB      42.08M     282 TiB     100.00           0 B
> >>>>>     cephfs_data_ec83        13     106 TiB      50.11M     161 TiB     100.00           0 B
> >>>>>     rbd                     14      21 GiB       5.62k      63 GiB     100.00           0 B
> >>>>>     .rgw.root               15     1.2 KiB           4       1 MiB     100.00           0 B
> >>>>>     default.rgw.control     16         0 B           8         0 B          0           0 B
> >>>>>     default.rgw.meta        17       765 B           4       1 MiB     100.00           0 B
> >>>>>     default.rgw.log         18         0 B         207         0 B          0           0 B
> >>>>>     scbench                 19     133 GiB      34.14k     400 GiB     100.00           0 B
> >>>>>     cephfs_data_ec57        20     126 TiB      51.84M     320 TiB     100.00           0 B
> >>>>> [root@cephmon1 ~]# ceph balancer eval
> >>>>> current cluster score 0.013255 (lower is better)
> >>>>>
> >>>>>
> >>>>> Being full at 2/3 Raw used is a bit too "pretty" to be accidental,
> >>>>> it seems like this could be a parameter for cephfs, however, I
> >>>>> couldn't find anything like this in the documentation for Nautilus.
> >>>>>
> >>>>>
> >>>>> The logs in the dashboard show this:
> >>>>> 2019-08-26 11:00:00.000630 [ERR] overall HEALTH_ERR 3 backfillfull osd(s); 1 full osd(s); 12 pool(s) full
> >>>>> 2019-08-26 10:57:44.539964 [INF] Health check cleared: POOL_BACKFILLFULL (was: 12 pool(s) backfillfull)
> >>>>> 2019-08-26 10:57:44.539944 [WRN] Health check failed: 12 pool(s) full (POOL_FULL)
> >>>>> 2019-08-26 10:57:44.539926 [ERR] Health check failed: 1 full osd(s) (OSD_FULL)
> >>>>> 2019-08-26 10:57:44.539899 [WRN] Health check update: 3 backfillfull osd(s) (OSD_BACKFILLFULL)
> >>>>> 2019-08-26 10:00:00.000088 [WRN] overall HEALTH_WARN 4 backfillfull osd(s); 12 pool(s) backfillfull
> >>>>>
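The health messages above line up with the per-OSD numbers posted later in the thread. A small Python sketch, assuming the cluster still uses the default ratios (nearfull 0.85, backfillfull 0.90, full 0.95; `ceph osd dump | grep ratio` shows the actual values) and assuming a simple "at or above" comparison, which may differ in detail from what Ceph does internally:

```python
from collections import Counter

# Default Ceph ratios expressed as percentages (assumed unchanged here).
FULL, BACKFILLFULL, NEARFULL = 95.0, 90.0, 85.0

# %USE of the five fullest OSDs from the listing in this thread.
top_osds = {
    "osd.33": 84.40,
    "osd.15": 92.08,
    "osd.53": 93.90,
    "osd.51": 93.98,
    "osd.10": 95.01,
}


def classify(pct_use):
    """Map a %USE value to the highest threshold it has crossed."""
    if pct_use >= FULL:
        return "full"
    if pct_use >= BACKFILLFULL:
        return "backfillfull"
    if pct_use >= NEARFULL:
        return "nearfull"
    return "ok"


counts = Counter(classify(pct) for pct in top_osds.values())
print(dict(counts))
```

With these inputs one OSD counts as full and three as backfillfull, matching the HEALTH_ERR line. It also fits the earlier "4 backfillfull" warning, if osd.10 was still just under 95% at that point.
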
> >>>>> So it seems that ceph is completely stuck at 2/3 full, while we
> >>>>> anticipated being able to fill up the cluster to at least 85-90% of
> >>>>> the raw capacity. Or at least so that we would keep a functioning
> >>>>> cluster when we have a single osd node fail.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> /Simon
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com