On 8/26/19 12:33 PM, Simon Oosthoek wrote:
> On 26-08-19 12:00, EDH - Manuel Rios Fernandez wrote:
>> The balancer only balances when the cluster is healthy.
>>
>> The problem is that data is not balanced on its first write, which
>> leaves the data improperly balanced across the OSDs.
>
> I suppose the CRUSH algorithm doesn't take the fullness of the OSDs into
> account when placing objects...
>

No, it doesn't. Objects are allocated to a Placement Group based on a hash
of their name and the number of PGs in that pool. There is no database
that records where objects live; clients (librados) calculate the location
from the object's name and the OSDMap (which contains the CRUSH map).

The current utilization of the OSDs can't be taken into account, because
that would give a different outcome every time and you would no longer be
able to find your objects after storing them. (I put a small sketch of
what that calculation looks like further down, below the quoted part.)

>>
>> This problem only happens in Ceph; we see the same with 14.2.2 and have
>> to change the weights manually, because the balancer is a passive
>> element of the cluster.
>>
>> I hope a future version gets a more aggressive balancer, like the
>> enterprise storage systems that allow filling up to 95% of raw capacity.
>
> I'm thinking a cronjob with a script to parse the output of `ceph osd df
> tree` and reweight according to the percentage used would be relatively
> easy to write. But I'll concentrate on monitoring before I start
> tweaking there ;-)
>

A reweight might actually cause even more confusion for the balancer. The
balancer uses upmap mode, which re-allocates PGs to different OSDs when
needed.

Looking at the output sent earlier, I have some comments. See below.

> Cheers
>
> /Simon
>
>>
>> Regards
>>
>>
>> -----Original Message-----
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On behalf of Simon
>> Oosthoek
>> Sent: Monday, 26 August 2019 11:52
>> To: Dan van der Ster <dan@xxxxxxxxxxxxxx>
>> CC: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: cephfs full, 2/3 Raw capacity used
>>
>> On 26-08-19 11:37, Dan van der Ster wrote:
>>> Thanks. The version and balancer config look good.
>>>
>>> So you can try `ceph osd reweight osd.10 0.8` to see if it helps to
>>> get you out of this.
>>
>> I've done this and the next fullest 3 osds. This will take some time to
>> recover, I'll let you know when it's done.
>>
>> Thanks,
>>
>> /simon
>>
>>>
>>> -- dan
>>>
>>> On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek
>>> <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On 26-08-19 11:16, Dan van der Ster wrote:
>>>>> Hi,
>>>>>
>>>>> Which version of ceph are you using? Which balancer mode?
>>>>
>>>> Nautilus (14.2.2), balancer is in upmap mode.
>>>>
>>>>> The balancer score isn't a percent-error or anything humanly usable.
>>>>> `ceph osd df tree` can better show you exactly which osds are
>>>>> over/under utilized and by how much.
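
Coming back to the placement calculation I mentioned above: conceptually
it is nothing more than the sketch below. This is a deliberately
simplified illustration, not Ceph's real implementation (Ceph uses its own
rjenkins hash, a "stable mod" against pg_num and the CRUSH map, not
Python's md5 and a plain modulo), but the shape is the same: the only
inputs are the object name, the pool and the PG count. How full an OSD is
never enters into it.

    # Simplified sketch of object -> PG placement (illustration only).
    # Real librados hashes the name with rjenkins and lets CRUSH map the
    # resulting PG to OSDs using only the OSDMap and CRUSH weights.
    import hashlib

    def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
        # Hash the name; the same name always lands in the same PG.
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        return "{}.{:x}".format(pool_id, h % pg_num)

    # CRUSH (not shown) would then map e.g. "13.2a" to an ordered set of
    # OSDs -- again without ever looking at utilization.
    print(object_to_pg("10000000abc.00000000", pool_id=13, pg_num=256))
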
>>>>>
>>>>
>>>> Aha, I ran this and sorted on the %full column:
>>>>
>>>>   81 hdd 10.81149 1.00000 11 TiB 5.2 TiB 5.1 TiB   4 KiB 14 GiB 5.6 TiB 48.40 0.73 96 up osd.81
>>>>   48 hdd 10.81149 1.00000 11 TiB 5.3 TiB 5.2 TiB  15 KiB 14 GiB 5.5 TiB 49.08 0.74 95 up osd.48
>>>>  154 hdd 10.81149 1.00000 11 TiB 5.5 TiB 5.4 TiB 2.6 GiB 15 GiB 5.3 TiB 50.95 0.76 96 up osd.154
>>>>  129 hdd 10.81149 1.00000 11 TiB 5.5 TiB 5.4 TiB 5.1 GiB 16 GiB 5.3 TiB 51.33 0.77 96 up osd.129
>>>>   42 hdd 10.81149 1.00000 11 TiB 5.6 TiB 5.5 TiB 2.6 GiB 14 GiB 5.2 TiB 51.81 0.78 96 up osd.42
>>>>  122 hdd 10.81149 1.00000 11 TiB 5.7 TiB 5.6 TiB  16 KiB 14 GiB 5.1 TiB 52.47 0.79 96 up osd.122
>>>>  120 hdd 10.81149 1.00000 11 TiB 5.7 TiB 5.6 TiB 2.6 GiB 15 GiB 5.1 TiB 52.92 0.79 95 up osd.120
>>>>   96 hdd 10.81149 1.00000 11 TiB 5.8 TiB 5.7 TiB 2.6 GiB 15 GiB 5.0 TiB 53.58 0.80 96 up osd.96
>>>>   26 hdd 10.81149 1.00000 11 TiB 5.8 TiB 5.7 TiB  20 KiB 15 GiB 5.0 TiB 53.68 0.80 97 up osd.26
>>>> ...
>>>>    6 hdd 10.81149 1.00000 11 TiB 8.3 TiB 8.2 TiB  88 KiB 18 GiB 2.5 TiB 77.14 1.16 96 up osd.6
>>>>   16 hdd 10.81149 1.00000 11 TiB 8.4 TiB 8.3 TiB  28 KiB 18 GiB 2.4 TiB 77.56 1.16 95 up osd.16
>>>>    0 hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.4 TiB  48 KiB 17 GiB 2.2 TiB 79.24 1.19 96 up osd.0
>>>>  144 hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB 2.6 GiB 18 GiB 2.2 TiB 79.57 1.19 95 up osd.144
>>>>  136 hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB  48 KiB 17 GiB 2.2 TiB 79.60 1.19 95 up osd.136
>>>>   63 hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB 2.6 GiB 17 GiB 2.2 TiB 79.60 1.19 95 up osd.63
>>>>  155 hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB   8 KiB 19 GiB 2.2 TiB 79.85 1.20 95 up osd.155
>>>>   89 hdd 10.81149 1.00000 11 TiB 8.7 TiB 8.5 TiB  12 KiB 20 GiB 2.2 TiB 80.04 1.20 96 up osd.89
>>>>  106 hdd 10.81149 1.00000 11 TiB 8.8 TiB 8.7 TiB  64 KiB 19 GiB 2.0 TiB 81.38 1.22 96 up osd.106
>>>>   94 hdd 10.81149 1.00000 11 TiB 9.0 TiB 8.9 TiB     0 B 19 GiB 1.8 TiB 83.53 1.25 96 up osd.94
>>>>   33 hdd 10.81149 1.00000 11 TiB 9.1 TiB 9.0 TiB  44 KiB 19 GiB 1.7 TiB 84.40 1.27 96 up osd.33
>>>>   15 hdd 10.81149 1.00000 11 TiB  10 TiB 9.8 TiB  16 KiB 20 GiB 877 GiB 92.08 1.38 96 up osd.15
>>>>   53 hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB 2.6 GiB 20 GiB 676 GiB 93.90 1.41 96 up osd.53
>>>>   51 hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB 2.6 GiB 20 GiB 666 GiB 93.98 1.41 96 up osd.51
>>>>   10 hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB  40 KiB 22 GiB 552 GiB 95.01 1.42 97 up osd.10
>>>>
>>>> So the fullest one is at 95.01%, the emptiest one at 48.4%, so
>>>> there's some balancing to be done.
>>>>

Looking at this output, the balancing actually looks fine, just from a
different perspective: PGs are allocated to OSDs, not objects or bytes.
Every OSD has 95~97 Placement Groups allocated. That's good, an almost
perfect distribution.

The problem that now arises is the difference in size between those
Placement Groups, because they hold different objects. This is one of the
side effects of larger disks: the PGs on them grow, and that leads to
imbalance between the OSDs.

I *think* that increasing the number of PGs on this cluster would help,
but only for the pools that will hold most of the data. It will consume a
bit more CPU power and memory, but on modern systems that should be less
of a problem.
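
To put a rough number on it: the PG count per OSD is flat at 95~97, yet
utilization still ranges from 48% to 95%, so the entire spread comes from
differences in PG size. With ~11 TiB OSDs and ~96 PGs each, the average PG
is already somewhere in the 75-100 GiB range, so a handful of oversized
PGs is enough to push a single OSD towards full. Doubling pg_num on the
cephfs data pools that hold the bulk of the data (per the `ceph df` output
further down) roughly halves the average PG size. Something like
`ceph osd pool set cephfs_data_ec83 pg_num <new value>`, one pool at a
time, would do it; that takes you from ~96 to roughly double the PG copies
per OSD, which should still be manageable, but check mon_max_pg_per_osd
(250 by default, if I remember right) before going further.
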
The good thing is that with Nautilus you can also scale the number of PGs
down again if it ever becomes a problem. More PGs means smaller PGs, and
thus a better data distribution.

>>>>> You might be able to manually fix things by using `ceph osd reweight
>>>>> ...` on the most full osds to move data elsewhere.
>>>>
>>>> I'll look into this, but I was hoping that the balancer module would
>>>> take care of this...
>>>>
>>>>>
>>>>> Otherwise, in general, it's good to set up monitoring so you notice
>>>>> and take action well before the osds fill up.
>>>>
>>>> Yes, I'm still working on this. I want to add some checks to our
>>>> check_mk+icinga setup using native plugins, but my python skills are
>>>> not quite up to the task, at least, not yet ;-)
>>>>
>>>> Cheers
>>>>
>>>> /Simon
>>>>
>>>>>
>>>>> Cheers, Dan
>>>>>
>>>>> On Mon, Aug 26, 2019 at 11:09 AM Simon Oosthoek
>>>>> <s.oosthoek@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we're building up our experience with our ceph cluster before we
>>>>>> take it into production. I've now tried to fill up the cluster with
>>>>>> cephfs, which we plan to use for about 95% of all data on the
>>>>>> cluster.
>>>>>>
>>>>>> The cephfs pools are full when the cluster reports 67% raw capacity
>>>>>> used. There are 4 pools we use for cephfs data: 3-copy, 4-copy, EC
>>>>>> 8+3 and EC 5+7. The balancer module is turned on and `ceph balancer
>>>>>> eval` gives `current cluster score 0.013255 (lower is better)`, so
>>>>>> well within the default 5% margin. Is there a setting we can tweak
>>>>>> to increase the usable RAW capacity to say 85% or 90%, or is this
>>>>>> the most we can expect to store on the cluster?
>>>>>>
>>>>>> [root@cephmon1 ~]# ceph df
>>>>>> RAW STORAGE:
>>>>>>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>>>>>>     hdd       1.8 PiB     605 TiB     1.2 PiB     1.2 PiB          66.71
>>>>>>     TOTAL     1.8 PiB     605 TiB     1.2 PiB     1.2 PiB          66.71
>>>>>>
>>>>>> POOLS:
>>>>>>     POOL                    ID     STORED       OBJECTS     USED        %USED     MAX AVAIL
>>>>>>     cephfs_data              1     111 MiB       79.26M     1.2 GiB    100.00           0 B
>>>>>>     cephfs_metadata          2      52 GiB        4.91M      52 GiB    100.00           0 B
>>>>>>     cephfs_data_4copy        3     106 TiB       46.36M     428 TiB    100.00           0 B
>>>>>>     cephfs_data_3copy        8      93 TiB       42.08M     282 TiB    100.00           0 B
>>>>>>     cephfs_data_ec83        13     106 TiB       50.11M     161 TiB    100.00           0 B
>>>>>>     rbd                     14      21 GiB        5.62k      63 GiB    100.00           0 B
>>>>>>     .rgw.root               15     1.2 KiB            4       1 MiB    100.00           0 B
>>>>>>     default.rgw.control     16         0 B            8         0 B         0           0 B
>>>>>>     default.rgw.meta        17       765 B            4       1 MiB    100.00           0 B
>>>>>>     default.rgw.log         18         0 B          207         0 B         0           0 B
>>>>>>     scbench                 19     133 GiB       34.14k     400 GiB    100.00           0 B
>>>>>>     cephfs_data_ec57        20     126 TiB       51.84M     320 TiB    100.00           0 B
>>>>>>
>>>>>> [root@cephmon1 ~]# ceph balancer eval
>>>>>> current cluster score 0.013255 (lower is better)
>>>>>>
>>>>>>
>>>>>> Being full at 2/3 Raw used is a bit too "pretty" to be accidental;
>>>>>> it seems like this could be a parameter for cephfs, however I
>>>>>> couldn't find anything like this in the documentation for Nautilus.
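
It isn't a cephfs parameter. What you are hitting are the OSD full ratios:
by default an OSD is nearfull at 85%, backfillfull at 90% and full at 95%
utilization (the current values show up in `ceph osd dump | grep ratio`).
As soon as one OSD crosses the full ratio, the pools with PGs on that OSD
are marked full and MAX AVAIL drops to 0 B, which matches your output:
osd.10 at 95.01% is what stops the writes, even though the average is only
~67% raw. Raising the ratios only buys a little headroom; the real fix is
getting the per-OSD spread down.

For the check_mk/icinga checks you mention above, something along these
lines could be a starting point. It is only a rough, untested sketch, a
plain Nagios-style exit-code check rather than a real check_mk plugin: it
assumes `ceph osd df --format json` returns a "nodes" list with a
"utilization" field per OSD (check the field names against your 14.2.2
output), and the 75/85 thresholds are just my own picks, chosen to fire
well before the default ratios.

    # Rough sketch: warn/crit on the fullest OSD before the full ratios hit.
    import json
    import subprocess

    WARN, CRIT = 75.0, 85.0  # percent used; fire before 85/90/95 defaults

    out = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
    nodes = json.loads(out).get("nodes", [])

    # Fullest OSD decides the state; also report the min/max spread.
    worst = max(nodes, key=lambda n: n["utilization"])
    spread = worst["utilization"] - min(n["utilization"] for n in nodes)

    status = 0
    if worst["utilization"] >= CRIT:
        status = 2
    elif worst["utilization"] >= WARN:
        status = 1

    print("{} - fullest OSD {} at {:.1f}%, spread {:.1f} points".format(
        ["OK", "WARN", "CRIT"][status], worst["name"],
        worst["utilization"], spread))
    raise SystemExit(status)
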
>>>>>>
>>>>>>
>>>>>> The logs in the dashboard show this:
>>>>>>
>>>>>> 2019-08-26 11:00:00.000630 [ERR] overall HEALTH_ERR 3 backfillfull osd(s); 1 full osd(s); 12 pool(s) full
>>>>>> 2019-08-26 10:57:44.539964 [INF] Health check cleared: POOL_BACKFILLFULL (was: 12 pool(s) backfillfull)
>>>>>> 2019-08-26 10:57:44.539944 [WRN] Health check failed: 12 pool(s) full (POOL_FULL)
>>>>>> 2019-08-26 10:57:44.539926 [ERR] Health check failed: 1 full osd(s) (OSD_FULL)
>>>>>> 2019-08-26 10:57:44.539899 [WRN] Health check update: 3 backfillfull osd(s) (OSD_BACKFILLFULL)
>>>>>> 2019-08-26 10:00:00.000088 [WRN] overall HEALTH_WARN 4 backfillfull osd(s); 12 pool(s) backfillfull
>>>>>>
>>>>>> So it seems that ceph is completely stuck at 2/3 full, while we
>>>>>> anticipated being able to fill up the cluster to at least 85-90% of
>>>>>> the raw capacity. Or at least so that we would keep a functioning
>>>>>> cluster when we have a single osd node fail.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> /Simon

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com