Re: OSD META usage growing without bounds

Hi Igor,

I think you found the issue. I wonder why the backport was rejected for mimic; it almost looks like someone pushed the wrong button rather than a deliberate decision. Kind of strange.

The reason it crashes our cluster is probably that our disks are much smaller than the maximum replay log size. If you know of a config option, or a parameter used at OSD creation, that would reduce the maximum replay log size, please let me know.
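
In case it helps, this is roughly how I plan to look for such a knob on a
running OSD. Just a sketch via the admin socket; I don't yet know which, if
any, of the reported options would be the right one to touch:

    # list the BlueFS-related settings the daemon is actually running with
    ceph daemon osd.0 config show | grep -i bluefs

    # show only the values that differ from the compiled-in defaults
    ceph daemon osd.0 config diff | grep -i bluefs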

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 11 January 2022 12:59:21
To: Frank Schilder; ceph-users
Subject: Re:  OSD META usage growing without bounds

And here is an overview from the PR
(https://github.com/ceph/ceph/pull/35473), which is to some degree in
line with your initial observation (cluster sitting idle for a long
period):

"Original problem stemmed from BlueFS inability to replay log, which was
caused by BlueFS previously wrote replay log that was corrupted, which
was caused by BlueFS log growing to extreme size (~600GB), which was
caused by OSD working in a way, when BlueFS::sync_metadata was never
invoked."


Igor


On 1/11/2022 2:52 PM, Igor Fedotov wrote:
> Frank,
>
> btw - are you aware of https://tracker.ceph.com/issues/45903 ?
>
> I can see it was rejected for mimic for whatever reason. Either way, I
> presume it might be pretty relevant to your case...
>
>
> Thanks,
>
> Igor
>
> On 1/11/2022 2:45 PM, Frank Schilder wrote:
>> Hi Igor,
>>
>> thanks for your reply. To avoid further OSD failures, I shut the
>> cluster down yesterday. Unfortunately, after the restart all OSDs
>> trimmed whatever was filling them up:
>>
>> [root@rit-tceph ~]# ceph osd df tree
>> ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP   META     AVAIL   %USE VAR  PGS TYPE NAME
>> -1       2.44707        - 2.4 TiB 9.2 GiB 255 MiB 25 KiB 9.0 GiB  2.4 TiB 0.37 1.00   - root default
>> -5       0.81569        - 835 GiB 3.1 GiB  85 MiB 19 KiB 3.0 GiB  832 GiB 0.37 1.00   -     host tceph-01
>>  0   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB 19 KiB 1024 MiB 277 GiB 0.37 1.00 169         osd.0
>>  3   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB    0 B    1 GiB 277 GiB 0.37 1.00 164         osd.3
>>  8   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB    0 B    1 GiB 277 GiB 0.37 1.00 167         osd.8
>> -3       0.81569        - 835 GiB 3.1 GiB  85 MiB  3 KiB 3.0 GiB  832 GiB 0.37 1.00   -     host tceph-02
>>  2   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB    0 B    1 GiB 277 GiB 0.37 1.00 157         osd.2
>>  4   hdd 0.27190  1.00000 278 GiB 1.0 GiB  29 MiB    0 B    1 GiB 277 GiB 0.37 1.00 172         osd.4
>>  6   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB  3 KiB 1024 MiB 277 GiB 0.37 1.00 171         osd.6
>> -7       0.81569        - 835 GiB 3.1 GiB  85 MiB  3 KiB 3.0 GiB  832 GiB 0.37 1.00   -     host tceph-03
>>  1   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB    0 B    1 GiB 277 GiB 0.37 1.00 171         osd.1
>>  5   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB  3 KiB 1024 MiB 277 GiB 0.37 1.00 160         osd.5
>>  7   hdd 0.27190  1.00000 278 GiB 1.0 GiB  28 MiB    0 B    1 GiB 277 GiB 0.37 1.00 169         osd.7
>>                     TOTAL 2.4 TiB 9.2 GiB 255 MiB 25 KiB 9.0 GiB  2.4 TiB 0.37
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> The OSDs didn't log what they were doing on startup. The log goes
>> straight from bluefs init to PG scrub messages (with a very long wait
>> time in between). Iostat showed very heavy reads on the drives during
>> the trimming/boot phase.
>>
>> I'm not sure it helps to collect perf counters right now. I will wait
>> until I see some unusual growth in META again. I don't think the
>> problem is present from the start; it looked more like the OSDs started
>> filling META up independently, at different times. I will let the
>> cluster sit idle as before and keep watching. Hope I find something.
>>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 11 January 2022 10:27:14
>> To: Frank Schilder; ceph-users
>> Subject: Re:  OSD META usage growing without bounds
>>
>> Hi Frank,
>>
>> you might want to collect a couple of perf dumps for the OSD in question
>> at, e.g., a one-hour interval, and inspect which counters are growing in
>> the bluefs section. "log_bytes" is of particular interest...
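>>
>> Something along these lines would do as a starting point (just a sketch;
>> adjust the osd id, the interval and the output path as needed):
>>
>>     # take a perf dump every hour so the bluefs counters can be compared
>>     while true; do
>>         ceph daemon osd.1 perf dump > /tmp/osd.1-perf-$(date +%Y%m%d-%H%M).json
>>         sleep 3600
>>     done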
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 1/10/2022 2:25 PM, Frank Schilder wrote:
>>> Hi, I'm observing strange behaviour on a small test cluster
>>> (13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)).
>>> The cluster has been up for about half a year and is almost empty. We
>>> did a few rbd bench runs and created a file system, but there has been
>>> zero client IO for at least 3 months. Recently the OSD META usage of
>>> some OSDs started to increase for no apparent reason. One OSD already
>>> died at 100% usage and another is on its way. I can't see any obvious
>>> trigger for this behaviour.
>>>
>>> If anyone has an idea, please let me know.
>>>
>>> Some diagnostic output:
>>>
>>> [root@rit-tceph ~]# ceph status
>>>     cluster:
>>>       id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>>>       health: HEALTH_WARN
>>>               1 nearfull osd(s)
>>>               3 pool(s) nearfull
>>>
>>>     services:
>>>       mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03
>>>       mgr: tceph-01(active), standbys: tceph-02, tceph-03
>>>       mds: testfs-1/1/1 up  {0=tceph-01=up:active}, 2 up:standby
>>>       osd: 9 osds: 8 up, 8 in
>>>
>>>     data:
>>>       pools:   3 pools, 500 pgs
>>>       objects: 24  objects, 2.3 KiB
>>>       usage:   746 GiB used, 1.4 TiB / 2.2 TiB avail
>>>       pgs:     500 active+clean
>>>
>>> [root@rit-tceph ~]# ceph df
>>> GLOBAL:
>>>       SIZE        AVAIL       RAW USED     %RAW USED
>>>       2.2 TiB     1.4 TiB      746 GiB         33.49
>>> POOLS:
>>>       NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
>>>       test                1         19 B         0        81 GiB           2
>>>       testfs_data         2          0 B         0        81 GiB           0
>>>       testfs_metadata     3      2.2 KiB         0        81 GiB          22
>>>
>>> [root@rit-tceph ~]# ceph osd df tree
>>> ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP   META    AVAIL   %USE  VAR  PGS TYPE NAME
>>> -1       2.44707        - 2.2 TiB 746 GiB 120 MiB 34 KiB 746 GiB 1.4 TiB 33.49 1.00   - root default
>>> -5       0.81569        - 557 GiB 195 GiB  30 MiB  3 KiB 195 GiB 362 GiB 35.04 1.05   -     host tceph-01
>>>  0   hdd 0.27190  1.00000 278 GiB  38 GiB  15 MiB  3 KiB  38 GiB 241 GiB 13.61 0.41 260         osd.0
>>>  3   hdd 0.27190        0     0 B     0 B     0 B    0 B     0 B     0 B     0    0   0         osd.3
>>>  8   hdd 0.27190  1.00000 278 GiB 157 GiB  15 MiB    0 B 157 GiB 121 GiB 56.47 1.69 240         osd.8
>>> -3       0.81569        - 835 GiB 113 GiB  45 MiB  3 KiB 113 GiB 723 GiB 13.48 0.40   -     host tceph-02
>>>  2   hdd 0.27190  1.00000 278 GiB  18 GiB  15 MiB    0 B  18 GiB 261 GiB  6.30 0.19 157         osd.2
>>>  4   hdd 0.27190  1.00000 278 GiB  48 GiB  15 MiB    0 B  48 GiB 231 GiB 17.21 0.51 172         osd.4
>>>  6   hdd 0.27190  1.00000 278 GiB  47 GiB  15 MiB  3 KiB  47 GiB 231 GiB 16.93 0.51 171         osd.6
>>> -7       0.81569        - 835 GiB 438 GiB  45 MiB 28 KiB 438 GiB 397 GiB 52.48 1.57   -     host tceph-03
>>>  1   hdd 0.27190  1.00000 278 GiB 238 GiB  15 MiB 25 KiB 238 GiB  41 GiB 85.35 2.55 171         osd.1
>>>  5   hdd 0.27190  1.00000 278 GiB 200 GiB  15 MiB  3 KiB 200 GiB  79 GiB 71.68 2.14 160         osd.5
>>>  7   hdd 0.27190  1.00000 278 GiB 1.1 GiB  15 MiB    0 B 1.1 GiB 277 GiB  0.40 0.01 169         osd.7
>>>                      TOTAL 2.2 TiB 746 GiB 120 MiB 34 KiB 746 GiB 1.4 TiB 33.49
>>> MIN/MAX VAR: 0.01/2.55  STDDEV: 30.50
>>>
>>> 2 hours later:
>>>
>>> [root@rit-tceph ~]# ceph status
>>>     cluster:
>>>       id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>>>       health: HEALTH_WARN
>>>               1 nearfull osd(s)
>>>               3 pool(s) nearfull
>>>
>>>     services:
>>>       mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03
>>>       mgr: tceph-01(active), standbys: tceph-02, tceph-03
>>>       mds: testfs-1/1/1 up  {0=tceph-01=up:active}, 2 up:standby
>>>       osd: 9 osds: 8 up, 8 in
>>>
>>>     data:
>>>       pools:   3 pools, 500 pgs
>>>       objects: 24  objects, 2.3 KiB
>>>       usage:   748 GiB used, 1.4 TiB / 2.2 TiB avail
>>>       pgs:     500 active+clean
>>>
>>> The usage is increasing surprisingly fast.
>>>
>>> Thanks for any pointers!
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



