And here is an overview from the PR
(https://github.com/ceph/ceph/pull/35473), which looks to some degree
in line with your initial points (a cluster in an idle state for a long
period):
"Original problem stemmed from BlueFS inability to replay log, which was
caused by BlueFS previously wrote replay log that was corrupted, which
was caused by BlueFS log growing to extreme size (~600GB), which was
caused by OSD working in a way, when BlueFS::sync_metadata was never
invoked."
Igor
On 1/11/2022 2:52 PM, Igor Fedotov wrote:
Frank,
btw - are you aware of https://tracker.ceph.com/issues/45903 ?
I can see its backport was rejected for mimic for whatever reason. Hence I
presume it might be pretty relevant to your case...
Thanks,
Igor
On 1/11/2022 2:45 PM, Frank Schilder wrote:
Hi Igor,
thanks for your reply. To avoid further OSD failures, I shut down the
cluster yesterday. Unfortunately, after the restart all OSDs trimmed
whatever was filling them up:
[root@rit-tceph ~]# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     USE      DATA     OMAP    META      AVAIL    %USE  VAR   PGS  TYPE NAME
-1         2.44707         -  2.4 TiB  9.2 GiB  255 MiB  25 KiB  9.0 GiB   2.4 TiB  0.37  1.00    -  root default
-5         0.81569         -  835 GiB  3.1 GiB   85 MiB  19 KiB  3.0 GiB   832 GiB  0.37  1.00    -      host tceph-01
 0  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB  19 KiB  1024 MiB  277 GiB  0.37  1.00  169          osd.0
 3  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB     0 B     1 GiB  277 GiB  0.37  1.00  164          osd.3
 8  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB     0 B     1 GiB  277 GiB  0.37  1.00  167          osd.8
-3         0.81569         -  835 GiB  3.1 GiB   85 MiB   3 KiB   3.0 GiB  832 GiB  0.37  1.00    -      host tceph-02
 2  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB     0 B     1 GiB  277 GiB  0.37  1.00  157          osd.2
 4  hdd    0.27190   1.00000  278 GiB  1.0 GiB   29 MiB     0 B     1 GiB  277 GiB  0.37  1.00  172          osd.4
 6  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB   3 KiB  1024 MiB  277 GiB  0.37  1.00  171          osd.6
-7         0.81569         -  835 GiB  3.1 GiB   85 MiB   3 KiB   3.0 GiB  832 GiB  0.37  1.00    -      host tceph-03
 1  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB     0 B     1 GiB  277 GiB  0.37  1.00  171          osd.1
 5  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB   3 KiB  1024 MiB  277 GiB  0.37  1.00  160          osd.5
 7  hdd    0.27190   1.00000  278 GiB  1.0 GiB   28 MiB     0 B     1 GiB  277 GiB  0.37  1.00  169          osd.7
                       TOTAL  2.4 TiB  9.2 GiB  255 MiB  25 KiB   9.0 GiB  2.4 TiB  0.37
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
The OSDs didn't log what they were doing on startup. The log goes
straight from BlueFS init to PG scrub messages (with a very long wait
in between). iostat showed very heavy read activity on the drives during
the trimming/boot phase.
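If it happens again, one thing I could try is to temporarily raise the
BlueFS/BlueStore debug levels on the OSD hosts before restarting, so the
startup activity actually shows up in the log. A rough sketch for
ceph.conf (the levels are just a guess and I would revert them afterwards):

    [osd]
        debug bluefs = 20
        debug bluestore = 10
        debug rocksdb = 4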
I'm not sure whether it helps to collect perf counters already now. I will
wait until I see some unusual growth in META again. I don't think the
problem is there from the start; it looked more like the OSDs started
filling META up independently at different times. I will let the
cluster sit idle as before and keep watching. I hope I find something.
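To catch the growth early this time, I'll probably just run a simple
watcher in a screen session, along these lines (log path and interval are
arbitrary):

    # append a timestamped snapshot of OSD usage every hour
    while true; do
        { date; ceph osd df tree; } >> /root/osd-df-history.log
        sleep 3600
    done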
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 11 January 2022 10:27:14
To: Frank Schilder; ceph-users
Subject: Re: OSD META usage growing without bounds
Hi Frank,
you might want to collect a couple of perf dumps for the OSD in question
at, e.g., a one-hour interval, and inspect which counters are growing in
the bluefs section. "log_bytes" is of particular interest...
Thanks,
Igor
On 1/10/2022 2:25 PM, Frank Schilder wrote:
Hi, I'm observing some strange behaviour on a small test cluster
(13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)).
The cluster has been up for about half a year and is almost empty. We did
a few rbd bench runs and created a file system, but there has been zero
client IO for at least 3 months. It looks like the META usage of some
OSDs recently started to increase for no apparent reason. One OSD already
died with 100% usage and another is on its way. I can't see any obvious
reason for this strange behaviour.
If anyone has an idea, please let me know.
Some diagnostic output:
[root@rit-tceph ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 nearfull osd(s)
            3 pool(s) nearfull

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03
    mgr: tceph-01(active), standbys: tceph-02, tceph-03
    mds: testfs-1/1/1 up {0=tceph-01=up:active}, 2 up:standby
    osd: 9 osds: 8 up, 8 in

  data:
    pools:   3 pools, 500 pgs
    objects: 24 objects, 2.3 KiB
    usage:   746 GiB used, 1.4 TiB / 2.2 TiB avail
    pgs:     500 active+clean
[root@rit-tceph ~]# ceph df
GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    2.2 TiB  1.4 TiB  746 GiB   33.49
POOLS:
    NAME             ID  USED     %USED  MAX AVAIL  OBJECTS
    test             1   19 B     0      81 GiB     2
    testfs_data      2   0 B      0      81 GiB     0
    testfs_metadata  3   2.2 KiB  0      81 GiB     22
[root@rit-tceph ~]# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     USE      DATA     OMAP    META     AVAIL    %USE   VAR   PGS  TYPE NAME
-1         2.44707         -  2.2 TiB  746 GiB  120 MiB  34 KiB  746 GiB  1.4 TiB  33.49  1.00    -  root default
-5         0.81569         -  557 GiB  195 GiB   30 MiB   3 KiB  195 GiB  362 GiB  35.04  1.05    -      host tceph-01
 0  hdd    0.27190   1.00000  278 GiB   38 GiB   15 MiB   3 KiB   38 GiB  241 GiB  13.61  0.41  260          osd.0
 3  hdd    0.27190         0      0 B      0 B      0 B     0 B      0 B      0 B      0     0    0          osd.3
 8  hdd    0.27190   1.00000  278 GiB  157 GiB   15 MiB     0 B  157 GiB  121 GiB  56.47  1.69  240          osd.8
-3         0.81569         -  835 GiB  113 GiB   45 MiB   3 KiB  113 GiB  723 GiB  13.48  0.40    -      host tceph-02
 2  hdd    0.27190   1.00000  278 GiB   18 GiB   15 MiB     0 B   18 GiB  261 GiB   6.30  0.19  157          osd.2
 4  hdd    0.27190   1.00000  278 GiB   48 GiB   15 MiB     0 B   48 GiB  231 GiB  17.21  0.51  172          osd.4
 6  hdd    0.27190   1.00000  278 GiB   47 GiB   15 MiB   3 KiB   47 GiB  231 GiB  16.93  0.51  171          osd.6
-7         0.81569         -  835 GiB  438 GiB   45 MiB  28 KiB  438 GiB  397 GiB  52.48  1.57    -      host tceph-03
 1  hdd    0.27190   1.00000  278 GiB  238 GiB   15 MiB  25 KiB  238 GiB   41 GiB  85.35  2.55  171          osd.1
 5  hdd    0.27190   1.00000  278 GiB  200 GiB   15 MiB   3 KiB  200 GiB   79 GiB  71.68  2.14  160          osd.5
 7  hdd    0.27190   1.00000  278 GiB  1.1 GiB   15 MiB     0 B  1.1 GiB  277 GiB   0.40  0.01  169          osd.7
                       TOTAL  2.2 TiB  746 GiB  120 MiB  34 KiB  746 GiB  1.4 TiB  33.49
MIN/MAX VAR: 0.01/2.55  STDDEV: 30.50
2 hours later:
[root@rit-tceph ~]# ceph status
  cluster:
    id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
    health: HEALTH_WARN
            1 nearfull osd(s)
            3 pool(s) nearfull

  services:
    mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03
    mgr: tceph-01(active), standbys: tceph-02, tceph-03
    mds: testfs-1/1/1 up {0=tceph-01=up:active}, 2 up:standby
    osd: 9 osds: 8 up, 8 in

  data:
    pools:   3 pools, 500 pgs
    objects: 24 objects, 2.3 KiB
    usage:   748 GiB used, 1.4 TiB / 2.2 TiB avail
    pgs:     500 active+clean
The usage is increasing surprisingly fast.
Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx