I found it. It does indeed have to do with snapshots, but not in the way I thought.

At 04:17:39:

    HEALTH_ERR 20 large omap objects; 1 pools full
    LARGE_OMAP_OBJECTS 20 large omap objects
        20 large objects found in pool 'con-fs2-meta1'
        Search the cluster log for 'Large omap object found' for more details.
    POOL_FULL 1 pools full
        pool 'sr-rbd-meta-one' has 450 GiB (max 500 GiB)

    POOLS:
        NAME               ID     USED    %USED    MAX AVAIL    OBJECTS
        sr-rbd-meta-one     1  450 GiB     1.08       40 TiB     123930

At 09:02:27:

    POOLS:
        NAME               ID     USED    %USED    MAX AVAIL    OBJECTS
        sr-rbd-meta-one     1   91 GiB     0.22       40 TiB      32000

The culprit here is a bug in OpenNebula. During disk snapshots it stores the memory dump in the RBD meta-data pool instead of the RBD data pool (it ignores the data-pool definition in the system datastore). This leads to insane temporary usage in the meta-data pool. I have wanted to report this bug for a long time.

Thanks to everyone who replied.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
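A blip like this can be caught in the act by sampling pool and OSD usage around the snapshot-rotation window and then checking what showed up in the metadata pool. A minimal sketch, not something from the thread itself: the pool name is taken from the output above, while the sampling interval and log path are arbitrary choices.

    # sample pool and OSD usage once a minute around the snapshot window
    while true; do
        date '+%F %T'
        ceph df detail | grep -E 'NAME|sr-rbd-meta-one'
        ceph osd df | tail -n 2        # summary lines: totals and MIN/MAX VAR
        sleep 60
    done >> /tmp/pool-usage-probe.log 2>&1

    # after a spike, see which images/objects appeared in the metadata pool
    rbd ls sr-rbd-meta-one
    rados -p sr-rbd-meta-one ls | head -n 50

If the spike lines up with the snapshot rotation, the temporary objects listed in the metadata pool should correspond to the memory dumps OpenNebula writes there.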
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 15 September 2021 09:28:15
To: ceph-users@xxxxxxx
Subject: Re: Health check failed: 1 pools ful

Hi Frank,

I think the snapshot rotation could be an explanation. Just a few days
ago we had a host failure over night and some OSDs couldn't be
rebalanced entirely because they were too full. Deleting a few (large)
snapshots I created last week resolved the issue. If you monitored
'ceph osd df' for a couple of days you should probably see spikes in
the OSD usage stats. The only difference I see is that we also had
'OSD nearfull' warnings which you don't seem to have, so it might be
something else.

Zitat von Frank Schilder <frans@xxxxxx>:

> It happened again today:
>
> 2021-09-15 04:25:20.551098 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-15 04:19:01.512425 [INF] Health check cleared: POOL_FULL
> (was: 1 pools full)
> 2021-09-15 04:19:01.512389 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
> 2021-09-15 04:18:05.015251 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-15 04:18:05.015217 [ERR] Health check failed: 1 pools full
> (POOL_FULL)
> 2021-09-15 04:13:45.312115 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
>
> During this time, we are running snapshot rotation on RBD images.
> Could this have anything to do with it?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 13 September 2021 12:20
> To: ceph-users
> Subject: Health check failed: 1 pools ful
>
> Hi all,
>
> I recently had a strange blip in the ceph logs:
>
> 2021-09-09 04:19:09.612111 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-09 04:13:18.187602 [INF] Health check cleared: POOL_FULL
> (was: 1 pools full)
> 2021-09-09 04:13:18.187566 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
> 2021-09-09 04:12:09.078878 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-09 04:12:09.078850 [ERR] Health check failed: 1 pools full
> (POOL_FULL)
> 2021-09-09 04:08:16.898112 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
>
> None of our pools are anywhere near full or close to their quotas:
>
> # ceph df detail
> GLOBAL:
>     SIZE      AVAIL      RAW USED    %RAW USED    OBJECTS
>     11 PiB    9.6 PiB    1.8 PiB     16.11        845.1 M
> POOLS:
>     NAME                  ID  QUOTA OBJECTS  QUOTA BYTES  USED     %USED  MAX AVAIL  OBJECTS    DIRTY    READ     WRITE    RAW USED
>     sr-rbd-meta-one        1  N/A            500 GiB       90 GiB   0.21     41 TiB     31558  31.56 k  799 MiB  338 MiB   270 GiB
>     sr-rbd-data-one        2  N/A             70 TiB       36 TiB  27.96     93 TiB  13966792  13.97 M  4.2 GiB  2.5 GiB    48 TiB
>     sr-rbd-one-stretch     3  N/A              1 TiB      222 GiB   0.52     41 TiB     68813  68.81 k  863 MiB  860 MiB   667 GiB
>     con-rbd-meta-hpc-one   7  N/A             10 GiB       51 KiB   0       1.7 TiB        61       61  7.0 MiB  3.8 MiB   154 KiB
>     con-rbd-data-hpc-one   8  N/A              5 TiB       35 GiB   0       5.9 PiB      9245   9.24 k  144 MiB   78 MiB    44 GiB
>     sr-rbd-data-one-hdd   11  N/A            200 TiB      118 TiB  39.90    177 TiB  31460630  31.46 M   14 GiB  2.2 GiB   157 TiB
>     con-fs2-meta1         12  N/A            250 GiB      2.0 GiB   0.15    1.3 TiB  18045470  18.05 M   20 MiB  108 MiB   7.9 GiB
>     con-fs2-meta2         13  N/A            100 GiB          0 B   0       1.3 TiB 216425275  216.4 M  141 KiB  7.9 MiB       0 B
>     con-fs2-data          14  N/A            2.0 PiB      1.3 PiB  18.41    5.9 PiB 541502957  541.5 M  4.9 GiB  5.0 GiB   1.7 PiB
>     con-fs2-data-ec-ssd   17  N/A              1 TiB      239 GiB   5.29    4.2 TiB   3225690   3.23 M   17 MiB      0 B   299 GiB
>     ms-rbd-one            18  N/A              1 TiB      262 GiB   0.62     41 TiB     73711  73.71 k  4.8 MiB  1.5 GiB   786 GiB
>     con-fs2-data2         19  N/A              5 PiB       29 TiB   0.52    5.4 PiB  20322725  20.32 M   83 MiB   97 MiB    39 TiB
>
> I'm not sure if IO stopped, it does not look like it. The blip might
> have been artificial. I could not find any information about which
> pool(s) was causing this.
>
> We are running ceph version 13.2.10
> (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable).
>
> Any ideas what is going on or if this could be a problem?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
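One more note on the health check itself: the 'pools full / nearfull' messages in this thread appear to be driven by the per-pool quota (the 500 GiB QUOTA BYTES on 'sr-rbd-meta-one', against which the 450 GiB spike was measured) rather than by raw OSD capacity. A minimal sketch for inspecting and, if really needed, temporarily raising such a quota; the pool name comes from the thread, the new limit is purely illustrative and not something done here.

    # show the current quota on the pool
    ceph osd pool get-quota sr-rbd-meta-one

    # temporarily raise the byte quota (example value: 750 GiB) to ride out
    # the memory-dump spike until the OpenNebula placement bug is fixed
    ceph osd pool set-quota sr-rbd-meta-one max_bytes 805306368000

Fixing where the memory dump lands is the real solution; a quota bump only avoids tripping POOL_FULL in the meantime.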