I found it. It does indeed have to do with snapshots, but not in the way I thought.

At 04:17:39:

    HEALTH_ERR 20 large omap objects; 1 pools full
    LARGE_OMAP_OBJECTS 20 large omap objects
        20 large objects found in pool 'con-fs2-meta1'
        Search the cluster log for 'Large omap object found' for more details.
    POOL_FULL 1 pools full
        pool 'sr-rbd-meta-one' has 450 GiB (max 500 GiB)

    POOLS:
        NAME               ID     USED    %USED    MAX AVAIL    OBJECTS
        sr-rbd-meta-one     1  450 GiB     1.08       40 TiB     123930

At 09:02:27:

    POOLS:
        NAME               ID     USED    %USED    MAX AVAIL    OBJECTS
        sr-rbd-meta-one     1   91 GiB     0.22       40 TiB      32000

The culprit here is a bug in OpenNebula. During disk snapshots it stores the memory dump in the RBD meta-data pool instead of the RBD data pool (it ignores the data-pool definition in the system datastore). This leads to insane temporary usage in the meta-data pool. I have wanted to report this bug for a long time.

Thanks to everyone who replied.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
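A blip like this can be caught in the act by sampling pool and OSD usage around the snapshot-rotation window and then checking what showed up in the metadata pool. A minimal sketch, not something from the thread itself: the pool name is taken from the output above, while the sampling interval and log path are arbitrary choices.

    # sample pool and OSD usage once a minute around the snapshot window
    while true; do
        date '+%F %T'
        ceph df detail | grep -E 'NAME|sr-rbd-meta-one'
        ceph osd df | tail -n 2        # summary lines: totals and MIN/MAX VAR
        sleep 60
    done >> /tmp/pool-usage-probe.log 2>&1

    # after a spike, see which images/objects appeared in the metadata pool
    rbd ls sr-rbd-meta-one
    rados -p sr-rbd-meta-one ls | head -n 50

If the spike lines up with the snapshot rotation, the temporary objects listed in the metadata pool should correspond to the memory dumps OpenNebula writes there.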
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 15 September 2021 09:28:15
To: ceph-users@xxxxxxx
Subject: Re: Health check failed: 1 pools ful

Hi Frank,

I think the snapshot rotation could be an explanation. Just a few days
ago we had a host failure over night and some OSDs couldn't be
rebalanced entirely because they were too full. Deleting a few (large)
snapshots I created last week resolved the issue. If you monitored
'ceph osd df' for a couple of days you should probably see spikes in
the OSD usage stats. The only difference I see is that we also had
'OSD nearfull' warnings which you don't seem to have, so it might be
something else.

Zitat von Frank Schilder <frans@xxxxxx>:

> It happened again today:
>
> 2021-09-15 04:25:20.551098 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-15 04:19:01.512425 [INF] Health check cleared: POOL_FULL
> (was: 1 pools full)
> 2021-09-15 04:19:01.512389 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
> 2021-09-15 04:18:05.015251 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-15 04:18:05.015217 [ERR] Health check failed: 1 pools full
> (POOL_FULL)
> 2021-09-15 04:13:45.312115 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
>
> During this time, we are running snapshot rotation on RBD images.
> Could this have anything to do with it?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 13 September 2021 12:20
> To: ceph-users
> Subject: Health check failed: 1 pools ful
>
> Hi all,
>
> I recently had a strange blip in the ceph logs:
>
> 2021-09-09 04:19:09.612111 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-09 04:13:18.187602 [INF] Health check cleared: POOL_FULL
> (was: 1 pools full)
> 2021-09-09 04:13:18.187566 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
> 2021-09-09 04:12:09.078878 [INF] Health check cleared:
> POOL_NEAR_FULL (was: 1 pools nearfull)
> 2021-09-09 04:12:09.078850 [ERR] Health check failed: 1 pools full
> (POOL_FULL)
> 2021-09-09 04:08:16.898112 [WRN] Health check failed: 1 pools
> nearfull (POOL_NEAR_FULL)
>
> None of our pools are anywhere near full or close to their quotas:
>
> # ceph df detail
> GLOBAL:
>     SIZE      AVAIL      RAW USED    %RAW USED    OBJECTS
>     11 PiB    9.6 PiB    1.8 PiB     16.11        845.1 M
> POOLS:
>     NAME                  ID  QUOTA OBJECTS  QUOTA BYTES  USED     %USED  MAX AVAIL  OBJECTS    DIRTY    READ     WRITE    RAW USED
>     sr-rbd-meta-one        1  N/A            500 GiB       90 GiB   0.21     41 TiB     31558  31.56 k  799 MiB  338 MiB   270 GiB
>     sr-rbd-data-one        2  N/A             70 TiB       36 TiB  27.96     93 TiB  13966792  13.97 M  4.2 GiB  2.5 GiB    48 TiB
>     sr-rbd-one-stretch     3  N/A              1 TiB      222 GiB   0.52     41 TiB     68813  68.81 k  863 MiB  860 MiB   667 GiB
>     con-rbd-meta-hpc-one   7  N/A             10 GiB       51 KiB   0       1.7 TiB        61       61  7.0 MiB  3.8 MiB   154 KiB
>     con-rbd-data-hpc-one   8  N/A              5 TiB       35 GiB   0       5.9 PiB      9245   9.24 k  144 MiB   78 MiB    44 GiB
>     sr-rbd-data-one-hdd   11  N/A            200 TiB      118 TiB  39.90    177 TiB  31460630  31.46 M   14 GiB  2.2 GiB   157 TiB
>     con-fs2-meta1         12  N/A            250 GiB      2.0 GiB   0.15    1.3 TiB  18045470  18.05 M   20 MiB  108 MiB   7.9 GiB
>     con-fs2-meta2         13  N/A            100 GiB          0 B   0       1.3 TiB 216425275  216.4 M  141 KiB  7.9 MiB       0 B
>     con-fs2-data          14  N/A            2.0 PiB      1.3 PiB  18.41    5.9 PiB 541502957  541.5 M  4.9 GiB  5.0 GiB   1.7 PiB
>     con-fs2-data-ec-ssd   17  N/A              1 TiB      239 GiB   5.29    4.2 TiB   3225690   3.23 M   17 MiB      0 B   299 GiB
>     ms-rbd-one            18  N/A              1 TiB      262 GiB   0.62     41 TiB     73711  73.71 k  4.8 MiB  1.5 GiB   786 GiB
>     con-fs2-data2         19  N/A              5 PiB       29 TiB   0.52    5.4 PiB  20322725  20.32 M   83 MiB   97 MiB    39 TiB
>
> I'm not sure if IO stopped, it does not look like it. The blip might
> have been artificial. I could not find any information about which
> pool(s) was causing this.
>
> We are running ceph version 13.2.10
> (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable).
>
> Any ideas what is going on or if this could be a problem?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
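One more note on the health check itself: the 'pools full / nearfull' messages in this thread appear to be driven by the per-pool quota (the 500 GiB QUOTA BYTES on 'sr-rbd-meta-one', against which the 450 GiB spike was measured) rather than by raw OSD capacity. A minimal sketch for inspecting and, if really needed, temporarily raising such a quota; the pool name comes from the thread, the new limit is purely illustrative and not something done here.

    # show the current quota on the pool
    ceph osd pool get-quota sr-rbd-meta-one

    # temporarily raise the byte quota (example value: 750 GiB) to ride out
    # the memory-dump spike until the OpenNebula placement bug is fixed
    ceph osd pool set-quota sr-rbd-meta-one max_bytes 805306368000

Fixing where the memory dump lands is the real solution; a quota bump only avoids tripping POOL_FULL in the meantime.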