Re: osd nearfull is not detected

Are you currently doing IO on the relevant pool? Maybe nearfull isn't
reported until some fresh pgstats come in.
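
One quick way to check whether fresh PG stats are actually arriving is to run
something like this a couple of times and see whether the VERSION/REPORTED
columns move (just a sketch; both subcommands should be there on Nautilus):

    ceph pg ls-by-pool meta_ru1b | head
    ceph pg ls-by-osd 696 | head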

Otherwise, sorry, I haven't seen this.


Dan



On Wed, Apr 21, 2021, 8:05 PM Konstantin Shalygin <k0ste@xxxxxxxx> wrote:

> Hi,
>
> On the adopted cluster, Prometheus triggered its "osd full > 90%" alert,
> but Ceph itself did not. The OSD really is that full and is being drained
> (see %USE below).
>
> root@host# ceph osd df name osd.696
> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
> 696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.09 1.00  47     up
>                      TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.09
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
> root@host# ceph osd df name osd.696
> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
> 696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.08 1.00  47     up
>                      TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.08
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
> root@host# ceph osd df name osd.696
> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
> 696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.07 1.00  47     up
>                      TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.07
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
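>
> For what it's worth, the Prometheus alert presumably comes from the mgr
> prometheus module; the raw per-OSD metrics can be pulled straight from the
> exporter to compare with the values above (the mgr host is a placeholder,
> 9283 is the module's default port):
>
> root@host# curl -s http://<mgr-host>:9283/metrics | grep '^ceph_osd_stat_bytes'
>
> ceph_osd_stat_bytes_used divided by ceph_osd_stat_bytes for osd.696 should
> come out around 0.91 here, matching the %USE column.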
>
> Pool 18 is backed by another device class; the OSDs of that pool trigger
> the warnings as usual, but the OSDs of pool 17 do not.
>
> root@host# ceph health detail
> HEALTH_WARN noout flag(s) set; Some pool(s) have the nodeep-scrub flag(s) set; Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
> OSDMAP_FLAGS noout flag(s) set
> POOL_SCRUB_FLAGS Some pool(s) have the nodeep-scrub flag(s) set
>     Pool meta_ru1b has nodeep-scrub flag
>     Pool data_ru1b has nodeep-scrub flag
> PG_BACKFILL_FULL Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
>     pg 18.1008 is active+remapped+backfill_wait+backfill_toofull, acting [336,462,580]
>     pg 18.27e0 is active+remapped+backfill_wait+backfill_toofull, acting [401,627,210]
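>
> For comparison, this is how the OSDs behind each device class can be listed
> and checked side by side (the class filter for "ceph osd df" should be
> available on Nautilus):
>
> root@host# ceph osd crush class ls-osd nvme
> root@host# ceph osd df tree class nvme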
>
>
> In my experience, as an OSD drains Ceph keeps warning, first at
> backfillfull_ratio, then at nearfull_ratio, until usage drops back below
> 85%. I don't think it's possible to configure a silence for this.
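>
> The configured thresholds can be read straight from the OSD map (same values
> as in the osd dump header below):
>
> root@host# ceph osd dump | grep ratio
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.85
>
> So at ~91% %USE this OSD is above both nearfull and backfillfull. If the
> OSDMap had actually marked it, I would expect "nearfull" to show up next to
> "exists,up" in the state field of its line (at least that's how I remember
> it on Nautilus):
>
> root@host# ceph osd dump | grep '^osd.696 '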
>
> Current usage:
>
> root@host# ceph df  detail
> RAW STORAGE:
>     CLASS     SIZE        AVAIL        USED        RAW USED     %RAW USED
>     hdd       4.3 PiB     1022 TiB     3.3 PiB      3.3 PiB         76.71
>     nvme      161 TiB       61 TiB      82 TiB      100 TiB         62.30
>     TOTAL     4.4 PiB      1.1 PiB     3.4 PiB      3.4 PiB         76.20
>
> POOLS:
>     POOL          ID     PGS       STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR
>     meta_ru1b     17      2048     3.1 TiB       7.15G      82 TiB     92.77       2.1 TiB     N/A               N/A             7.15G      0 B             0 B
>     data_ru1b     18     16384     1.1 PiB       3.07G     3.3 PiB     88.29       148 TiB     N/A               N/A             3.07G      0 B             0 B
>
>
> Current OSD dump header:
>
> epoch 270540
> fsid ccf2c233-4adf-423c-b734-236220096d4e
> created 2019-02-14 15:30:56.642918
> modified 2021-04-21 20:33:54.481616
> flags noout,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
> crush_version 7255
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.85
> require_min_compat_client jewel
> min_compat_client jewel
> require_osd_release nautilus
> pool 17 'meta_ru1b' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 240836 lfor 0/0/51990 flags hashpspool,nodeep-scrub stripe_width 0 application metadata
> pool 18 'data_ru1b' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode warn last_change 270529 lfor 0/0/52038 flags hashpspool,nodeep-scrub stripe_width 0 application data
> max_osd 780
>
>
> Current versions:
>
> {
>     "mon": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 780
>     },
>     "mds": {},
>     "overall": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 786
>     }
> }
>
>
>
> Dan, does something like this ring a bell for you? My guess is that some
> counter type overflowed.
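>
> One way to check would be to compare the mon/mgr per-OSD stats with the
> "ceph osd df" output above (the awk filter is just mine, and the column
> layout may differ between releases):
>
> root@host# ceph pg dump osds 2>/dev/null | awk '$1 == "696"'
>
> If the USED/AVAIL numbers there don't agree with 830 GiB / 81 GiB, something
> in the stats path is off.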
>
>
>
> Thanks,
> k
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


