osd nearfull is not detected

Hi,

On an adopted cluster Prometheus fired its "osd full > 90%" alert, but Ceph itself did not warn at all. The OSD really is that full and is currently being drained (see %USE slowly dropping below):

root@host# ceph osd df name osd.696
ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.09 1.00  47     up
                     TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.09
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
root@host# ceph osd df name osd.696
ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.08 1.00  47     up
                     TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.08
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
root@host# ceph osd df name osd.696
ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL  %USE  VAR  PGS STATUS
696  nvme 0.91199  1.00000 912 GiB 830 GiB 684 GiB   8 KiB 146 GiB 81 GiB 91.07 1.00  47     up
                     TOTAL 912 GiB 830 GiB 684 GiB 8.1 KiB 146 GiB 81 GiB 91.07
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Pool 18 is backed by a different device class; its OSDs trigger the warnings as usual, but the OSDs of pool 17 do not.

root@host# ceph health detail
HEALTH_WARN noout flag(s) set; Some pool(s) have the nodeep-scrub flag(s) set; Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
OSDMAP_FLAGS noout flag(s) set
POOL_SCRUB_FLAGS Some pool(s) have the nodeep-scrub flag(s) set
    Pool meta_ru1b has nodeep-scrub flag
    Pool data_ru1b has nodeep-scrub flag
PG_BACKFILL_FULL Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
    pg 18.1008 is active+remapped+backfill_wait+backfill_toofull, acting [336,462,580]
    pg 18.27e0 is active+remapped+backfill_wait+backfill_toofull, acting [401,627,210]
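To compare the two device classes side by side I can pull the per-class view (assuming the class filter that ceph osd df accepts in Nautilus, same syntax family as the name filter used above):

root@host# ceph osd df class nvme
root@host# ceph osd df class hdd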


In my experience, as an OSD drains Ceph keeps warning while usage is above backfillfull_ratio, then while it is above nearfull_ratio, until usage drops below 85%. I don't think it is possible to configure a silence for this.
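One idea I may try (my own assumption, not something documented for this case): re-apply the same thresholds to force the nearfull/backfillfull state to be recomputed, then re-check health:

root@host# ceph osd set-nearfull-ratio 0.85
root@host# ceph osd set-backfillfull-ratio 0.90
root@host# ceph health detail | grep -iE 'nearfull|backfillfull'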

Current usage:

root@host# ceph df  detail
RAW STORAGE:
    CLASS     SIZE        AVAIL        USED        RAW USED     %RAW USED
    hdd       4.3 PiB     1022 TiB     3.3 PiB      3.3 PiB         76.71
    nvme      161 TiB       61 TiB      82 TiB      100 TiB         62.30
    TOTAL     4.4 PiB      1.1 PiB     3.4 PiB      3.4 PiB         76.20

POOLS:
    POOL          ID     PGS       STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR
    meta_ru1b     17      2048     3.1 TiB       7.15G      82 TiB     92.77       2.1 TiB     N/A               N/A             7.15G            0 B             0 B
    data_ru1b     18     16384     1.1 PiB       3.07G     3.3 PiB     88.29       148 TiB     N/A               N/A             3.07G            0 B             0 B


Current OSD dump header:

epoch 270540
fsid ccf2c233-4adf-423c-b734-236220096d4e
created 2019-02-14 15:30:56.642918
modified 2021-04-21 20:33:54.481616
flags noout,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 7255
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release nautilus
pool 17 'meta_ru1b' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 240836 lfor 0/0/51990 flags hashpspool,nodeep-scrub stripe_width 0 application metadata
pool 18 'data_ru1b' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode warn last_change 270529 lfor 0/0/52038 flags hashpspool,nodeep-scrub stripe_width 0 application data
max_osd 780


Current versions:

{
    "mon": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 780
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 786
    }
}



Dan, does this perhaps ring a bell for you? My guess is that some counter type overflowed.
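If that guess is right, the raw per-OSD stats should show it. My plan (assuming the usual Nautilus commands) is to compare what the mon/mgr hold against the same data in machine-readable form:

root@host# ceph pg dump osds -f json-pretty > /tmp/osd_stats.json   # inspect kb_used / statfs for osd.696
root@host# ceph osd df name osd.696 -f json-pretty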



Thanks,
k


