Re: ceph df reports incorrect stats

Hey Frank,

+1 to this, we've seen it a few times now as well. Below is the output of ceph
df from an internal cluster of ours that shows the same issue.

[root@Cluster1 ~]# ceph df
--- RAW STORAGE ---
CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
fast_nvme  596 GiB  595 GiB   50 MiB   1.0 GiB       0.18
hdd        653 TiB  648 TiB  5.3 TiB   5.4 TiB       0.82
nvme       1.6 TiB  1.6 TiB  251 MiB   5.2 GiB       0.32
ssd         24 TiB   22 TiB  2.8 TiB   2.8 TiB      11.65
TOTAL      680 TiB  671 TiB  8.1 TiB   8.2 TiB       1.21

--- POOLS ---
POOL                       ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics       1     1  166 MiB      116  166 MiB      0    127 TiB
cephfs_data                 3   128  214 GiB  767.78k  214 GiB   0.05    127 TiB
cephfs_metadata             4    32  967 MiB    1.19k  967 MiB      0    6.2 TiB
cephfs_42                   5   512  1.4 TiB  530.33k  1.4 TiB   0.36    254 TiB
cephfs_53                   6  1024  1.7 TiB  435.12k  1.7 TiB   0.43    238 TiB
rbd                         9  1024  1.6 TiB  446.52k  1.6 TiB   7.99    9.4 TiB
cephfs_ssd                 14   512   10 GiB    2.64k   10 GiB   0.05    6.2 TiB
cephfs_42_ssd              15   512   13 GiB   18.07k   13 GiB      0    254 TiB
default.rgw.log            22    32  3.4 KiB      207  3.4 KiB      0    127 TiB
default.rgw.meta           23    32  1.1 KiB        7  1.1 KiB      0    127 TiB
.rgw.root                  24    32  1.3 KiB        4  1.3 KiB      0    127 TiB
default.rgw.control        25    32      0 B        8      0 B      0    127 TiB
default.rgw.buckets.index  27   256  1.4 MiB       11  1.4 MiB      0    6.2 TiB
default.rgw.buckets.data   28   256   79 MiB   10.00k   79 MiB      0    127 TiB

Regards,

Bailey

> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: December 6, 2023 5:22 AM
> To: ceph-users@xxxxxxx
> Subject:  ceph df reports incorrect stats
> 
> Dear fellow cephers,
> 
> we have a problem with ceph df: it reports an incorrect USED value. It
> would be great if someone could look at this; if a ceph operator doesn't
> discover the issue, they might run out of space without noticing.
> 
> This has been reported before but didn't get much attention:
> 
> https://www.spinics.net/lists/ceph-users/msg74602.html
> https://www.spinics.net/lists/ceph-users/msg74630.html
> 
> The symptom: STORED=USED in the output of ceph df. All reports I know of
> are for Octopus clusters, but I suspect newer versions are affected as
> well. I don't have a reproducer yet (still lacking a test cluster).
> 
> Here is a correct usage report:
> 
> ==> logs/health_231203.log <==
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd        13 PiB  7.8 PiB  4.8 PiB   4.8 PiB      38.29
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.2 PiB  20.95    3.7 PiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  3.4 PiB  42.78    3.3 PiB
> 
> 
> Here is an incorrect one:
> 
> ==> logs/health_231204.log <==
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd        13 PiB  7.8 PiB  4.8 PiB   4.8 PiB      38.06
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.1 PiB  18.82    3.6 PiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  2.7 PiB  37.09    3.3 PiB
> 
> 
> That the first report is correct and the second is not is supported by the
> output of ceph osd df tree, which shows 4.6 PiB in use, in line with the
> first ceph df output. Note that the ceph osd df tree output below is from
> the same date as the incorrect ceph df output; hence, ceph osd df tree is
> *not* affected by this issue:
> 
> ==> ceph osd df tree 231204 <==
> SIZE      RAW USE  DATA     OMAP     META     AVAIL    NAME
>   12 PiB  4.6 PiB  4.6 PiB  2.2 TiB   19 TiB  7.5 PiB  datacenter ContainerSquare
>    0 B      0 B      0 B      0 B      0 B      0 B      room CON-161-A
>   12 PiB  4.6 PiB  4.6 PiB  2.2 TiB   19 TiB  7.5 PiB      room CON-161-A1
> 
> 
> In our case, the problem showed up out of nowhere. Here is the log snippet
> for the time window within which the flip happened (compare the lines for
> the con-fs2-data? pools):
> 
> ==> logs/health_231203.log <==
> ceph status/df/pool stats/health detail at 16:30:03:
>   cluster:
>     health: HEALTH_OK
> 
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
>     mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1279 up (since 14h), 1279 in (since 2w)
> 
>   task status:
> 
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.23G objects, 4.0 PiB
>     usage:   5.0 PiB used, 8.1 PiB / 13 PiB avail
>     pgs:     25035 active+clean
>              29    active+clean+scrubbing+deep
>              1     active+clean+scrubbing
> 
>   io:
>     client:   215 MiB/s rd, 140 MiB/s wr, 2.34k op/s rd, 1.89k op/s wr
> 
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> fs_meta    51 TiB   45 TiB  831 GiB   6.0 TiB      11.84
> hdd        13 PiB  7.8 PiB  4.8 PiB   4.8 PiB      38.08
> rbd_data  283 TiB  171 TiB  111 TiB   112 TiB      39.44
> rbd_perf   42 TiB   22 TiB   20 TiB    20 TiB      48.60
> TOTAL      13 PiB  8.1 PiB  4.9 PiB   5.0 PiB      38.04
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> sr-rbd-meta-one         1   128   13 GiB   16.57k   38 GiB   0.03     39 TiB
> sr-rbd-data-one         2  4096  121 TiB   32.06M  108 TiB  48.08     88 TiB
> sr-rbd-one-stretch      3   160  262 GiB   68.81k  573 GiB   0.48     39 TiB
> con-rbd-meta-hpc-one    7    50   12 KiB       45  372 KiB      0    9.2 TiB
> con-rbd-data-hpc-one    8   150   24 GiB    6.10k   24 GiB      0    3.6 PiB
> sr-rbd-data-one-hdd    11  1024  137 TiB   35.95M  193 TiB  46.57    166 TiB
> con-fs2-meta1          12   512  554 GiB   76.76M  2.2 TiB   7.26    6.9 TiB
> con-fs2-meta2          13  4096      0 B  574.23M      0 B      0    6.9 TiB
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.2 PiB  21.09    3.6 PiB
> con-fs2-data-ec-ssd    17   256  700 GiB    7.27M  706 GiB   2.44     22 TiB
> ms-rbd-one             18   256  805 GiB  210.92k  1.4 TiB   1.18     39 TiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  3.4 PiB  42.96    3.3 PiB
> sr-rbd-data-one-perf   20  4096  6.8 TiB    1.81M   20 TiB  57.09    5.1 TiB
> device_health_metrics  21     1  1.4 GiB    1.11k  4.2 GiB      0     39 TiB
> 
> 
> ceph status/df/pool stats/health detail at 16:30:10:
>   cluster:
>     health: HEALTH_OK
> 
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
>     mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1279 up (since 14h), 1279 in (since 2w)
> 
>   task status:
> 
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.23G objects, 4.0 PiB
>     usage:   5.0 PiB used, 8.1 PiB / 13 PiB avail
>     pgs:     25035 active+clean
>              29    active+clean+scrubbing+deep
>              1     active+clean+scrubbing
> 
>   io:
>     client:   241 MiB/s rd, 174 MiB/s wr, 2.68k op/s rd, 2.34k op/s wr
> 
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> fs_meta    51 TiB   45 TiB  830 GiB   6.0 TiB      11.84
> hdd        13 PiB  7.8 PiB  4.8 PiB   4.8 PiB      38.08
> rbd_data  283 TiB  171 TiB  111 TiB   112 TiB      39.44
> rbd_perf   42 TiB   22 TiB   20 TiB    20 TiB      48.60
> TOTAL      13 PiB  8.1 PiB  4.9 PiB   5.0 PiB      38.04
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> sr-rbd-meta-one         1   128   13 GiB   16.57k   13 GiB   0.01     39 TiB
> sr-rbd-data-one         2  4096   92 TiB   32.06M   92 TiB  44.11     88 TiB
> sr-rbd-one-stretch      3   160  222 GiB   68.81k  222 GiB   0.19     39 TiB
> con-rbd-meta-hpc-one    7    50  6.9 KiB       45  6.9 KiB      0    9.2 TiB
> con-rbd-data-hpc-one    8   150   23 GiB    6.10k   23 GiB      0    3.6 PiB
> sr-rbd-data-one-hdd    11  1024  135 TiB   35.95M  135 TiB  37.88    166 TiB
> con-fs2-meta1          12   512  367 GiB   76.76M  367 GiB   1.28    6.9 TiB
> con-fs2-meta2          13  4096      0 B  574.23M      0 B      0    6.9 TiB
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.1 PiB  18.82    3.6 PiB
> con-fs2-data-ec-ssd    17   256  515 GiB    7.27M  515 GiB   1.79     22 TiB
> ms-rbd-one             18   256  579 GiB  210.92k  579 GiB   0.48     39 TiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  2.7 PiB  37.09    3.3 PiB
> sr-rbd-data-one-perf   20  4096  6.9 TiB    1.81M  6.9 TiB  31.29    5.1 TiB
> device_health_metrics  21     1  1.2 GiB    1.11k  1.2 GiB      0     39 TiB
> 
> For us, the issue disappeared after taking down some OSDs in a second crush
> root. These OSDs had been moved there for draining; we use a second crush
> root for this purpose. Here is the log snippet for the time window within
> which the flip back to correct reporting happened:
> 
> ==> logs/health_231205.log <==
> ceph status/df/pool stats/health detail at 17:42:58:
>   cluster:
>     health: HEALTH_WARN
>             1 osds down
>             24 hosts (12 osds) down
>             1 root (12 osds) down
> 
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
>     mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1267 up (since 19m), 1268 in (since 0.401448s)
> 
>   task status:
> 
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.23G objects, 4.0 PiB
>     usage:   5.0 PiB used, 8.0 PiB / 13 PiB avail
>     pgs:     25034 active+clean
>              31    active+clean+scrubbing+deep
> 
>   io:
>     client:   118 MiB/s rd, 789 MiB/s wr, 1.75k op/s rd, 2.14k op/s wr
> 
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> fs_meta    51 TiB   45 TiB  731 GiB   5.9 TiB      11.65
> hdd        13 PiB  7.8 PiB  4.8 PiB   4.8 PiB      38.36
> rbd_data  283 TiB  171 TiB  111 TiB   112 TiB      39.59
> rbd_perf   42 TiB   22 TiB   20 TiB    20 TiB      48.19
> TOTAL      13 PiB  8.0 PiB  4.9 PiB   5.0 PiB      38.32
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> sr-rbd-meta-one         1   128   14 GiB   16.94k   14 GiB   0.01     39 TiB
> sr-rbd-data-one         2  4096   93 TiB   32.32M   93 TiB  44.29     88 TiB
> sr-rbd-one-stretch      3   160  222 GiB   68.81k  222 GiB   0.19     39 TiB
> con-rbd-meta-hpc-one    7    50  6.9 KiB       45  6.9 KiB      0    9.2 TiB
> con-rbd-data-hpc-one    8   150   23 GiB    6.10k   23 GiB      0    3.6 PiB
> sr-rbd-data-one-hdd    11  1024  135 TiB   36.08M  135 TiB  38.00    165 TiB
> con-fs2-meta1          12   512  367 GiB   76.81M  367 GiB   1.28    6.9 TiB
> con-fs2-meta2          13  4096      0 B  572.65M      0 B      0    6.9 TiB
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.1 PiB  18.83    3.6 PiB
> con-fs2-data-ec-ssd    17   256  515 GiB    7.27M  515 GiB   1.78     22 TiB
> ms-rbd-one             18   256  579 GiB  210.92k  579 GiB   0.48     39 TiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  2.7 PiB  37.16    3.3 PiB
> sr-rbd-data-one-perf   20  4096  6.9 TiB    1.81M  6.9 TiB  31.07    5.1 TiB
> device_health_metrics  21     1  1.2 GiB    1.11k  1.2 GiB      0     39 TiB
> 
> 
> ceph status/df/pool stats/health detail at 17:43:04:
>   cluster:
>     health: HEALTH_OK
> 
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
>     mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1284 osds: 1267 up (since 19m), 1267 in (since 6s)
> 
>   task status:
> 
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 2.23G objects, 4.0 PiB
>     usage:   5.0 PiB used, 8.0 PiB / 13 PiB avail
>     pgs:     25035 active+clean
>              30    active+clean+scrubbing+deep
> 
>   io:
>     client:   151 MiB/s rd, 840 MiB/s wr, 2.13k op/s rd, 2.10k op/s wr
> 
> --- RAW STORAGE ---
> CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
> fs_meta    51 TiB   45 TiB  731 GiB   5.9 TiB      11.65
> hdd        13 PiB  7.7 PiB  4.8 PiB   4.8 PiB      38.42
> rbd_data  283 TiB  171 TiB  111 TiB   112 TiB      39.59
> rbd_perf   42 TiB   22 TiB   20 TiB    20 TiB      48.19
> TOTAL      13 PiB  8.0 PiB  4.9 PiB   5.0 PiB      38.37
> 
> --- POOLS ---
> POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> sr-rbd-meta-one         1   128   14 GiB   16.94k   42 GiB   0.04     39 TiB
> sr-rbd-data-one         2  4096  122 TiB   32.32M  109 TiB  48.26     88 TiB
> sr-rbd-one-stretch      3   160  262 GiB   68.81k  573 GiB   0.48     39 TiB
> con-rbd-meta-hpc-one    7    50   11 KiB       45  368 KiB      0    9.2 TiB
> con-rbd-data-hpc-one    8   150   24 GiB    6.10k   24 GiB      0    3.6 PiB
> sr-rbd-data-one-hdd    11  1024  138 TiB   36.08M  193 TiB  46.69    165 TiB
> con-fs2-meta1          12   512  555 GiB   76.81M  2.2 TiB   7.26    6.9 TiB
> con-fs2-meta2          13  4096      0 B  572.65M      0 B      0    6.9 TiB
> con-fs2-data           14  2048  1.1 PiB  402.93M  1.2 PiB  21.09    3.6 PiB
> con-fs2-data-ec-ssd    17   256  700 GiB    7.27M  706 GiB   2.43     22 TiB
> ms-rbd-one             18   256  805 GiB  210.92k  1.4 TiB   1.18     39 TiB
> con-fs2-data2          19  8192  2.7 PiB    1.10G  3.4 PiB  43.01    3.3 PiB
> sr-rbd-data-one-perf   20  4096  6.8 TiB    1.81M   20 TiB  56.75    5.1 TiB
> device_health_metrics  21     1  1.4 GiB    1.11k  4.2 GiB      0     39 TiB
> 
> This leads me to suspect that having multiple crush roots might be a cause
> of the incorrect reporting. Our crush tree looks like this (OSDs removed);
> it has 3 different roots (BB, DTU and default):
> 
> ID    CLASS     WEIGHT       TYPE NAME                           STATUS  REWEIGHT  PRI-AFF
>  -78              106.92188  root BB
>  -99                      0      host bb-04
> -102                      0      host bb-05
> -105                      0      host bb-06
> -325                      0      host bb-06-old
> -108                      0      host bb-07
> -331                      0      host bb-07-old
>   -3                8.91016      host bb-08
>   -9                8.91016      host bb-09
>  -18                8.91016      host bb-10
>  -21                8.91016      host bb-11
>  -28                8.91016      host bb-12
>  -34                8.91016      host bb-13
>  -72                8.91016      host bb-14
>  -75                8.91016      host bb-15
> -111                8.91016      host bb-16
> -114                8.91016      host bb-17
> -117                      0      host bb-18
> -142                      0      host bb-19
> -145                      0      host bb-20
> -241                      0      host bb-21
> -246                      0      host bb-22
> -251                8.91016      host bb-23
> -256                8.91016      host bb-24
> -151                      0      host bb-office
>  -40            14614.77832  root DTU
>  -42                      0      region Lyngby
>  -41            14614.77832      region Risoe
>  -50            12843.79590          datacenter ContainerSquare
>  -56                      0              room CON-161-A
>  -57            12843.79590              room CON-161-A1
>  -11             1092.49060                  host ceph-08
>  -13             1074.27673                  host ceph-09
>  -23             1075.67920                  host ceph-10
>  -15             1067.16492                  host ceph-11
>  -25             1080.21912                  host ceph-12
>  -83             1061.17480                  host ceph-13
>  -85             1047.70276                  host ceph-14
>  -87             1079.02820                  host ceph-15
> -136             1012.55048                  host ceph-16
> -139             1073.61475                  host ceph-17
> -261             1125.57202                  host ceph-23
> -262             1054.32227                  host ceph-24
> -148              885.49133          datacenter MultiSite
>  -65               86.16304              host ceph-04
>  -67              101.50623              host ceph-05
>  -69              104.85805              host ceph-06
>  -71               96.39923              host ceph-07
>  -81               97.54230              host ceph-18
>  -94               98.48271              host ceph-19
>   -4               97.20181              host ceph-20
>  -64               99.77657              host ceph-21
>  -66              103.56137              host ceph-22
>  -49              885.49133          datacenter ServerRoom
>  -55              885.49133              room SR-113
>  -65               86.16304                  host ceph-04
>  -67              101.50623                  host ceph-05
>  -69              104.85805                  host ceph-06
>  -71               96.39923                  host ceph-07
>  -81               97.54230                  host ceph-18
>  -94               98.48271                  host ceph-19
>   -4               97.20181                  host ceph-20
>  -64               99.77657                  host ceph-21
>  -66              103.56137                  host ceph-22
>   -1                      0  root default
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



