Dear fellow cephers,

we have a problem with ceph df: it reports an incorrect USED value. It would be great if someone could look at this; if a ceph operator doesn't discover the issue, they might run out of space without noticing. It has been reported before but didn't get much attention:

https://www.spinics.net/lists/ceph-users/msg74602.html
https://www.spinics.net/lists/ceph-users/msg74630.html

The symptom: STORED=USED in the output of ceph df. All reports I know of are for octopus clusters, but I suspect newer versions are affected as well. I don't have a reproducer yet (still lacking a test cluster).

Here is a correct usage report:

==> logs/health_231203.log <==
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED     RAW USED  %RAW USED
hdd      13 PiB  7.8 PiB  4.8 PiB  4.8 PiB       38.29

--- POOLS ---
POOL           ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
con-fs2-data   14  2048  1.1 PiB  402.93M  1.2 PiB  20.95    3.7 PiB
con-fs2-data2  19  8192  2.7 PiB  1.10G    3.4 PiB  42.78    3.3 PiB

Here is an incorrect one:

==> logs/health_231204.log <==
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED     RAW USED  %RAW USED
hdd      13 PiB  7.8 PiB  4.8 PiB  4.8 PiB       38.06

--- POOLS ---
POOL           ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
con-fs2-data   14  2048  1.1 PiB  402.93M  1.1 PiB  18.82    3.6 PiB
con-fs2-data2  19  8192  2.7 PiB  1.10G    2.7 PiB  37.09    3.3 PiB

That the first report is correct and not the second is supported by the output of ceph osd df tree, which shows a usage of 4.6 PiB, in line with the first ceph df output. Note that the date of the ceph osd df tree output is identical to the date of the incorrect ceph df output; hence, ceph osd df tree is *not* affected by this issue:

==> ceph osd df tree 231204 <==
SIZE    RAW USE  DATA     OMAP     META    AVAIL    NAME
12 PiB  4.6 PiB  4.6 PiB  2.2 TiB  19 TiB  7.5 PiB  datacenter ContainerSquare
0 B     0 B      0 B      0 B      0 B     0 B          room CON-161-A
12 PiB  4.6 PiB  4.6 PiB  2.2 TiB  19 TiB  7.5 PiB      room CON-161-A1

In our case, the problem showed up out of nowhere. If you want to check whether your own cluster currently shows the symptom, a simple per-pool comparison of STORED and USED does the job; see the sketch below.
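For anyone who wants to watch for the symptom automatically, here is a minimal sketch of a check that could be run from cron. It assumes that the per-pool stats in "ceph df --format json" expose "stored" and "bytes_used" fields (that is what our octopus cluster reports); field names may differ on other releases, so treat this as an illustration rather than a polished tool:

#!/usr/bin/env python3
# Sketch only: flag pools whose USED equals STORED in "ceph df".
# Assumes per-pool "stored" and "bytes_used" fields in the JSON output
# of "ceph df --format json"; adjust for your release if they differ.
import json
import subprocess

def pool_stats():
    # Needs to run on a host with admin access to the cluster.
    out = subprocess.check_output(["ceph", "df", "--format", "json"])
    return json.loads(out)["pools"]

def main():
    suspicious = []
    for pool in pool_stats():
        stats = pool["stats"]
        stored = stats.get("stored", 0)
        used = stats.get("bytes_used", 0)
        # With replication or EC overhead, USED should be noticeably larger
        # than STORED; USED == STORED on a non-empty pool is the symptom
        # (ignoring the corner case of size-1 pools).
        if stored > 0 and used == stored:
            suspicious.append((pool["name"], stored, used))
    if suspicious:
        print("pools with STORED == USED (possibly bad accounting):")
        for name, stored, used in suspicious:
            print("  {}: stored={} used={}".format(name, stored, used))
    else:
        print("no pool shows the STORED == USED symptom")

if __name__ == "__main__":
    main()

If it flags pools, a cross-check against ceph osd df tree (as above) should tell you whether the per-pool accounting is off.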
Here is the log snippet for the time window within which the flip happened (compare the lines for the con-fs2-data? pools):

==> logs/health_231203.log <==
ceph status/df/pool stats/health detail at 16:30:03:

  cluster:
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
    mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1279 up (since 14h), 1279 in (since 2w)

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.23G objects, 4.0 PiB
    usage:   5.0 PiB used, 8.1 PiB / 13 PiB avail
    pgs:     25035 active+clean
             29    active+clean+scrubbing+deep
             1     active+clean+scrubbing

  io:
    client:   215 MiB/s rd, 140 MiB/s wr, 2.34k op/s rd, 1.89k op/s wr

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   51 TiB   45 TiB   831 GiB  6.0 TiB       11.84
hdd       13 PiB   7.8 PiB  4.8 PiB  4.8 PiB       38.08
rbd_data  283 TiB  171 TiB  111 TiB  112 TiB       39.44
rbd_perf  42 TiB   22 TiB   20 TiB   20 TiB        48.60
TOTAL     13 PiB   8.1 PiB  4.9 PiB  5.0 PiB       38.04

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  13 GiB   16.57k   38 GiB    0.03     39 TiB
sr-rbd-data-one         2  4096  121 TiB  32.06M   108 TiB  48.08     88 TiB
sr-rbd-one-stretch      3   160  262 GiB  68.81k   573 GiB   0.48     39 TiB
con-rbd-meta-hpc-one    7    50  12 KiB   45       372 KiB      0    9.2 TiB
con-rbd-data-hpc-one    8   150  24 GiB   6.10k    24 GiB       0    3.6 PiB
sr-rbd-data-one-hdd    11  1024  137 TiB  35.95M   193 TiB  46.57    166 TiB
con-fs2-meta1          12   512  554 GiB  76.76M   2.2 TiB   7.26    6.9 TiB
con-fs2-meta2          13  4096  0 B      574.23M  0 B          0    6.9 TiB
con-fs2-data           14  2048  1.1 PiB  402.93M  1.2 PiB  21.09    3.6 PiB
con-fs2-data-ec-ssd    17   256  700 GiB  7.27M    706 GiB   2.44     22 TiB
ms-rbd-one             18   256  805 GiB  210.92k  1.4 TiB   1.18     39 TiB
con-fs2-data2          19  8192  2.7 PiB  1.10G    3.4 PiB  42.96    3.3 PiB
sr-rbd-data-one-perf   20  4096  6.8 TiB  1.81M    20 TiB   57.09    5.1 TiB
device_health_metrics  21     1  1.4 GiB  1.11k    4.2 GiB      0     39 TiB

ceph status/df/pool stats/health detail at 16:30:10:

  cluster:
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
    mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1279 up (since 14h), 1279 in (since 2w)

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.23G objects, 4.0 PiB
    usage:   5.0 PiB used, 8.1 PiB / 13 PiB avail
    pgs:     25035 active+clean
             29    active+clean+scrubbing+deep
             1     active+clean+scrubbing

  io:
    client:   241 MiB/s rd, 174 MiB/s wr, 2.68k op/s rd, 2.34k op/s wr

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   51 TiB   45 TiB   830 GiB  6.0 TiB       11.84
hdd       13 PiB   7.8 PiB  4.8 PiB  4.8 PiB       38.08
rbd_data  283 TiB  171 TiB  111 TiB  112 TiB       39.44
rbd_perf  42 TiB   22 TiB   20 TiB   20 TiB        48.60
TOTAL     13 PiB   8.1 PiB  4.9 PiB  5.0 PiB       38.04

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  13 GiB   16.57k   13 GiB    0.01     39 TiB
sr-rbd-data-one         2  4096  92 TiB   32.06M   92 TiB   44.11     88 TiB
sr-rbd-one-stretch      3   160  222 GiB  68.81k   222 GiB   0.19     39 TiB
con-rbd-meta-hpc-one    7    50  6.9 KiB  45       6.9 KiB      0    9.2 TiB
con-rbd-data-hpc-one    8   150  23 GiB   6.10k    23 GiB       0    3.6 PiB
sr-rbd-data-one-hdd    11  1024  135 TiB  35.95M   135 TiB  37.88    166 TiB
con-fs2-meta1          12   512  367 GiB  76.76M   367 GiB   1.28    6.9 TiB
con-fs2-meta2          13  4096  0 B      574.23M  0 B          0    6.9 TiB
con-fs2-data           14  2048  1.1 PiB  402.93M  1.1 PiB  18.82    3.6 PiB
con-fs2-data-ec-ssd    17   256  515 GiB  7.27M    515 GiB   1.79     22 TiB
ms-rbd-one             18   256  579 GiB  210.92k  579 GiB   0.48     39 TiB
con-fs2-data2          19  8192  2.7 PiB  1.10G    2.7 PiB  37.09    3.3 PiB
sr-rbd-data-one-perf   20  4096  6.9 TiB  1.81M    6.9 TiB  31.29    5.1 TiB
device_health_metrics  21     1  1.2 GiB  1.11k    1.2 GiB      0     39 TiB

For us, the issue disappeared after taking down some OSDs in a second root. These OSDs had been moved there for draining; we use a second crush root for this purpose. Here is the log snippet with the time window within which the back-flip to correct reporting happened:

==> logs/health_231205.log <==
ceph status/df/pool stats/health detail at 17:42:58:

  cluster:
    health: HEALTH_WARN
            1 osds down
            24 hosts (12 osds) down
            1 root (12 osds) down

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
    mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1267 up (since 19m), 1268 in (since 0.401448s)

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.23G objects, 4.0 PiB
    usage:   5.0 PiB used, 8.0 PiB / 13 PiB avail
    pgs:     25034 active+clean
             31    active+clean+scrubbing+deep

  io:
    client:   118 MiB/s rd, 789 MiB/s wr, 1.75k op/s rd, 2.14k op/s wr

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   51 TiB   45 TiB   731 GiB  5.9 TiB       11.65
hdd       13 PiB   7.8 PiB  4.8 PiB  4.8 PiB       38.36
rbd_data  283 TiB  171 TiB  111 TiB  112 TiB       39.59
rbd_perf  42 TiB   22 TiB   20 TiB   20 TiB        48.19
TOTAL     13 PiB   8.0 PiB  4.9 PiB  5.0 PiB       38.32

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  14 GiB   16.94k   14 GiB    0.01     39 TiB
sr-rbd-data-one         2  4096  93 TiB   32.32M   93 TiB   44.29     88 TiB
sr-rbd-one-stretch      3   160  222 GiB  68.81k   222 GiB   0.19     39 TiB
con-rbd-meta-hpc-one    7    50  6.9 KiB  45       6.9 KiB      0    9.2 TiB
con-rbd-data-hpc-one    8   150  23 GiB   6.10k    23 GiB       0    3.6 PiB
sr-rbd-data-one-hdd    11  1024  135 TiB  36.08M   135 TiB  38.00    165 TiB
con-fs2-meta1          12   512  367 GiB  76.81M   367 GiB   1.28    6.9 TiB
con-fs2-meta2          13  4096  0 B      572.65M  0 B          0    6.9 TiB
con-fs2-data           14  2048  1.1 PiB  402.93M  1.1 PiB  18.83    3.6 PiB
con-fs2-data-ec-ssd    17   256  515 GiB  7.27M    515 GiB   1.78     22 TiB
ms-rbd-one             18   256  579 GiB  210.92k  579 GiB   0.48     39 TiB
con-fs2-data2          19  8192  2.7 PiB  1.10G    2.7 PiB  37.16    3.3 PiB
sr-rbd-data-one-perf   20  4096  6.9 TiB  1.81M    6.9 TiB  31.07    5.1 TiB
device_health_metrics  21     1  1.2 GiB  1.11k    1.2 GiB      0     39 TiB

ceph status/df/pool stats/health detail at 17:43:04:

  cluster:
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 3M)
    mgr: ceph-25(active, since 2M), standbys: ceph-26, ceph-01, ceph-03, ceph-02
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1267 up (since 19m), 1267 in (since 6s)

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.23G objects, 4.0 PiB
    usage:   5.0 PiB used, 8.0 PiB / 13 PiB avail
    pgs:     25035 active+clean
             30    active+clean+scrubbing+deep

  io:
    client:   151 MiB/s rd, 840 MiB/s wr, 2.13k op/s rd, 2.10k op/s wr

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   51 TiB   45 TiB   731 GiB  5.9 TiB       11.65
hdd       13 PiB   7.7 PiB  4.8 PiB  4.8 PiB       38.42
rbd_data  283 TiB  171 TiB  111 TiB  112 TiB       39.59
rbd_perf  42 TiB   22 TiB   20 TiB   20 TiB        48.19
TOTAL     13 PiB   8.0 PiB  4.9 PiB  5.0 PiB       38.37

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  14 GiB   16.94k   42 GiB    0.04     39 TiB
sr-rbd-data-one         2  4096  122 TiB  32.32M   109 TiB  48.26     88 TiB
sr-rbd-one-stretch      3   160  262 GiB  68.81k   573 GiB   0.48     39 TiB
con-rbd-meta-hpc-one    7    50  11 KiB   45       368 KiB      0    9.2 TiB
con-rbd-data-hpc-one    8   150  24 GiB   6.10k    24 GiB       0    3.6 PiB
sr-rbd-data-one-hdd    11  1024  138 TiB  36.08M   193 TiB  46.69    165 TiB
con-fs2-meta1          12   512  555 GiB  76.81M   2.2 TiB   7.26    6.9 TiB
con-fs2-meta2          13  4096  0 B      572.65M  0 B          0    6.9 TiB
con-fs2-data           14  2048  1.1 PiB  402.93M  1.2 PiB  21.09    3.6 PiB
con-fs2-data-ec-ssd    17   256  700 GiB  7.27M    706 GiB   2.43     22 TiB
ms-rbd-one             18   256  805 GiB  210.92k  1.4 TiB   1.18     39 TiB
con-fs2-data2          19  8192  2.7 PiB  1.10G    3.4 PiB  43.01    3.3 PiB
sr-rbd-data-one-perf   20  4096  6.8 TiB  1.81M    20 TiB   56.75    5.1 TiB
device_health_metrics  21     1  1.4 GiB  1.11k    4.2 GiB      0     39 TiB

This observation leads me to suspect that having multiple crush roots might be part of the cause. Our crush tree looks like this (OSDs removed); it has 3 different roots (BB, DTU and default):

ID    CLASS  WEIGHT       TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
 -78           106.92188  root BB
 -99                   0      host bb-04
-102                   0      host bb-05
-105                   0      host bb-06
-325                   0      host bb-06-old
-108                   0      host bb-07
-331                   0      host bb-07-old
  -3             8.91016      host bb-08
  -9             8.91016      host bb-09
 -18             8.91016      host bb-10
 -21             8.91016      host bb-11
 -28             8.91016      host bb-12
 -34             8.91016      host bb-13
 -72             8.91016      host bb-14
 -75             8.91016      host bb-15
-111             8.91016      host bb-16
-114             8.91016      host bb-17
-117                   0      host bb-18
-142                   0      host bb-19
-145                   0      host bb-20
-241                   0      host bb-21
-246                   0      host bb-22
-251             8.91016      host bb-23
-256             8.91016      host bb-24
-151                   0      host bb-office
 -40         14614.77832  root DTU
 -42                   0      region Lyngby
 -41         14614.77832      region Risoe
 -50         12843.79590          datacenter ContainerSquare
 -56                   0              room CON-161-A
 -57         12843.79590              room CON-161-A1
 -11          1092.49060                  host ceph-08
 -13          1074.27673                  host ceph-09
 -23          1075.67920                  host ceph-10
 -15          1067.16492                  host ceph-11
 -25          1080.21912                  host ceph-12
 -83          1061.17480                  host ceph-13
 -85          1047.70276                  host ceph-14
 -87          1079.02820                  host ceph-15
-136          1012.55048                  host ceph-16
-139          1073.61475                  host ceph-17
-261          1125.57202                  host ceph-23
-262          1054.32227                  host ceph-24
-148           885.49133          datacenter MultiSite
 -65            86.16304              host ceph-04
 -67           101.50623              host ceph-05
 -69           104.85805              host ceph-06
 -71            96.39923              host ceph-07
 -81            97.54230              host ceph-18
 -94            98.48271              host ceph-19
  -4            97.20181              host ceph-20
 -64            99.77657              host ceph-21
 -66           103.56137              host ceph-22
 -49           885.49133          datacenter ServerRoom
 -55           885.49133              room SR-113
 -65            86.16304                  host ceph-04
 -67           101.50623                  host ceph-05
 -69           104.85805                  host ceph-06
 -71            96.39923                  host ceph-07
 -81            97.54230                  host ceph-18
 -94            98.48271                  host ceph-19
  -4            97.20181                  host ceph-20
 -64            99.77657                  host ceph-21
 -66           103.56137                  host ceph-22
  -1                   0  root default

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx