Hi David,

I seem to observe the same:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/QIMBPA5VDRJEUEV62SAD6UCE4QPV4GTY/#QIMBPA5VDRJEUEV62SAD6UCE4QPV4GTY

Ceph df was reporting correctly for a while but flipped back to stored=used at some point. Today it was showing this again:

--- RAW STORAGE ---
CLASS        SIZE     AVAIL     USED  RAW USED  %RAW USED
fs_meta   8.7 TiB  8.6 TiB   11 GiB    99 GiB       1.11
hdd        11 PiB  8.2 PiB  3.1 PiB   3.1 PiB      27.24
rbd_data  262 TiB  153 TiB  106 TiB   109 TiB      41.69
rbd_perf   31 TiB   19 TiB   12 TiB    12 TiB      39.04
TOTAL      12 PiB  8.4 PiB  3.2 PiB   3.2 PiB      27.58

--- POOLS ---
POOL                   ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
sr-rbd-meta-one         1   128  7.9 GiB   12.95k  7.9 GiB   0.01     25 TiB
sr-rbd-data-one         2  2048   76 TiB   27.15M   76 TiB  43.88     73 TiB
sr-rbd-one-stretch      3   160  222 GiB   68.81k  222 GiB   0.29     25 TiB
con-rbd-meta-hpc-one    7    50   54 KiB       61   54 KiB      0     12 TiB
con-rbd-data-hpc-one    8   150   36 GiB    9.42k   36 GiB      0    5.0 PiB
sr-rbd-data-one-hdd    11   560  127 TiB   33.84M  127 TiB  36.98    162 TiB
con-fs2-meta1          12   256  240 GiB   40.80M  240 GiB   0.65    9.0 TiB
con-fs2-meta2          13  1024      0 B  363.33M      0 B      0    9.0 TiB
con-fs2-data           14  1350  1.1 PiB  407.07M  1.1 PiB  14.45    5.0 PiB
con-fs2-data-ec-ssd    17   128  386 GiB    6.57M  386 GiB   1.03     29 TiB
ms-rbd-one             18   256  417 GiB  166.89k  417 GiB   0.53     25 TiB
con-fs2-data2          19  8192  1.3 PiB  537.38M  1.3 PiB  16.96    4.6 PiB
sr-rbd-data-one-perf   20  4096  4.3 TiB    1.13M  4.3 TiB  20.04    5.7 TiB
device_health_metrics  21     1  196 MiB      995  196 MiB      0     25 TiB

I do not even believe that STORED is correct everywhere; the numbers are very different in the other form of report. This is really irritating. I think you should file a bug report.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: David Alfano <dalfano@xxxxxxxxx>
Sent: 26 May 2022 21:47:20
To: ceph-users@xxxxxxx
Subject: ceph df reporting incorrect used space after pg reduction

Howdy Ceph-Users!

Over the past few days, I've noticed an interesting behavior in 15.2.15 that I'm curious if anyone else can reproduce. After setting up a few pools and running some load against them, I lowered the number of PGs in the TestA pool from 4096 to 1024. To track the progress of the merging PGs, I stuck a watch on `ceph df` and let it run. No I/O was happening on the cluster during the decrease of PGs. After a few hours, I came back to see that the process had completed, but now `ceph df` was reporting far different usage values than what I had started with.
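(For anyone who wants to reproduce the sampling: a shell loop roughly like the one below produces the format of the snippet that follows. This is an approximation, not the exact command; the ~10 s interval and the grep filter are inferred from the timestamps and pool names in the output.)

    while true; do
        echo "============================================================================"
        date -u                                        # timestamp each sample
        ceph df | grep -E 'TestA|TestB|test1|test2'    # keep only the pool lines of interest
        echo "============================================================================"
        sleep 10                                       # interval inferred from the timestamps
    done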
============================================================================
Thu May 26 16:39:55 UTC 2022
TestA   10  1412   30 GiB  1.59k   37 GiB  0   20 PiB
TestB   11   256  537 KiB      1  1.6 MiB  0  8.0 PiB
test1   12    32   58 GiB  3.38k  174 GiB  0  8.0 PiB
test2   13    64  916 KiB      5  3.2 MiB  0  8.0 PiB
============================================================================
============================================================================
Thu May 26 16:40:05 UTC 2022
TestA   10  1409   30 GiB  1.59k   37 GiB  0   20 PiB
TestB   11   256  537 KiB      1  1.6 MiB  0  8.0 PiB
test1   12    32   58 GiB  3.38k  174 GiB  0  8.0 PiB
test2   13    64  916 KiB      5  3.2 MiB  0  8.0 PiB
============================================================================
============================================================================
Thu May 26 16:40:16 UTC 2022
TestA   10  1407   30 GiB  1.59k   30 GiB  0   20 PiB
TestB   11   256      0 B      1      0 B  0  8.0 PiB
test1   12    32   58 GiB  3.38k   58 GiB  0  8.0 PiB
test2   13    64  3.8 KiB      5  3.8 KiB  0  8.0 PiB
============================================================================

Snippet from a `ceph df` taken during the merge process.

Pool info:
- TestA is a 10/2 EC pool
- TestB is a 3x replicated pool for metadata
- test1 is a 3x replicated pool for data
- test2 is a 3x replicated pool for metadata

TestA erasure-code-profile:

root@Pikachu:~# ceph osd erasure-code-profile get TestA_ec_profile
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=10
m=2
plugin=jerasure
technique=reed_sol_van
w=8

root@Pikachu:~# ceph versions
{
    "mon": {
        "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 2267
    },
    "mds": {},
    "rgw": {
        "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 63
    },
    "overall": {
        "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 2336
    }
}

I've confirmed that the objects and their copies still exist within the cluster, which makes me believe this is purely a reporting issue. If I had to guess, the value for USED space is somehow being set to the value for STORED data.

I've been able to reproduce the behavior consistently with the following process:
- Create several pools, with one being EC 10/2
- Set the EC 10/2 pool's pg_num and pgp_num to 4096 PGs
- Put data into all pools
- Lower the EC 10/2 pool's pg_num and pgp_num to 1024
- Around the time the EC 10/2 pool reaches about 1400 PGs, `ceph df` starts reporting differently

As a workaround, to get `ceph df` to report the correct information, all that is needed is to increase the pg_num and pgp_num of any of the three other pools.

Has anyone else noticed this behavior? Should I file a bug report or is this already known?

Respectfully,
David
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx