Hi all,

during maintenance yesterday we observed something very strange on our production cluster. We needed to move the storage of a few small pools from slow to fast SSDs; the pools affected by this operation were con-rbd-meta-hpc-one, con-fs2-meta1 and con-fs2-meta2 (see the ceph df output below). We changed the device class in the crush rule to move the data, and the first strange thing happened: in addition to the pools we were moving to different disks, pool sr-rbd-data-one-perf, which sits in a completely different sub-tree on a completely different device class, also showed 3 remapped PGs. I don't see how this is even possible.

While editing the crush rule we had norebalance set and let peering finish before any data movement, because we also wanted to check the new mappings before letting data move. After unsetting norebalance, the 3 PGs on sr-rbd-data-one-perf became clean almost instantly. Then the second strange thing happened: the output of ceph df changed immediately and completely. Before starting the data movement it looked like this:

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd       11 PiB   8.2 PiB  3.0 PiB  3.0 PiB       27.16
rbd_data  262 TiB  154 TiB  105 TiB  108 TiB       41.23
rbd_perf  31 TiB   19 TiB   12 TiB   12 TiB        38.93
ssd       8.4 TiB  7.1 TiB  15 GiB   1.3 TiB       15.09
TOTAL     12 PiB   8.3 PiB  3.1 PiB  3.2 PiB       27.50

--- POOLS ---
POOL                    ID   PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one          1   128  7.9 GiB   12.94k  7.9 GiB   0.01     25 TiB
sr-rbd-data-one          2  2048  75 TiB    26.83M  75 TiB   43.67     72 TiB
sr-rbd-one-stretch       3   160  222 GiB   68.81k  222 GiB   0.29     25 TiB
con-rbd-meta-hpc-one     7    50  54 KiB        61  54 KiB       0     12 TiB
con-rbd-data-hpc-one     8   150  36 GiB     9.42k  36 GiB       0    5.0 PiB
sr-rbd-data-one-hdd     11   560  126 TiB   33.52M  126 TiB  36.82    162 TiB
con-fs2-meta1           12   256  241 GiB   40.74M  241 GiB   0.65    9.0 TiB
con-fs2-meta2           13  1024  0 B      362.90M  0 B          0    9.0 TiB
con-fs2-data            14  1350  1.1 PiB  407.17M  1.1 PiB  14.42    5.0 PiB
con-fs2-data-ec-ssd     17   128  386 GiB    6.57M  386 GiB   1.03     29 TiB
ms-rbd-one              18   256  416 GiB  166.81k  416 GiB   0.54     25 TiB
con-fs2-data2           19  8192  1.3 PiB  534.20M  1.3 PiB  16.80    4.6 PiB
sr-rbd-data-one-perf    20  4096  4.3 TiB    1.13M  4.3 TiB  19.96    5.8 TiB
device_health_metrics   21     1  196 MiB      994  196 MiB      0     25 TiB

Immediately after starting the data movement (well, right after the 3 strange PGs became clean) it started looking like this:

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   8.7 TiB  8.7 TiB  3.8 GiB  62 GiB         0.69
hdd       11 PiB   8.2 PiB  3.0 PiB  3.0 PiB       27.19
rbd_data  262 TiB  154 TiB  105 TiB  108 TiB       41.30
rbd_perf  31 TiB   19 TiB   12 TiB   12 TiB        38.94
TOTAL     12 PiB   8.3 PiB  3.2 PiB  3.2 PiB       27.51

--- POOLS ---
POOL                    ID   PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one          1   128  8.1 GiB   12.94k  20 GiB    0.03     26 TiB
sr-rbd-data-one          2  2048  103 TiB   26.96M  102 TiB  50.97     74 TiB
sr-rbd-one-stretch       3   160  262 GiB   68.81k  614 GiB   0.78     26 TiB
con-rbd-meta-hpc-one     7    50  4.6 MiB       61  14 MiB       0    2.7 TiB
con-rbd-data-hpc-one     8   150  36 GiB     9.42k  41 GiB       0    5.0 PiB
sr-rbd-data-one-hdd     11   560  131 TiB   33.69M  214 TiB  49.99    161 TiB
con-fs2-meta1           12   256  421 GiB   40.74M  1.6 TiB  16.78    2.0 TiB
con-fs2-meta2           13  1024  0 B      362.89M  0 B          0    2.0 TiB
con-fs2-data            14  1350  1.1 PiB  407.17M  1.2 PiB  16.27    5.0 PiB
con-fs2-data-ec-ssd     17   128  564 GiB    6.57M  588 GiB   1.57     29 TiB
ms-rbd-one              18   256  637 GiB  166.82k  1.2 TiB   1.60     26 TiB
con-fs2-data2           19  8192  1.3 PiB  534.21M  1.6 PiB  20.37    4.6 PiB
sr-rbd-data-one-perf    20  4096  4.3 TiB    1.13M  12 TiB   41.41    5.7 TiB
device_health_metrics   21     1  207 MiB      994  620 MiB      0     26 TiB

The columns STORED and USED now show completely different numbers. In fact, I believe the new numbers are the correct ones: they match the %USE of the fullest OSDs in the respective pools much better, and the USED column now reflects the replication factor correctly. This cluster was recently upgraded from mimic to octopus.
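In case it helps, this is roughly how I cross-checked the new numbers (pool names as in the ceph df output above; for a replicated pool, USED should be roughly STORED times the pool's size):

    # per-OSD fill levels; look at the device class / sub-tree backing the pool
    ceph osd df tree
    # replication factor of the pool in question
    ceph osd pool get sr-rbd-data-one-perf size

For sr-rbd-data-one-perf, for example, the new USED (12 TiB) is about 3x STORED (4.3 TiB), consistent with 3-fold replication, whereas the old output showed USED equal to STORED.

For completeness, the device-class change itself was done along these lines (sketched here as the standard crushtool round-trip with placeholder file names; the relevant points are that norebalance was set the whole time and that only the device class in the rule was changed):

    ceph osd set norebalance
    ceph osd getcrushmap -o cm.bin
    crushtool -d cm.bin -o cm.txt
    # change the device class in the rule's "step take <root> class <class>" line
    crushtool -c cm.txt -o cm.new.bin
    ceph osd setcrushmap -i cm.new.bin
    # let peering finish and inspect the new mappings (ceph pg ls remapped), then
    ceph osd unset norebalance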
Any idea what could have triggered this change in accounting, and which numbers I should believe?

Thanks and best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx