Hi everyone,
we recently upgraded our production cluster to this version:
ceph version 14.2.3-349-g7b1552ea82
(7b1552ea827cf5167b6edbba96dd1c4a9dc16937) nautilus (stable)
We then activated pg_autoscaler for two pools that had an unsuitable
pg_num, and the result was satisfying.
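For reference, it was enabled roughly like this (the mgr module plus
the per-pool mode; the pool name is just a placeholder):

ceph mgr module enable pg_autoscaler
ceph osd pool set <pool> pg_autoscale_mode on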
However, after the rebalance had finished the cluster became laggy. We
noticed that two out of three MONs showed much higher CPU usage than
usual; according to `top` the MON processes consumed more than 100% CPU.
Restarting the MON services and disabling pg_autoscaler resolved the
issue. I've read that the balancer module can cause higher load on
the MGR daemon; could this be related?
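In case it helps, the recovery steps were essentially the following
(hostname and pool name are placeholders):

systemctl restart ceph-mon@<mon-host>
ceph osd pool set <pool> pg_autoscale_mode off

(or alternatively `ceph mgr module disable pg_autoscaler` to turn the
autoscaler off globally).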
Another thing worth mentioning is the autoscaler's confusing size
calculation. After the pg numbers had been corrected we got these
warnings about overcommitted pools:
1 subtrees have overcommitted pool target_size_bytes
1 subtrees have overcommitted pool target_size_ratio
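(Side note: if I understand the documentation correctly, those targets
can be adjusted or cleared per pool with something like

ceph osd pool set <pool> target_size_bytes 0
ceph osd pool set <pool> target_size_ratio 0

but that is not my main concern here.)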
The images pool was responsible for those warnings. The confusing part
was that autoscale-status sometimes displayed the SIZE of that pool as
more than 14 TB:
ceph osd pool autoscale-status
POOL    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
images  14399G               3.0   33713G        1.2813                1.0   128                 on
And a couple of minutes later the same pool was suddenly reported with only around 4 TB:
ceph osd pool autoscale-status
POOL    SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
images  4112G                3.0   33713G        0.3659                1.0   128                 on
There seems to be some kind of inconsistency here. The actual used
storage of this pool according to `ceph df` is:
POOLS:
    POOL     ID  STORED   OBJECTS  USED    %USED  MAX AVAIL
    images    1  4.1 TiB    1.01M  12 TiB  49.73    4.1 TiB
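If I read the RATIO column correctly it is simply SIZE * RATE / RAW
CAPACITY, so both snapshots are at least internally consistent with
whatever SIZE the autoscaler reported at that moment:

14399G * 3.0 / 33713G = 1.2813
 4112G * 3.0 / 33713G = 0.3659

The jump therefore seems to be in the SIZE value itself; the lower
value roughly matches STORED from `ceph df`, while I cannot map the
higher value to anything in that output.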
Has anyone experienced something similar? Are these known issues?
Regards,
Eugen