ceph rebalance behavior


 



So we were forced out of our datacenter and had to move all our OSD nodes to new racks. Accordingly, we updated the CRUSH map to reflect the nodes' new rack positions, which triggered a huge rebalance.


We're now getting OSD nearfull warnings on OSDs across all the racks; it started with one nearfull OSD and is now up to five. OSDs within the same node show a wide variance in capacity used: within a single node we have OSDs at 85% full and others at 49% full. We tried ceph osd reweight-by-utilization, but it didn't appear to do anything; the nearfull OSDs are still filling up. We've also observed that the utilization of nearfull OSDs fluctuates, going up and then back down.
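For reference, our understanding of what reweight-by-utilization is supposed to do is roughly the following (a simplified sketch, not Ceph's actual code; the function name, oload factor, and max-change cap here are illustrative, though the defaults of 1.2 and 0.05 match what we've seen documented):

```python
# Simplified sketch of the idea behind `ceph osd reweight-by-utilization`:
# OSDs whose utilization exceeds the cluster average by more than the
# oload factor get their override reweight scaled down toward the average,
# capped at max_change per run.

def reweight_by_utilization(usages, reweights, oload=1.2, max_change=0.05):
    """usages: {osd_id: fraction_full}; reweights: {osd_id: 0.0-1.0}."""
    avg = sum(usages.values()) / len(usages)
    new = dict(reweights)
    for osd, used in usages.items():
        if used > avg * oload:
            # Scale the reweight down in proportion to the overuse,
            # but move at most max_change in a single run.
            target = reweights[osd] * avg / used
            new[osd] = max(reweights[osd] - max_change, target)
    return new

# Numbers resembling what we see on one node: one hot OSD, one cold.
usages = {0: 0.85, 1: 0.49, 2: 0.64, 3: 0.60}
print(reweight_by_utilization(usages, {o: 1.0 for o in usages}))
```

Given that, we expected the 85%-full OSD's reweight to drop a little each run and data to drain off it, which is not what we're observing.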


We're also seeing one backfillfull warning, even though that OSD is only 77% utilized. We're not sure why it would warn when it's nowhere near backfillfull_ratio.
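One theory we have (we're not certain this matches Ceph's internals) is that the backfillfull check is based on projected usage, i.e. current bytes plus the bytes of PGs queued to backfill onto the OSD, rather than current usage alone. If so, the arithmetic works out; the numbers below are made up for illustration:

```python
# Hypothetical: if the backfillfull check used projected usage
# (current bytes + bytes of PGs queued to land on the OSD),
# a 77%-utilized OSD could still exceed a 0.90 ratio.

def projected_utilization(used_bytes, incoming_backfill_bytes, capacity_bytes):
    return (used_bytes + incoming_backfill_bytes) / capacity_bytes

TB = 10**12
# 7.7 TB used on a 10 TB OSD (77%), with 1.5 TB of PGs queued to move in:
print(projected_utilization(7.7 * TB, 1.5 * TB, 10 * TB))  # 0.92 > 0.90
```

That would also be consistent with the warning appearing mid-rebalance while the OSD's actual utilization stays well under 90%.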


$ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85


Our total capacity used is at 64%.


Performance seems alright, just a little slower than normal. There was a period when CephFS was very slow, with file operations taking 10-30 seconds to complete (rgw was fine during that time). That seems to have cleared up now.


Does any of the behavior described seem normal? Should we be concerned about anything?


Thanks!

--

Vincent Chu

A-4: Advanced Research in Cyber Systems

Los Alamos National Laboratory
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


