We've been tracking and investigating two issues: PGs of different sizes, which leads to unevenly filled OSDs, and the snap_trimq for our PGs not emptying.
We noticed 2 things separately and then realized they were related.
1) We have a very good way to balance our clusters, but in 2 different clusters we had an OSD that was 10% more full than anything else.
2) For monitoring, we query 300 random PGs every 5 minutes and calculate what our total snap_trimq would be if those 300 PGs were typical of the 32k PGs in the cluster. We are seeing that 2 of our clusters never get close to catching up on their snap_trimq.
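The extrapolation in 2) is just the sample mean scaled up to the whole cluster. A minimal sketch of that arithmetic, assuming the per-PG snap_trimq lengths have already been collected somehow; the function name and the example numbers are hypothetical and not taken from our monitoring code:

```python
import random


def estimate_total_snap_trimq(sampled_lengths, total_pgs):
    """Scale the mean snap_trimq length of a PG sample up to the whole cluster.

    sampled_lengths: snap_trimq length for each sampled PG (hypothetical input;
                     how the lengths are gathered is outside this sketch).
    total_pgs:       number of PGs in the cluster, e.g. roughly 32k for us.
    """
    if not sampled_lengths:
        raise ValueError("need at least one sampled PG")
    mean_length = sum(sampled_lengths) / len(sampled_lengths)
    return mean_length * total_pgs


def pick_sample(pg_ids, sample_size=300):
    """Pick PGs uniformly at random, without replacement."""
    return random.sample(list(pg_ids), min(sample_size, len(pg_ids)))


if __name__ == "__main__":
    # Made-up lengths purely for illustration.
    print(estimate_total_snap_trimq([1, 2, 3], 300))
```

Because the sample is random, the estimate moves around a little between runs; what matters for us is whether it trends toward zero over a day.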
We realized that these are the same 2 clusters and that the problems might be related. A `du -sh` of the PGs shows that the PGs primary to the OSDs that are 10% more full are each 10GB (~30%) larger than the PGs primary to other OSDs in the cluster, and that the snap_trimq on those PGs is at a size that accounts for the extra 10GB.
We have been able to clean up one of the OSDs by setting its snap_trim_sleep to 0.0, down from our current setting of 0.25, as well as triggering a reweight to move some data off of the OSD. We're currently testing whether adjusting only the snap_trim_sleep down to 0.0 fixes this problem for future OSDs, and it is looking promising. Lowering it to 0.05 had no noticeable effect. We are deleting ~5k snapshots every day in these clusters, which have 32k PGs and 1000+ OSDs. We have one cluster with 32k PGs and 957 OSDs that isn't exhibiting this behavior to the same extent yet, although it is no longer getting down to an empty snap_trimq each day.
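For anyone wanting to try the same change on a single OSD at runtime, here is a rough sketch of one way to do it. It assumes the setting is injected with `ceph tell osd.<id> injectargs` and that the option name is `osd_snap_trim_sleep`; the helper name and the OSD id 12 are hypothetical. The script only builds and prints the command; it does not run it.

```python
import shlex


def build_snap_trim_sleep_cmd(osd_id, sleep_seconds):
    """Build the argv for changing snap trim sleep on one OSD at runtime.

    Assumes `ceph tell osd.<id> injectargs` is the mechanism used;
    osd_id and sleep_seconds are supplied by the caller.
    """
    return [
        "ceph", "tell", f"osd.{osd_id}",
        "injectargs", f"--osd_snap_trim_sleep {sleep_seconds}",
    ]


if __name__ == "__main__":
    # OSD id 12 is a placeholder; 0.0 is the value that worked for us.
    cmd = build_snap_trim_sleep_cmd(12, 0.0)
    # Print a shell-safe form to review before running it by hand.
    print(" ".join(shlex.quote(part) for part in cmd))
```

A value injected this way does not survive an OSD restart, so a permanent change would also need to go into the OSD's configuration.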
Does anyone have any theories or experiences with problems like this? Thank you for your help.
_______________________________________________ Ceph-large mailing list Ceph-large@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com