Hi Cephers,
This is Jevon. I'm currently struggling with an issue and am writing to ask for your comments, help, or suggestions. Details are provided below.
Issue:
I set up a cluster with 24 OSDs and created one pool with 1024 placement groups on it for a small startup company. The number 1024 was calculated from the formula (OSDs * 100) / pool size. The cluster has been running well for a long time, but recently our monitoring system keeps complaining that the usage of some disks exceeds 85%. When I log into the system, I find that usage on some disks is indeed very high, while on others it is below 60%. Each time this happens, I have to rebalance the data distribution manually. That is only a short-term fix, and I am not willing to keep doing it.
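For reference, the PG calculation mentioned above can be sketched roughly as follows. Two things here are assumptions, not facts stated in this mail: a pool size (replica count) of 3, and rounding the raw result up to the next power of two. Both are consistent with arriving at 1024 for 24 OSDs. The function name suggested_pg_count is made up for illustration.

```python
def suggested_pg_count(num_osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """Compute (num_osds * pgs_per_osd) / pool_size, rounded up to a power of two.

    The rounding step is an assumption; the mail only gives the raw formula.
    """
    raw = num_osds * pgs_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


# pool_size=3 is an assumption; the replica count is not given in the mail.
# 24 * 100 / 3 = 800, and the next power of two is 1024.
print(suggested_pg_count(24, 3))
```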
Two long-term solutions come to mind:
1) Ask the customers to expand their clusters by adding more OSDs. However, I expect they will ask me to explain the cause of the imbalanced data distribution. We have already analyzed the environment and found that the most imbalanced part of the placement is the mapping from objects to PGs: the largest PG has 613 objects, while the smallest has only 226.
2) Increase the number of placement groups. This should help make the data distribution statistically more uniform, but it can also incur significant data movement because PGs are effectively split. I cannot do this in our customers' environment until we fully understand the consequences. Has anyone done this in a production environment? How much does the operation affect client performance?
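As a rough sketch of what option 2 might look like if done gradually rather than in one jump, the snippet below only prints a plan of `ceph osd pool set ... pg_num` / `pgp_num` commands and does not run anything. The pool name "mypool", the target of 2048, and the step of 256 are all hypothetical values chosen for illustration. Raising the count in small steps and letting the cluster settle between them is an assumed way to limit the data movement per step, not a tested procedure.

```python
from typing import List


def pg_increase_steps(current: int, target: int, step: int) -> List[int]:
    """Return the intermediate pg_num values from current up to target."""
    if step <= 0 or target < current:
        raise ValueError("step must be positive and target must be >= current")
    values = []
    n = current
    while n < target:
        n = min(n + step, target)
        values.append(n)
    return values


def plan_commands(pool: str, current: int, target: int, step: int) -> List[str]:
    """Build the command lines for each step; nothing is executed here."""
    commands = []
    for n in pg_increase_steps(current, target, step):
        commands.append(f"ceph osd pool set {pool} pg_num {n}")
        commands.append(f"ceph osd pool set {pool} pgp_num {n}")
    return commands


# Hypothetical values: pool name, target, and step size are not from the mail.
for line in plan_commands("mypool", 1024, 2048, 256):
    print(line)
```

Between steps, one would presumably check cluster health (for example with `ceph -s`) before continuing.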
Any comments/help/suggestions will be highly appreciated.
Best Regards
Jevon
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com