Hi,
is adjusting crush weight really a good solution for this? Crush weight out of the box corresponds to OSD capacity in TB, and this looks like a good “weight” to me. The issue is not that a bucket has the wrong weight, but something else in how CRUSH places the data.
We actually use “osd reweight” for this (which is not good either, because it’s non-persistent), but we consider it temporary until we upgrade, switch to newer crush, etc.

Jan

> On 05 Aug 2015, at 11:20, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>
> On Wed, Aug 5, 2015 at 1:36 PM, 乔建峰 <scaleqiao@xxxxxxxxx> wrote:
>> Add the mailing lists.
>>
>> 2015-08-05 13:34 GMT+08:00 乔建峰 <scaleqiao@xxxxxxxxx>:
>>>
>>> Hi Haomai,
>>>
>>> Thank you for the prompt response and the suggestion.
>>>
>>> I cannot agree with you more about using multiple pools in one flexible
>>> cluster. Per the scenario you described below, we can create more pools when
>>> expanding the cluster. But for the issue we are facing right now, creating
>>> a new pool with proper pg_num/pgp_num might only help with uniformly
>>> distributing the data of new images. It could not relieve the imbalance
>>> within the existing data. Please correct me if I'm wrong.
>
> For the existing pool, you could adjust crush weight to get better
> data balance.
>
>>>
>>> Thanks,
>>> Jevon
>>>
>>> 2015-08-04 22:01 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
>>>>
>>>> On Mon, Aug 3, 2015 at 4:05 PM, 乔建峰 <scaleqiao@xxxxxxxxx> wrote:
>>>>> [Including ceph-users alias]
>>>>>
>>>>> 2015-08-03 16:01 GMT+08:00 乔建峰 <scaleqiao@xxxxxxxxx>:
>>>>>>
>>>>>> Hi Cephers,
>>>>>>
>>>>>> Currently, I'm experiencing an issue that is causing me a lot of
>>>>>> trouble, so I'm writing to ask for your comments/help/suggestions.
>>>>>> More details are provided below.
>>>>>>
>>>>>> Issue:
>>>>>> I set up a cluster having 24 OSDs and created one pool with 1024
>>>>>> placement groups on it for a small startup company.
>>>>>> The number 1024 was calculated per
>>>>>> the equation (OSDs * 100)/pool size. The cluster has been running quite
>>>>>> well for a long time. But recently, our monitoring system keeps
>>>>>> complaining that some disks' usage exceeds 85%. I logged into the system
>>>>>> and found that some disks' usage is really very high, but some are not
>>>>>> (less than 60%). Each time the issue happens, I have to manually
>>>>>> re-balance the distribution. This is a short-term solution; I'm not
>>>>>> willing to do it all the time.
>>>>>>
>>>>>> Two long-term solutions come to mind:
>>>>>> 1) Ask the customers to expand their clusters by adding more OSDs. But I
>>>>>> think they will ask me to explain the reason for the imbalanced data
>>>>>> distribution. We've already done some analysis on the environment, and we
>>>>>> learned that the most imbalanced part in CRUSH is the mapping between
>>>>>> object and pg. The biggest pg has 613 objects, while the smallest pg only
>>>>>> has 226 objects.
>>>>>>
>>>>>> 2) Increase the number of placement groups. It can be of great help for
>>>>>> statistically uniform data distribution, but it can also incur significant
>>>>>> data movement as PGs are effectively being split. I just cannot do it in
>>>>>> our customers' environment before we 100% understand the consequences. So
>>>>>> has anyone done this in a production environment? How much does this
>>>>>> operation affect the performance of clients?
>>>>>>
>>>>>> Any comments/help/suggestions will be highly appreciated.
>>>>
>>>> Of course not, pg split isn't a recommended process for a running cluster.
>>>> It will block the client IO totally. Unlike the recovery process, which
>>>> works with object-level control, split is a pg-level process and the
>>>> osd itself can't control it smoothly.
>>>> In theory, if we need to make pg
>>>> split work on a real cluster, we need to do more things at the MON, and
>>>> lots of that logic will cause trouble. Although we can't enjoy the
>>>> flexibility of pg split, we can get the same result from *pool* with a
>>>> little user management logic.
>>>>
>>>> "pool" is a good thing which can cover your need. Most users like
>>>> to have one pool for the whole cluster; that's fine for an immutable
>>>> cluster but not good for a flexible cluster, I think. For example, if
>>>> you double the osd nodes, creating a new pool is a better way than
>>>> preparing a pool with lots of pgs at the very beginning. If using
>>>> openstack, cloudstack or similar, these cloud projects can provide
>>>> upper-level control with "volume_type".
>>>>
>>>> In a word, we can enjoy increasing osds by a relatively small
>>>> amount. But I think we can't feel free to double the ceph cluster and
>>>> hope ceph will handle it perfectly.
>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>> Jevon
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> --
>>>> Best Regards,
>>>>
>>>> Wheat
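
For reference, the rule of thumb quoted in the thread, (OSDs * 100)/pool size, can be sketched in Python roughly as below. This is only an illustration under assumptions: the thread does not state the pool size, so 3 replicas is assumed here, and rounding up to the next power of two is also an assumption, although it is consistent with the 1024 figure reported for 24 OSDs. The function names are hypothetical and not part of any Ceph tool.

```python
import math


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (requires n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def suggested_pg_count(num_osds: int, pool_size: int, pgs_per_osd: int = 100) -> int:
    """Apply the (OSDs * 100) / pool size rule of thumb from the thread.

    Assumption: the raw result is rounded up to the next power of two.
    """
    if num_osds < 1 or pool_size < 1:
        raise ValueError("num_osds and pool_size must be positive")
    raw = (num_osds * pgs_per_osd) / pool_size
    return next_power_of_two(math.ceil(raw))


if __name__ == "__main__":
    # 24 OSDs as in the thread; pool_size=3 is an assumed replica count.
    print(suggested_pg_count(num_osds=24, pool_size=3))
```

With 24 OSDs and an assumed pool size of 3, the raw value is 800, which rounds up to 1024. A different pool size would give a different count, so the numbers should be checked against the actual pool settings.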