Re: Is it safe to increase pg numbers in a production environment

On Wed, Aug 5, 2015 at 1:36 PM, 乔建峰 <scaleqiao@xxxxxxxxx> wrote:
> Add the mailing lists.
>
> 2015-08-05 13:34 GMT+08:00 乔建峰 <scaleqiao@xxxxxxxxx>:
>>
>> Hi Haomai,
>>
>> Thank you for the prompt response and the suggestion.
>>
>> I cannot agree with you more about using multiple pools in one flexible
>> cluster. Per the scenario you described below, we can create more pools
>> when expanding the cluster. But for the issue we are facing right now,
>> creating a new pool with a proper pg_num/pgp_num would only help to
>> distribute the data of new images uniformly. It would not relieve the
>> imbalance within the existing data. Please correct me if I'm wrong.

For the existing pool, you could adjust the CRUSH weights to get a better
data balance.
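
For example, a minimal sketch (osd.12 and the 0.9 below are only
placeholders for your own OSD id and target weight):

    # check current utilization and weights first
    ceph osd df tree
    # lower the CRUSH weight of an over-full OSD a little
    ceph osd crush reweight osd.12 0.9

Alternatively, ceph osd reweight-by-utilization can do a coarse pass
automatically (it adjusts the temporary override weights rather than the
CRUSH weights). Either way this only shuffles data among the existing PGs;
it does not change the PG count.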

>>
>> Thanks,
>> Jevon
>>
>> 2015-08-04 22:01 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
>>>
>>> On Mon, Aug 3, 2015 at 4:05 PM, 乔建峰 <scaleqiao@xxxxxxxxx> wrote:
>>> > [Including ceph-users alias]
>>> >
>>> > 2015-08-03 16:01 GMT+08:00 乔建峰 <scaleqiao@xxxxxxxxx>:
>>> >>
>>> >> Hi Cephers,
>>> >>
>>> >> Currently, I'm experiencing an issue that is causing me a lot of
>>> >> trouble, so I'm writing to ask for your comments/help/suggestions.
>>> >> More details are provided below.
>>> >>
>> Issue:
>> I set up a cluster with 24 OSDs and created one pool with 1024 placement
>> groups on it for a small startup company. The number 1024 was calculated
>> from the formula (OSDs * 100) / pool size. The cluster has been running
>> quite well for a long time, but recently our monitoring system keeps
>> complaining that some disks' usage exceeds 85%. When I log into the
>> system, I find that some disks' usage really is very high, while others
>> are not (less than 60%). Each time the issue happens, I have to manually
>> rebalance the distribution. That is only a short-term fix, and I'm not
>> willing to keep doing it.
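
(A quick note on the 1024 figure, assuming the pool size is the usual 3
replicas: 24 OSDs * 100 / 3 = 800, and rounding up to the next power of
two gives 1024, so the pool was sized the way the standard guidance
suggests.)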
>>> >>
>> Two long-term solutions come to mind:
>> 1) Ask the customers to expand their clusters by adding more OSDs. But I
>> think they will ask me to explain the reason for the imbalanced data
>> distribution. We've already done some analysis of the environment, and we
>> learned that the most imbalanced part in CRUSH is the mapping between
>> objects and PGs. The biggest PG has 613 objects, while the smallest PG
>> has only 226 objects.
>>> >>
>> 2) Increase the number of placement groups. It can be of great help for
>> a statistically uniform data distribution, but it can also incur
>> significant data movement, as PGs are effectively being split. I just
>> cannot do it in our customers' environment before we understand the
>> consequences 100%. So has anyone done this in a production environment?
>> How much does this operation affect client performance?
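
(For reference, the knobs involved would be something like the following;
the pool name and target count are only placeholders, and as my earlier
reply quoted below explains, I would not run this on a loaded production
cluster:

    # split the PGs first...
    ceph osd pool set rbd pg_num 2048
    # ...then let the new PGs be placed independently, which triggers
    # the actual data movement
    ceph osd pool set rbd pgp_num 2048

Both steps cause work, and the split itself is the part that hurts
client IO.)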
>>> >>
>>> >> Any comments/help/suggestions will be highly appreciated.
>>>
>>> No, PG splitting isn't a recommended process for a running cluster.
>>> It will block client IO completely. Unlike the recovery process, which
>>> the OSD can control at the object level, splitting is a PG-level
>>> process and the OSD itself can't control it smoothly. In theory, making
>>> PG splitting work well on a real cluster would require more work on the
>>> MON side, and a lot of that logic would cause trouble. Although we
>>> can't get that flexibility via PG splitting, we can achieve the same
>>> result with *pools* and a little user-side management logic.
>>>
>>> A pool is a good tool that can cover your need. Most users like to have
>>> one pool for the whole cluster; that's fine for a cluster that never
>>> changes, but I think it's not good for a cluster that grows. For
>>> example, if you double the number of OSD nodes, creating a new pool is
>>> a better approach than preparing one pool with lots of PGs at the very
>>> beginning. If you use OpenStack, CloudStack or similar, these cloud
>>> projects can provide the higher-level control with a "volume_type".
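
To make that concrete, a minimal sketch (the pool name, PG count and
volume type name are only examples, not fixed names):

    # new pool sized for the newly added OSDs
    ceph osd pool create volumes-2 1024 1024
    # expose it through a dedicated Cinder backend (rbd_pool = volumes-2,
    # volume_backend_name = rbd-new in cinder.conf) and give users a
    # volume type that selects it:
    cinder type-create rbd-new
    cinder type-key rbd-new set volume_backend_name=rbd-new

New volumes created with that type land in the new pool, so the added
capacity gets used without touching the existing pool at all.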
>>>
>>> In short, we can grow the cluster by adding OSDs in relatively small
>>> increments. But I don't think we can simply double the Ceph cluster and
>>> expect Ceph to handle it perfectly.
>>>
>>> >>
>>> >> --
>>> >> Best Regards
>>> >> Jevon
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards
>>> > Jevon
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@xxxxxxxxxxxxxx
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>>
>> --
>> Best Regards
>> Jevon
>
>
>
>
> --
> Best Regards
> Jevon



-- 
Best Regards,

Wheat
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



