Hi,
When do you see thousands of slow requests during recovery? Does that happen even with single-OSD failures? You should be able to recover disks without triggering slow requests.
I always run with the recovery op priority at its minimum of 1. Tweaking the max backfills setting did not change much during our recent splitting exercise.
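In case it helps, this is roughly how I inject those settings at runtime. A sketch only: the option names are the standard OSD recovery/backfill options, but double-check them and their defaults against your release.

```shell
# Hedged sketch: throttle recovery/backfill so client I/O keeps priority.
# osd-recovery-op-priority 1 is the lowest priority for recovery ops;
# osd-max-backfills / osd-recovery-max-active limit concurrent recovery work.
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
```

Note that injectargs changes are not persistent; put the same options in ceph.conf to survive OSD restarts.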
Which Ceph version are you running? There have been snap-trim-related recovery problems that were only recently fixed in the production releases. 0.80.8 is OK, but I don't know about Giant...
Cheers, Dan
On 1 Feb 2015 21:39, "Xu (Simon) Chen" <xchenum@xxxxxxxxx> wrote:
>
> In my case, each object is 8MB (the Glance default for storing images on the RBD backend). RBD doesn't work particularly well while Ceph is recovering - it is common to see hundreds or even a few thousand blocked requests (>30s to finish). This translates into high I/O wait inside the VMs, and many applications don't handle that well.
>
> I am not convinced that increasing pg_num gradually is the right way to go. Have you tried giving the backfill traffic a very low priority?
>
> Thanks.
> -Simon
>
> On Sun, Feb 1, 2015 at 2:39 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>> I don't know of a general calculation, but last week we split a pool with 20 million tiny objects from 512 to 1024 PGs, on a cluster with 80 OSDs. IIRC around 7 million objects needed to move, and it took around 13 hours to finish. The bottleneck in our case was objects per second (limited to around 1000/s), not network throughput (which never exceeded ~50MB/s).
>>
>> It wasn't completely transparent... the time to write a 4kB object increased from 5ms to around 30ms during this splitting process.
>>
>> I would guess that if you split from 1k to 8k PGs, around 80% of your data will move. Basically, 7 out of 8 objects will be mapped to a new primary PG, but any objects whose 2nd or 3rd copies land back on the first 1k PGs should not need to move.
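That 7-out-of-8 figure can be sanity-checked with simple arithmetic: when pg_num is a power of two, an object's PG comes from the low bits of its hash, so after a split roughly (new - old) / new of the objects land in a new PG. A quick sketch, using the 1k -> 8k numbers from this thread:

```shell
# Expected fraction of objects that remap when a pool's pg_num grows
# from old_pgs to new_pgs (both powers of two): (new - old) / new.
old_pgs=1024
new_pgs=8192
echo "$(( (new_pgs - old_pgs) * 100 / new_pgs ))% of objects expected to remap"
# prints: 87% of objects expected to remap
```

This is only the object-remapping fraction; actual bytes moved also depend on replica placement, as noted above.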
>>
>> I'd also be interested to hear of similar splitting experiences. We've been planning a similar intervention on our larger cluster to move from 4k PGs to 16k. I have been considering making the change gradually (10-100 PGs at a time) instead of all at once. This approach would certainly lower the performance impact, but would take much much longer to complete. I wrote a short script to perform this gentle splitting here: https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split
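The linked script boils down to a loop like the following. This is a simplified sketch, not the script itself: the pool name, step size, and sleep interval are placeholders, and you should read the real script for the extra safety checks it performs.

```shell
# Simplified sketch of a gentle PG split loop.
# Placeholders/assumptions: pool name, target, step, and poll interval.
pool=rbd          # hypothetical pool name
target=16384      # final pg_num
step=64           # PGs to add per iteration
cur=$(ceph osd pool get "$pool" pg_num | awk '{print $2}')
while [ "$cur" -lt "$target" ]; do
    next=$(( cur + step ))
    [ "$next" -gt "$target" ] && next=$target
    ceph osd pool set "$pool" pg_num "$next"
    ceph osd pool set "$pool" pgp_num "$next"
    # wait for splitting/backfill to settle before the next increment
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    cur=$next
done
```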
>>
>> Be sure to understand what it's doing before trying it.
>>
>> Cheers,
>> Dan
>>
>> On 1 Feb 2015 18:21, "Xu (Simon) Chen" <xchenum@xxxxxxxxx> wrote:
>>>
>>> Hi folks,
>>>
>>> I was running a Ceph cluster with 33 OSDs. Recently, 33x6 new OSDs hosted on 33 new servers were added; I have finished rebalancing the data and then marked the 33 old OSDs out.
>>>
>>> Since I now have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster holds around 10M objects and supports many production VMs via RBD.
>>>
>>> I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, so that I can declare a proper maintenance window (either overnight or over a weekend).
>>>
>>> Thanks!
>>> -Simon
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>