Hi,
On 1 Feb 2015 22:04, "Xu (Simon) Chen" <xchenum@xxxxxxxxx> wrote:
>
> Dan,
>
> I always have noout set, so that single OSD failures won't trigger any recovery immediately. When the OSD (or sometimes multiple OSDs on the same server) comes back, I do see slow requests during backfilling, but probably not thousands. When I added a brand new OSD into the cluster, for some reason ~2000 blocked requests would show up initially, which gradually reduced to a few hundred and eventually to tens.
>
> I don't think I touched "recovery op priority", which means it defaults to 10 - I'll try reducing it to 1 in a bit. I do set max backfills to 1 though.
>
> I am running 0.80.7, and probably will update to 0.80.8 soon.
>
Does your Glance images pool have many purged_snaps? Either ceph osd dump or ceph pg dump will tell you that (sorry I'm mobile so can't answer precisely). If you have a large purged_snaps set on the images pool, then I'd bet you're suffering from the snap trim issue I mentioned. 0.80.8 fixes it... You won't see slow requests anymore.
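[Editor's note: Dan is writing from mobile and can't recall the exact command, so the following is a hedged sketch only. The pool name ("images") and PG id (5.0) are placeholders; field names vary somewhat between releases.]

```shell
# Per-pool snapshot state: long removed_snaps intervals on the images
# pool line are the red flag for heavy snap-trim work.
ceph osd dump | grep images

# Per-PG detail: the purged_snaps set shows up in the PG query output.
ceph pg 5.0 query | grep -i purged
```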
Cheers, Dan
> Thanks.
> -Simon
>
>
> On Sun, Feb 1, 2015 at 3:53 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>> When do you see thousands of slow requests during recovery... Does that happen even with single OSD failures? You should be able to recover disks without slow requests.
>>
>> I always run with recovery op priority at the minimum 1. Tweaking the number of max backfills did not change much during that recent splitting exercise.
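[Editor's note: for reference, a minimal ceph.conf sketch of the settings discussed in this thread; the option names are the standard OSD recovery options, and the defaults noted are those of the firefly era.]

```ini
[osd]
; reduce the impact of recovery/backfill on client I/O
osd recovery op priority = 1   ; default 10 in firefly
osd max backfills = 1
```

The same values can also be injected at runtime without a restart, e.g. `ceph tell osd.* injectargs '--osd-recovery-op-priority 1 --osd-max-backfills 1'`.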
>>
>> Which Ceph version are you running? There have been snap trim related recovery problems that have only recently been fixed in production releases. 0.80.8 is OK, but I don't know about giant...
>>
>> Cheers, Dan
>>
>> On 1 Feb 2015 21:39, "Xu (Simon) Chen" <xchenum@xxxxxxxxx> wrote:
>> >
>> > In my case, each object is 8MB (glance default for storing images on the rbd backend.) RBD doesn't work extremely well when ceph is recovering - it is common to see hundreds or a few thousand blocked requests (>30s to finish). This translates to high IO wait inside of VMs, and many applications don't deal with this well.
>> >
>> > I am not convinced that increasing pg_num gradually is the right way to go. Have you tried giving backfill traffic a very low priority?
>> >
>> > Thanks.
>> > -Simon
>> >
>> > On Sun, Feb 1, 2015 at 2:39 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >> I don't know the general calculation, but last week we split a pool with 20 million tiny objects from 512 to 1024 pgs, on a cluster with 80 OSDs. IIRC around 7 million objects needed to move, and it took around 13 hours to finish. The bottleneck in our case was objects per second (limited to around 1000/s), not network throughput (which never exceeded ~50MB/s).
>> >>
>> >> It wasn't completely transparent... the time to write a 4kB object increased from 5ms to around 30ms during this splitting process.
>> >>
>> >> I would guess that if you split from 1k to 8k pgs, around 80% of your data will move. Basically, 7 out of 8 objects will be moved to a new primary PG, but any objects that end up with 2nd or 3rd copies on the first 1k PGs should not need to be moved.
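[Editor's note: a back-of-the-envelope check of the "7 out of 8" figure. Since both 1024 and 8192 are powers of two, Ceph's stable_mod placement reduces to a simple bitmask of the object hash, so an object keeps its PG id only if the three new mask bits happen to be zero. The simulation below is a sketch under that assumption, not Ceph's actual CRUSH pipeline.]

```python
import random

def pg_of(obj_hash: int, pg_num: int) -> int:
    # For power-of-two pg_num, Ceph's stable_mod reduces to a bitmask.
    return obj_hash & (pg_num - 1)

random.seed(42)
hashes = [random.getrandbits(32) for _ in range(200_000)]

old_pg_num, new_pg_num = 1024, 8192
moved = sum(1 for h in hashes if pg_of(h, old_pg_num) != pg_of(h, new_pg_num))
frac = moved / len(hashes)
print(f"fraction of objects whose PG id changes: {frac:.3f}")  # ~0.875, i.e. 7/8
```

This only counts primary PG id changes; as noted above, replicas that already sit on suitable OSDs may not need to move, which is why the observed data movement can come in below 7/8.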
>> >>
>> >> I'd also be interested to hear of similar splitting experiences. We've been planning a similar intervention on our larger cluster to move from 4k PGs to 16k. I have been considering making the change gradually (10-100 PGs at a time) instead of all at once. This approach would certainly lower the performance impact, but would take much much longer to complete. I wrote a short script to perform this gentle splitting here: https://github.com/cernceph/ceph-scripts/blob/master/tools/split/ceph-gentle-split
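[Editor's note: a hedged sketch of the gentle-split loop; see the linked ceph-gentle-split script for the real implementation. The pool name "data", step size 64, and sleep interval are placeholders, and pgp_num must be raised alongside pg_num.]

```shell
pool=data
target=16384
step=64

cur=$(ceph osd pool get "$pool" pg_num | awk '{print $NF}')
while [ "$cur" -lt "$target" ]; do
    next=$((cur + step))
    [ "$next" -gt "$target" ] && next=$target
    ceph osd pool set "$pool" pg_num "$next"
    ceph osd pool set "$pool" pgp_num "$next"
    # wait for the cluster to settle before taking the next step
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
    cur=$next
done
```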
>> >>
>> >> Be sure to understand what it's doing before trying it.
>> >>
>> >> Cheers,
>> >> Dan
>> >>
>> >> On 1 Feb 2015 18:21, "Xu (Simon) Chen" <xchenum@xxxxxxxxx> wrote:
>> >>>
>> >>> Hi folks,
>> >>>
>> >>> I was running a ceph cluster with 33 OSDs. Recently, 198 new OSDs (6 on each of 33 new servers) were added; I finished rebalancing the data and then marked the 33 old OSDs out.
>> >>>
>> >>> As I have 6x as many OSDs, I am thinking of increasing pg_num of my largest pool from 1k to at least 8k. What worries me is that this cluster has around 10M objects and is supporting many production VMs with RBD.
>> >>>
>> >>> I am wondering if there is a good way to estimate the amount of data that will be shuffled after I increase pg_num. I want to make sure this can be done within a reasonable amount of time, such that I can declare a proper maintenance window (either overnight, or throughout a weekend...)
>> >>>
>> >>> Thanks!
>> >>> -Simon
>> >>>
>> >>> _______________________________________________
>> >>> ceph-users mailing list
>> >>> ceph-users@xxxxxxxxxxxxxx
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >
>
>