Re: Increasing pg_num

Hello,

On Tue, 17 May 2016 10:47:15 +1000 Chris Dunlop wrote:

> On Tue, May 17, 2016 at 08:21:48AM +0900, Christian Balzer wrote:
> > On Mon, 16 May 2016 22:40:47 +0200 (CEST) Wido den Hollander wrote:
> > > 
> > > pg_num is the actual amount of PGs. This you can increase without any
> > > actual data moving.
> >
> > Yes and no.
> > 
> > Increasing the pg_num will split PGs, which causes potentially massive
> > I/O. Also AFAIK that I/O isn't regulated by the various recovery and
> > backfill parameters.
> 
> Where is this potentially massive I/O coming from? I have this naive
> concept that the PGs are mathematically-calculated buckets, so splitting
> them would involve little or no I/O, although I can imagine there are
> management overheads (cpu, memory) involved in correctly maintaining
> state during the splitting process.
>
I would have thought "splitting" to be pretty unambiguous, in that it
involves moving data.

That's on top, of course, of the CPU/RAM resources needed to create
those new PGs and have them peer.

Most of your questions would be easily answered if you spent a few
minutes with even the crappiest test cluster and observed things (with
atop and the like).
 
To wit, this is a test pool (pool ID 12) created with 32 PGs and lightly
filled with data via rados bench:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x   2 root root  4096 May 17 10:04 12.13_head
drwxr-xr-x   2 root root  4096 May 17 10:04 12.1e_head
drwxr-xr-x   2 root root  4096 May 17 10:04 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
121M    /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
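
For reference, reproducing that kind of setup boils down to something like
this (a rough sketch, with "testpool" and the bench duration as placeholders
rather than what I actually ran):
---
# create a replicated pool with 32 PGs (and pgp_num 32 to match)
ceph osd pool create testpool 32 32
# write some objects and keep them around so the split has data to move
rados bench -p testpool 60 write --no-cleanup
---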

After increasing that to 128 PGs we get this:
---
# ls -la /var/lib/ceph/osd/ceph-8/current/ |grep "12\."
drwxr-xr-x   2 root root  4096 May 17 10:18 12.13_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.1e_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.2b_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.33_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.3e_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.4b_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.53_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.5e_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.6b_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.73_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.7e_head
drwxr-xr-x   2 root root  4096 May 17 10:18 12.b_head
# du -h /var/lib/ceph/osd/ceph-8/current/12.13_head/
25M     /var/lib/ceph/osd/ceph-8/current/12.13_head/
---
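
The increase itself is nothing more than bumping the pool's pg_num and then
pgp_num, i.e. roughly (again with the placeholder "testpool"):
---
# split the existing PGs; this is what reshuffles the directories above
ceph osd pool set testpool pg_num 128
# then bump pgp_num so CRUSH places the new PGs independently
ceph osd pool set testpool pgp_num 128
---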

Now this was fairly uneventful even on my crappy test cluster, given the
small amount of data (which was mostly cached) and the fact that it's idle.

However, consider this with hundreds of GB per PG on a busy cluster and you
get the idea of where the massive and very disruptive I/O comes from.
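
You can actually see in the listing where each new PG comes from: with
pg_num going from 32 to 128 a parent PG keeps roughly a quarter of its
objects (121M down to 25M above) and the rest move into children offset by
multiples of the old pg_num, so 12.13 becomes 12.13, 12.33, 12.53 and 12.73.
Just to illustrate the pattern (not Ceph code):
---
# children of parent PG 0x13 when pg_num goes from 32 to 128
for k in 0 1 2 3; do printf '12.%x\n' $(( 0x13 + k * 32 )); done
# prints 12.13, 12.33, 12.53 and 12.73 (one per line), matching the ls output above
---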


> > That's probably why recent Ceph versions will only let you increase
> > pg_num in smallish increments. 
> 
> Oh, I wasn't aware of that!
> 
> Ok, so it looks like it's mon_osd_max_split_count, introduced by commit
> d8ccd73. Unfortunately it seems to be missing from the ceph docs. It's
> mentioned in the Suse docs:
> 
> https://www.suse.com/documentation/ses-2/singlehtml/book_storage_admin/book_storage_admin.html#storage.bp.cluster_mntc.add_pgnum
> 
> ...although, if I'm understanding "mon_osd_max_split_count" correctly,
> their script for calculating the maximum to which you can increase
> pg_num is incorrect in that it's calculating "current pg_num +
> mon_osd_max_split_count" when it should be "current pg_num +
> (mon_osd_max_split_count * number of pool OSDs)".
> 
> Hmmm, is there a generic command-line(ish) way of determining the number
> of OSDs involved in a pool?
> 
Unless you have a pool with a very small pg_num and a very large cluster,
the answer tends to be "all of them".
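
If you really want a number, something along these lines does it (a rough
sketch, assuming jq is available; the JSON layout of "ceph pg ls-by-pool"
varies a bit between releases, so the filter may need adjusting):
---
# count the distinct OSDs appearing in the up sets of the pool's PGs
ceph pg ls-by-pool testpool -f json | jq '[.[].up[]] | unique | length'
---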

And google ("ceph number of osds per pool") is your friend:

http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


