Max PGs per OSD creation limit?

On Mon, Jul 14, 2014 at 2:16 AM, Christian Balzer <chibi at gol.com> wrote:
>
> Hello,
>
> new firefly cluster, currently just 1 storage node with 8 OSDs (3TB HDDs,
> journals on 4 DC3700 SSDs) and 3 mons; the rest of the storage nodes are
> still in the queue. Thus a replication of 1.
>
> Now this is the 2nd incarnation of this "cluster", I did a first one a few
> days ago and this did NOT happen then.
> Neither was any software changed or updated and I definitely didn't see
> that with my emperor cluster when I increased PG_NUMs early in its life.
>
> ---
> root at ceph-01:~# ceph osd pool set rbd pg_num 1024
> Error E2BIG: specified pg_num 1024 is too large (creating 960 new PGs on ~8 OSDs exceeds per-OSD max of 32)
> ---
>
> And indeed when limiting it to 256 it worked (and so did further
> increases, albeit in steps of 256).
>
> While I see _why_ one would want to limit operations like this that could
> lead to massive data movement, when and where was this limit introduced?
> Is it maybe triggered by data being present, even if it isn't actual Ceph
> data, as shown here:
> ---
>      osdmap e86: 8 osds: 8 up, 8 in
>       pgmap v444: 1152 pgs, 3 pools, 0 bytes data, 0 objects
>             384 MB used, 22344 GB / 22345 GB avail
>                 1152 active+clean
> ---
>
> Also for the performance keeping record, I tested this cluster with rados
> bench (write) and a block size of 4K.
> At 256 PGs (and PGPs, before somebody asks) it was capable of 1500 IOPS.
> At 1024 PGs it was capable of 3500 IOPS, with clearly higher CPU usage,
> but very much within the capabilities of the machine.
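
For anyone wanting to reproduce the above: the stepwise increase boils down
to roughly the following. This is a sketch, not the exact commands used; the
grep on "creating" is just one way to wait out each step.
---
# grow the rbd pool from 256 to 1024 PGs in steps of 256, keeping pgp_num in sync
for n in 512 768 1024; do
    ceph osd pool set rbd pg_num $n
    ceph osd pool set rbd pgp_num $n
    # wait until the new PGs have finished creating before the next step
    while ceph -s | grep -q creating; do sleep 5; done
done
---
The 4K write benchmark quoted was presumably something along the lines of
(run length and concurrency aren't stated in the post):
---
rados -p rbd bench 60 write -b 4096 -t 16
---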

This limit was added in response to a user request in one of the pre-Firefly
dev releases and backported to Dumpling. I don't think it depends on
data in the pool, but it is not in effect when *creating* new pools,
only when *splitting* the PGs involved. That's because splits are a
little more expensive for the OSD, and have to happen synchronously
instead of asynchronously. This limit is an attempt to prevent you
from shooting yourself in the foot and accidentally knocking cluster
IO completely offline for a few tens of seconds.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
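
For completeness: the per-OSD cap in that error message appears to correspond
to the mon_osd_max_split_count option, which defaults to 32. If you really
want to split in bigger jumps, something like the following should loosen it
(untested, and it arguably defeats the point of the guard rail):
---
# ceph.conf on the monitors
[mon]
    mon osd max split count = 128
---
or injected at runtime:
---
ceph tell mon.* injectargs '--mon_osd_max_split_count 128'
---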

