Hi David-

On Wed, 28 Mar 2012, David McBride wrote:
> On Tue, 2012-03-27 at 22:04 +0100, David McBride wrote:
> > On Tue, 2012-03-27 at 11:06 -0700, Sage Weil wrote:
> >
> > > This shouldn't change the PG count either.  If you do
> > >
> > >     ceph osd dump | grep ^pool
> > >
> > > you'll see a pg_num value for each pool that should remain constant.
> > > Only the size should change (replica count)...
> >
> > Okay, that's what I was expecting.  I rebuilt the cluster earlier and
> > repeated my previous results; however, I don't have the output of those
> > commands to hand.
>
> Hi,
>
> Results are in.  Something odd is going on; the results returned by
> `ceph -s` and `ceph osd dump` are inconsistent:
>
>  * `ceph osd dump` does indeed indicate that the pg_num values remain
>    constant for each pool before and after changing the replica count.
>
>  * However, the total number of PGs reported by `ceph -s` or `ceph -w`
>    increases immediately after issuing the replica-count change command
>    for a pool.  The increase is equal to the number of live OSDs; in
>    this case, 28.
>
>  * This apparent (silent) increase in PG count occurs three times if the
>    change is applied to all three pools: `data`, `metadata`, and `rbd`.
>
>  * Changing the replica count up and down again after the initial
>    increase has no further effect on the reported PG count.
>
>  * My steps for reproducing are:
>
>    - Mint a new cluster, with 14 OSDs stored on server A.
>    - Start the cluster.
>    - Add some data to the `rbd` pool using `rados bench`.
>    - Initialize 14 additional OSDs on server B.
>    - Add the server B OSDs to the cluster.
>    - Increase the replica count.
>
>    This process is probably not minimal.  I can try to run some
>    experiments to see which factors are significant.  (I'm pretty sure
>    I could skip the `rados bench` step, for example.)
>
>  * In case it makes a difference, I'm using XFS, not btrfs, for the
>    OSDs' backing store.
>
>
> Here's the output of the ceph status commands during the various stages:
>
>
> Prior to OSD addition:
> ======================
>
> output from `ceph -s`:
>
> > 2012-03-28 12:30:25.086686    pg v133: 2772 pgs: 2772 active+clean; 13252 MB data, 55720 MB used, 1857 GB / 1911 GB avail
> > 2012-03-28 12:30:25.101300   mds e1: 0/0/1 up
> > 2012-03-28 12:30:25.101424   osd e11: 14 osds: 14 up, 14 in
> > 2012-03-28 12:30:25.101689   log 2012-03-28 12:25:22.596734 mon.0 146.169.21.55:6789/0 16 : [INF] osd.8 146.169.1.13:6836/6339 boot
> > 2012-03-28 12:30:25.101897   mon e1: 1 mons at {vm-cephhead=146.169.21.55:6789/0}
>
> output from `ceph osd dump | grep pg_num`:
>
> > pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0
> > pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0

Oh!  lpg_num is > 0, which means a small number of "localized" PGs are
created for every OSD.  These aren't used by anything currently (they were
originally added to support Hadoop-style placement, but even there we
don't use them).

I'm guessing the PG count jumped when you added the OSDs, not when you
adjusted the replica count.  You can confirm by looking at `ceph pg dump`
before and after; you should see that the new PGs all have a 'p##' suffix
(where ## is the OSD they are localized to).
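For example, something along these lines should show where the extra PGs
come from (untested; it assumes the pgid is the first whitespace-separated
column of `ceph pg dump` output and that localized pgids end in 'p<osd>'):

    # Snapshot the PG list before and after adding the new OSDs.
    ceph pg dump > pgdump.before     # run this before adding the OSDs
    ceph pg dump > pgdump.after      # run this again afterwards

    # Count pgids carrying a 'p<osd>' suffix, i.e. localized PGs.
    awk '{print $1}' pgdump.before | grep -c 'p[0-9][0-9]*$'
    awk '{print $1}' pgdump.after  | grep -c 'p[0-9][0-9]*$'

If that's what's happening, only the localized count should grow, and the
pg_num values in `ceph osd dump` should stay put.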
We probably want to turn those off by default, since they are unused.
(One way to avoid them on a fresh cluster is sketched at the bottom of
this mail.)

sage

>
> Adding the second set of OSDs:
> ==============================
>
> output from `ceph osd dump | grep pg_num`:
>
> > pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0
> > pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0
>
> Changing `rbd` pool replica count:
> ==================================
>
> output from `ceph -w`:
>
> > 2012-03-28 12:36:36.768714    pg v313: 2772 pgs: 2772 active+clean; 13252 MB data, 86002 MB used, 3739 GB / 3823 GB avail
> > 2012-03-28 12:36:37.763292    pg v314: 2772 pgs: 2772 active+clean; 13252 MB data, 85163 MB used, 3740 GB / 3823 GB avail
>
> (issued: `ceph osd pool set rbd size 3`)
>
> > 2012-03-28 12:36:42.308575    pg v315: 2800 pgs: 28 creating, 2772 active+clean; 13252 MB data, 85163 MB used, 3740 GB / 3823 GB avail
> > 2012-03-28 12:36:42.314124   osd e105: 28 osds: 28 up, 28 in
> > 2012-03-28 12:36:43.399792    pg v316: 2800 pgs: 28 creating, 2772 active+clean; 13252 MB data, 85163 MB used, 3740 GB / 3823 GB avail
> > 2012-03-28 12:36:43.402742   osd e106: 28 osds: 28 up, 28 in
> > 2012-03-28 12:36:46.691598    pg v317: 2800 pgs: 28 creating, 2737 active+clean, 35 active+recovering; 13252 MB data, 84818 MB used, 3740 GB / 3823 GB avail; 274/6765 degraded (4.050%)
> > 2012-03-28 12:36:47.596507    pg v318: 2800 pgs: 28 creating, 2709 active+clean, 63 active+recovering; 13252 MB data, 84819 MB used, 3740 GB / 3823 GB avail; 524/6890 degraded (7.605%)
>
> output from `ceph osd dump | grep pg_num`:
>
> > pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0
> > pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 105 owner 0
>
>
> Changing `data` pool replica count:
> ===================================
>
> output from `ceph -w`:
>
> > 2012-03-28 13:10:54.818447    pg v573: 2800 pgs: 14 creating, 2786 active+clean; 13252 MB data, 98329 MB used, 3727 GB / 3823 GB avail
>
> (issued: `ceph osd pool set data size 3`)
>
> > 2012-03-28 13:11:08.240605    pg v574: 2828 pgs: 42 creating, 2786 active+clean; 13252 MB data, 98329 MB used, 3727 GB / 3823 GB avail
> > 2012-03-28 13:11:08.245026   osd e114: 28 osds: 28 up, 28 in
> > 2012-03-28 13:11:09.050371    pg v575: 2828 pgs: 42 creating, 2786 active+clean; 13252 MB data, 98329 MB used, 3727 GB / 3823 GB avail
> > 2012-03-28 13:11:09.051179   osd e115: 28 osds: 28 up, 28 in
>
> output from `ceph osd dump | grep pg_num`:
>
> > pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 114 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 1 owner 0
> > pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 112 owner 0
>
>
> Changing `metadata` pool replica count:
> =======================================
>
> output from `ceph -w`:
>
> > 2012-03-28 13:12:04.279557    pg v580: 2828 pgs: 28 creating, 2800 active+clean; 13252 MB data, 98338 MB used, 3727 GB / 3823 GB avail
>
> (issued: `ceph osd pool set metadata size 3`)
>
> > 2012-03-28 13:13:19.748554    pg v581: 2856 pgs: 56 creating, 2800 active+clean; 13252 MB data, 98338 MB used, 3727 GB / 3823 GB avail
> > 2012-03-28 13:13:19.753181   osd e116: 28 osds: 28 up, 28 in
> > 2012-03-28 13:13:20.840151    pg v582: 2856 pgs: 56 creating, 2800 active+clean; 13252 MB data, 98338 MB used, 3727 GB / 3823 GB avail
> > 2012-03-28 13:13:20.842065   osd e117: 28 osds: 28 up, 28 in
>
> output from `ceph osd dump | grep pg_num`:
>
> > pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 114 owner 0 crash_replay_interval 45
> > pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 116 owner 0
> > pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 896 pgp_num 896 lpg_num 2 lpgp_num 2 last_change 112 owner 0
>
>
> This sounds like it's probably a defect.  Should I mint a new bug ticket
> in the tracker?
>
> Cheers,
> David
> --
> David McBride <dwm@xxxxxxxxxxxx>
> Department of Computing, Imperial College, London
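In the meantime, if you want to avoid creating localized PGs on a freshly
minted cluster, something like the following ceph.conf fragment should do
it (untested; it assumes the `osd lpg bits` option is what determines
lpg_num for the initial pools, and it has to be in place before the
cluster is created, since it does not touch existing pools):

    [global]
        ; Assumption: with 'osd lpg bits = 0', newly created pools come up
        ; with lpg_num 0 / lpgp_num 0, so no localized PGs are created when
        ; OSDs are added later.  Existing pools keep their current lpg_num.
        osd lpg bits = 0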