Re: fast_read in EC pools

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 26 Feb 2018 22:48:15 +0000

On Mon, Feb 26, 2018 at 2:30 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
On 26.02.2018 at 23:15, Gregory Farnum wrote:

>

>

> On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

>

>     >     >     The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.

>     >     >     So necessarily, there is one shard for each host. If one host goes down for a prolonged time,

>     >     >     there's no "logical" advantage of redistributing things - since whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.

>     >     >

>     >     >     However, I noticed Ceph is remapping all PGs, and actively moving data. I presume now this is done for two reasons:

>     >     >     - The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data,

>     >     >       or is there?

>     >     >     - The data movement is done to have the "k" shards available.

>     >     >     If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?

>     >     >

>     >     >

>     >     > If you lose one OSD out of a host, Ceph is going to try and re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and those OSDs need to be placed on different hosts.

>     >     >

>     >     > You're right that gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.

>     >

>     >     Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would be our starting point - since we may add more hosts later (not too soon-ish, but it's not excluded more may come in a year or so),

>     >     and migrating large EC pools to different settings still seems a bit messy.

>     >     We can't really afford to reduce available storage significantly more in the current setup, and would like to have the possibility to lose one host (for example for an OS upgrade),

>     >     and then still lose a few disks in case they fail with bad timing.

>     >

>     >     >

>     >     > Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" crush rules?)

>     >

>     >     They are indep, which I think is the default (no manual editing done). I thought the main goal of indep was exactly to reduce data movement.

>     >     Indeed, it's very funny that data is moved, it certainly does not help to increase redundancy ;-).

>     >

>     <snip>

>     >

>     > Can you also share the output of "ceph osd crush dump"?

>

>     Attached.

>

>

> Yep, that all looks simple enough.

>

> Do you have any "ceph -s" or other records from when this was occurring? Is it actually deleting or migrating any of the existing shards, or is it just that the shards which were previously on the out'ed OSDs are now getting copied onto the remaining ones?

>

> I think I finally understand what's happening here but would like to be sure. :)

> -Greg

>

> (In short: certain straws were previously mapping onto osd.[outed], but now they map onto the remaining OSDs. Because everything's independent, the actual CRUSH mapping for any shard other than the last is now going to map onto a remaining OSD, which would displace the shard it already holds. But the previously-present shard is going to remain "remapped" there because it can't map successfully. So if you lose osd.5, you'll go from a CRUSH mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 will both be on OSD 4.)
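The bookkeeping in that example can be sketched in a few lines of Python. This is only an illustration of the hypothesis as stated, not of how CRUSH actually computes mappings: the function name shard_locations and the UNMAPPED marker are made up here, and the two lists are the ones from the example above.

```python
# Sketch of the hypothesis above, NOT real CRUSH logic.
# UNMAPPED and shard_locations are illustrative names only.
UNMAPPED = None


def shard_locations(old_map, new_map):
    """Return {osd: [shard indices]} once backfill to new_map is done.

    Assumption: a shard whose slot is UNMAPPED in new_map simply stays
    on the OSD that held it in old_map.
    """
    placement = {}
    for shard, (old_osd, new_osd) in enumerate(zip(old_map, new_map)):
        osd = old_osd if new_osd is UNMAPPED else new_osd
        placement.setdefault(osd, []).append(shard)
    return placement


old = [1, 3, 5, 0, 2, 4]             # mapping before osd.5 is lost
new = [1, 3, 4, 0, 2, UNMAPPED]      # mapping after osd.5 is lost

placement = shard_locations(old, new)
doubled = {osd: s for osd, s in placement.items() if len(s) > 1}
print(doubled)  # {4: [2, 5]}
```

With zero-based shard indices this reproduces the statement that shards 2 and 5 both end up on OSD 4.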

Interesting! This would also mean that space usage on the remaining-active OSDs would increase by 1/6 in our setup, which is significant.

So that's another good reason to use mon_osd_down_out_subtree_limit=host or to just run "ceph osd set noout" when actively reinstalling a host.
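For the noout variant, a small wrapper can make sure the flag is cleared again after the maintenance work. The following is a minimal sketch, assuming the ceph CLI is on the PATH of an admin node; the helper name noout and its run parameter are made up for illustration, and the mon_osd_down_out_subtree_limit setting is not covered here.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def noout(run=subprocess.check_call):
    """Set the cluster-wide noout flag for the duration of the block.

    `run` is a hypothetical injection point so the helper can be
    dry-run without a cluster; by default it executes the command.
    """
    run(["ceph", "osd", "set", "noout"])
    try:
        yield
    finally:
        # Always clear the flag, even if the maintenance step raised.
        run(["ceph", "osd", "unset", "noout"])


# Dry run: record the commands instead of executing them.
issued = []
with noout(run=issued.append):
    pass  # reinstall / reboot the host here

for cmd in issued:
    print(" ".join(cmd))
```

In the dry run this only prints the set and unset commands; dropping the run argument would execute them against the cluster.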

I just reproduced it. Here's what I see (ignore the inconsistent PG; that's unrelated and likely a consequence of previous OSD OOM issues):

# ceph -s

  cluster:

    id:     69b1fbe5-f084-4410-a99a-ab57417e7846

    health: HEALTH_ERR

            41569430/513248666 objects misplaced (8.099%)

            1 scrub errors

            Possible data damage: 1 pg inconsistent

            Degraded data redundancy: 105575103/513248666 objects degraded (20.570%), 2176 pgs degraded, 985 pgs undersized

  services:

    mon: 3 daemons, quorum mon003,mon001,mon002

    mgr: mon002(active), standbys: mon001, mon003

    mds: cephfs_baf-1/1/1 up  {0=mon002=up:active}, 1 up:standby-replay, 1 up:standby

    osd: 196 osds: 164 up, 164 in; 1166 remapped pgs

  data:

    pools:   2 pools, 2176 pgs

    objects: 89370k objects, 4488 GB

    usage:   29546 GB used, 555 TB / 584 TB avail

    pgs:     105575103/513248666 objects degraded (20.570%)

             41569430/513248666 objects misplaced (8.099%)

             1166 active+undersized+degraded+remapped+backfilling

             1009 active+undersized+degraded

             1    active+undersized+degraded+inconsistent

  io:

    client:   6784 kB/s rd, 6820 kB/s wr, 804 op/s rd, 1174 op/s wr

    recovery: 79333 kB/s, 27 keys/s, 1080 objects/s

In ceph health detail, I see:

    pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting [91,63,33,163,2147483647,103]

    pg 2.7ce is stuck undersized for 114.063431, current state active+undersized+degraded+remapped+backfilling, last acting [31,121,157,2147483647,61,87]

    pg 2.7cf is stuck undersized for 110.842287, current state active+undersized+degraded+remapped+backfilling, last acting [163,36,2147483647,21,124,69]

    pg 2.7d0 is stuck undersized for 118.876276, current state active+undersized+degraded+remapped+backfilling, last acting [140,91,66,22,2147483647,112]

    pg 2.7d1 is stuck undersized for 388.377010, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]

    pg 2.7d2 is stuck undersized for 111.265718, current state active+undersized+degraded+remapped+backfilling, last acting [54,125,2147483647,157,88,21]

    pg 2.7d3 is stuck undersized for 105.885607, current state active+undersized+degraded+remapped+backfilling, last acting [20,117,96,2147483647,144,54]

    pg 2.7d4 is stuck undersized for 112.693680, current state active+undersized+degraded+remapped+backfilling, last acting [105,145,71,60,2147483647,13]

    pg 2.7d5 is stuck undersized for 388.337919, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]

[...]

Whereas when the host's OSDs were only down, but still in, I saw:

    pg 2.7cd is active+undersized+degraded, acting [91,63,33,163,2147483647,103]

    pg 2.7ce is stuck undersized for 145.507311, current state active+undersized+degraded, last acting [31,121,157,2147483647,61,87]

    pg 2.7cf is stuck undersized for 143.293067, current state active+undersized+degraded, last acting [163,36,2147483647,21,124,69]

    pg 2.7d0 is stuck undersized for 145.461503, current state active+undersized+degraded, last acting [140,91,66,22,2147483647,112]

    pg 2.7d1 is stuck undersized for 145.496089, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]

    pg 2.7d2 is stuck undersized for 145.513296, current state active+undersized+degraded, last acting [54,125,2147483647,157,88,21]

    pg 2.7d3 is stuck undersized for 145.503361, current state active+undersized+degraded, last acting [20,117,96,2147483647,144,54]

    pg 2.7d4 is stuck undersized for 145.484259, current state active+undersized+degraded, last acting [105,145,71,60,2147483647,13]

    pg 2.7d5 is stuck undersized for 145.456998, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
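In both listings the entry 2147483647 is 0x7fffffff, the placeholder CRUSH uses when no OSD is assigned to a slot (CRUSH_ITEM_NONE). A minimal sketch for pulling the empty shard positions out of such lines follows; the function name missing_shards and the regular expression are made up here and only tested against the line format shown above.

```python
import re

CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, "no OSD in this slot"

# Matches both "..., acting [..]" and "..., last acting [..]" lines.
LINE_RE = re.compile(r"pg (\S+) is .*acting \[([\d,]+)\]")


def missing_shards(lines):
    """Map each PG id to the zero-based shard positions without an OSD."""
    result = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a per-PG line
        acting = [int(osd) for osd in match.group(2).split(",")]
        result[match.group(1)] = [
            pos for pos, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE
        ]
    return result


# Two lines copied from the "ceph health detail" output above.
sample = [
    "pg 2.7cd is active+undersized+degraded+remapped+backfilling, "
    "acting [91,63,33,163,2147483647,103]",
    "pg 2.7ce is stuck undersized for 114.063431, current state "
    "active+undersized+degraded+remapped+backfilling, "
    "last acting [31,121,157,2147483647,61,87]",
]
print(missing_shards(sample))  # {'2.7cd': [4], '2.7ce': [3]}
```

For the PGs listed, the acting sets are the same in both listings; only the PG state differs.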

Does this match expectations?

Can you get the output of, e.g., "ceph pg 2.7cd query"? I want to make sure the backfilling versus acting sets and so on are correct.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com