Re: fast_read in EC pools

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 26 Feb 2018 22:15:17 +0000

On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

>     >     The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.

>     >     So necessarily, there is one shard for each host. If one host goes down for a prolonged time,

>     >     there's no "logical" advantage of redistributing things - since whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.

>     >

>     >     However, I noticed Ceph is remapping all PGs, and actively moving data. I presume now this is done for two reasons:

>     >     - The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data,

>     >       or is there?

>     >     - The data movement is done to have the "k" shards available.

>     >     If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?

>     >

>     >

>     > If you lose one OSD out of a host, Ceph is going to try and re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and those OSDs need to be placed on different hosts.

>     >

>     > You're right that gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.

>

>     Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would be our starting point - since we may add more hosts later (not too soon-ish, but it's not excluded more may come in a year or so),

>     and migrating large EC pools to different settings still seems a bit messy.

>     We can't really afford to reduce available storage significantly more in the current setup, and would like to have the possibility to lose one host (for example for an OS upgrade),

>     and then still lose a few disks in case they fail with bad timing.

>

>     >

>     > Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" crush rules?)

>

>     They are indep, which I think is the default (no manual editing done). I thought the main goal of indep was exactly to reduce data movement.

>     Indeed, it's very funny that data is moved, it certainly does not help to increase redundancy ;-).

>
<snip>

>

> Can you also share the output of "ceph osd crush dump"?

Attached.

Yep, that all looks simple enough.

Do you have any "ceph -s" or other records from when this was occurring? Is it actually deleting or migrating any of the existing shards, or is it just that the shards which were previously on the out'ed OSDs are now getting copied onto the remaining ones?

I think I finally understand what's happening here but would like to be sure. :)
-Greg

(In short: certain straws that previously mapped onto osd.[outed] now map onto the remaining OSDs. Because everything's independent, the actual CRUSH mapping for any shard other than the last is now going to land on a remaining OSD, which would displace the shard that OSD already holds. But the previously-present shard is going to stay there as "remapped" because it can't be mapped anywhere else. So if you lose osd.5, you'll go from a CRUSH mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 (counting positions from 0) will both be on OSD 4.)
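
To make the osd.5 example concrete, here is a minimal toy sketch in Python. It does not run CRUSH; it only takes the "before" and "after" mappings from the example as given and works out which OSD ends up holding which shard. The names `UNMAPPED` and `where_shards_live` are hypothetical, made up for this illustration, and the rule "a shard with no new mapping stays where it was" is the assumption described in the paragraph above.

```python
UNMAPPED = None  # stand-in for a slot CRUSH could not fill


def where_shards_live(old_up, new_up, lost_osd):
    """Toy model: return {osd: [shard indices]} after losing `lost_osd`.

    old_up / new_up are per-shard OSD lists (position = shard index),
    copied from the example rather than computed by CRUSH.
    """
    holders = {}
    for shard, (old, new) in enumerate(zip(old_up, new_up)):
        if new is not UNMAPPED:
            osd = new   # shard follows its new mapping (rebuilt there if it moved)
        elif old != lost_osd:
            osd = old   # no new mapping: the existing copy stays put ("remapped")
        else:
            continue    # no new mapping and the old copy is gone
        holders.setdefault(osd, []).append(shard)
    return holders


old_up = [1, 3, 5, 0, 2, 4]
new_up = [1, 3, 4, 0, 2, UNMAPPED]
holders = where_shards_live(old_up, new_up, lost_osd=5)
for osd in sorted(holders):
    print(f"osd.{osd}: shards {holders[osd]}")
```

In this model osd.4 is listed with shards 2 and 5: shard 2 has to be rebuilt there, while shard 5 simply never leaves.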
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com