Re: fast_read in EC pools

Oliver,

Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node failure you are already at min_size.
Any OSD failure beyond the node failure will probably cause some PGs to become incomplete (I/O freeze) until the incomplete PGs' data has been recovered to another OSD in that node.

So please reconsider your statement "one host + x safety": the x safety (with I/O freeze) is probably not what you want.

Forcing the pool to run with min_size=4 could also be dangerous for other reasons (there is a reason why the default is min_size = k+1).
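
(For reference, a quick way to check what a pool is actually running with; the pool name "mypool_ec" and profile name "ec42" below are only placeholders, not from this thread:

    ceph osd pool get mypool_ec min_size               # expect 5 for k=4,m=2
    ceph osd pool get mypool_ec erasure_code_profile
    ceph osd erasure-code-profile get ec42             # shows k, m, crush-failure-domain, ...

Lowering it with "ceph osd pool set mypool_ec min_size 4" is possible, but see the caveat above.)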

Caspar

2018-02-27 0:17 GMT+01:00 Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>:
Am 27.02.2018 um 00:10 schrieb Gregory Farnum:
> On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx <mailto:freyermuth@xxxxxxxxxx-bonn.de>> wrote:
>
>
>     >     Does this match expectations?
>     >
>     >
>     > Can you get the output of eg "ceph pg 2.7cd query"? Want to make sure the backfilling versus acting sets and things are correct.
>
>     You'll find attached:
>     query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are up and everything is healthy.
>     query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 164-195 (one host) are down and out.
>
>
> Yep, that's what we want to see. So when everything's well, we have OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.
>
> When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED. That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
>
> So what's happened is that with the new map, when choosing the home for shard 4, we selected host 4 instead of host 6 (which is gone). And now shard 5 can't map properly. But of course we still have shard 5 available on host 4, so host 4 is going to end up properly owning shard 4, but also just carrying that shard 5 around as a remapped location.
>
> So this is as we expect. Whew.
> -Greg

Understood. Thanks for explaining step by step :-).
It's of course a bit weird that this happens, since in the end it really means data is moved (or rather, a shard is recreated) and takes up space without increasing redundancy
(well, it might increase it if the recreated shard lands on a different OSD than shard 5, but that's not really ensured). I'm unsure whether this can be solved "better" in any way.
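
(Side note, in case anyone wants to check this on their own cluster: shards that are only carried around as remapped copies show up as a difference between the "up" and the "acting" set, e.g.

    ceph pg map 2.7cd      # prints the up set and the acting set for this PG
    ceph pg ls remapped    # lists PGs whose acting set differs from the up set

which is also what the query output attached earlier contains.)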

Anyway, it seems this would be another reason why running with k+m = number of hosts should not be a general recommendation. For us, it's fine for now,
especially since we want to keep the cluster open for later extension with more OSDs, and we now know the gotchas - and I don't see a better EC configuration at the moment
which would accommodate our wishes (one host + x safety, without reducing usable space too much).
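
(For completeness, a profile of the kind we're discussing would be created roughly like this; "ec42" is just an example name:

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    # raw-space overhead is (k+m)/k = 6/4 = 1.5x the stored data
    # min_size defaults to k+1 = 5, so with one host (= one shard per PG) down,
    # any further OSD failure pauses I/O on the affected PGs until recovery completes

)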

So thanks again!

Cheers,
        Oliver


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


