Re: fast_read in EC pools

Dear Caspar,

many thanks for the link! 

Now I'm pondering - the problem is that we'd certainly want to keep m=2, but reducing k from 4 to 3 already means a significant reduction in usable storage. 
We'll evaluate how likely it is that new OSD hosts will be added in the near future and take that into account before deciding on the final configuration. 
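
Just to put numbers on it: the usable fraction of the raw capacity is k/(k+m), so

    k=4, m=2:  4/6 ≈ 66.7 % usable
    k=3, m=2:  3/5 = 60.0 % usable

i.e. for the same raw capacity, k=3 leaves us with roughly 10 % less usable space than k=4.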

Many thanks again, this surely helps a lot!
	Oliver

On 27.02.2018 at 14:45, Caspar Smit wrote:
> Oliver,
> 
> Here's the commit info:
> 
> https://github.com/ceph/ceph/commit/48e40fcde7b19bab98821ab8d604eab920591284
> 
> Caspar
> 
> On 2018-02-27 14:28 GMT+01:00, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     On 27.02.2018 at 14:16, Caspar Smit wrote:
>     > Oliver,
>     >
>     > Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node failure the min_size is already reached.
>     > Any OSD failure beyond the node failure will probably result in some PGs becoming incomplete (I/O freeze) until the incomplete PGs' data is recovered to another OSD in that node.
>     >
>     > So please reconsider your statement "one host + x safety" as the x safety (with I/O freeze) is probably not what you want.
>     >
>     > Forcing it to run with min_size=4 could also be dangerous for other reasons (there's a reason why min_size is k+1).
> 
>     Thanks for pointing this out!
>     Yes, indeed - if we ever need to take down a host for a longer period (we hope this never has to happen for > 24 hours... but you never know)
>     and disks start to fail on top of that, we would have to drop to min_size=4 to keep running.
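>
>     For reference, this is roughly what I'd expect to run in that situation ("ecpool" is just a placeholder for our pool name):
>
>         ceph osd pool get ecpool min_size     # should report 5 (k+1) for k=4,m=2
>         ceph osd pool set ecpool min_size 4   # emergency only, while the host is down
>
>     and we'd set min_size back to 5 as soon as the host is back in.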
> 
>     What exactly are the implications?
>     It should still be possible to verify the data is not corrupt (via the checksums), and recovery back to k+1 shards should start automatically once a disk fails -
>     so what is the actual risk?
>     Of course pg repair cannot work in that case (if a PG that lost the additional disk turns out to be corrupted),
>     but in general, when a host needs to be reinstalled, we'd try to bring it back with the OSD data intact -
>     which should allow us to postpone the repair until then.
> 
>     Is there a danger I'm missing in my reasoning?
> 
>     Cheers and many thanks!
>             Oliver
> 
>     >
>     > Caspar
>     >
>     > On 2018-02-27 0:17 GMT+01:00, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     On 27.02.2018 at 00:10, Gregory Farnum wrote:
>     >     > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >     >
>     >     >
>     >     >     >     Does this match expectations?
>     >     >     >
>     >     >     >
>     >     >     > Can you get the output of e.g. "ceph pg 2.7cd query"? I want to make sure the backfilling versus acting sets and things are correct.
>     >     >
>     >     >     You'll find attached:
>     >     >     query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are up and everything is healthy.
>     >     >     query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 164-195 (one host) are down and out.
>     >     >
>     >     >
>     >     > Yep, that's what we want to see. So when everything's well, we have OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.
>     >     >
>     >     > When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED. That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
>     >     >
>     >     > So what's happened is that with the new map, when choosing the home for shard 4, we selected host 4 instead of host 6 (which is gone). And now shard 5 can't map properly. But of course we still have shard 5 available on host 4, so host 4 is going to end up properly owning shard 4, but also just carrying that shard 5 around as a remapped location.
>     >     >
>     >     > So this is as we expect. Whew.
>     >     > -Greg
>     >
>     >     Understood. Thanks for explaining step by step :-).
>     >     It's of course a bit odd that this happens, since in the end it means data is moved (or rather, a shard is recreated) and takes up space without increasing redundancy
>     >     (well, it might increase it if the recreated shard lands on a different OSD than shard 5, but that's not guaranteed). I'm unsure whether this can be solved "better" in any way.
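>     >
>     >     For anyone reading this later: the up vs. acting sets Greg refers to can also be checked without the full query, e.g. with
>     >
>     >         ceph pg map 2.7cd
>     >
>     >     which prints the up and acting OSD sets for that PG - if I remember correctly, the unmapped shard then shows up as 2147483647 (CRUSH found no valid OSD for it).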
>     >
>     >     Anyway, it seems this would be another reason why running with k+m = number of hosts should not be a general recommendation. For us, it's fine for now,
>     >     especially since we want to keep the cluster open for later extension with more OSD hosts, and we now know the gotchas - and I don't see a better EC configuration at the moment
>     >     which would accommodate our wishes (one host + x safety, without reducing usable space too much).
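>     >
>     >     For completeness, this is roughly how the profile we're talking about would be created ("ec42" is just a placeholder name):
>     >
>     >         ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
>     >         ceph osd erasure-code-profile get ec42
>     >
>     >     With crush-failure-domain=host, each of the 6 shards has to land on a different host - which is exactly why losing one of our 6 hosts leaves a shard unmapped.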
>     >
>     >     So thanks again!
>     >
>     >     Cheers,
>     >             Oliver
>     >
>     >
>     >
>     >
> 
> 


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
