Re: Default min_size value for EC pools

Dear Paul,

thank you very much for this clarification. I believe ZFS erasure-coded data also has this property, which is probably the main reason for the expectation of min_size=k. So, basically, min_size=k means that we are at the security level of traditional redundant storage, and this may or may not be good enough - there is no additional risk beyond that. Ceph's default says it is not good enough. That's perfectly fine - assuming the rebuild gets fixed.

I have a follow-up: I thought that non-redundant writes would almost never occur, because PGs get remapped before accepting writes. To stay with my example of 2 (out of 16) failure domains failing simultaneously, I thought that all PGs would immediately be remapped to fully redundant sets, because there are still 14 failure domains up and only 10 are needed for the 8+2 EC profile. Furthermore, I assumed that writes would not be accepted before a PG is remapped, meaning that every new write would always be fully redundant while recovery I/O slowly recreates the missing objects in the background.
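For context, this is roughly how I check this during such an event (the pool name "ec-data" is just a placeholder for my 8+2 pool):

    ceph osd pool get ec-data min_size        # shows the pool's current min_size (9 for the 8+2 default)
    ceph pg dump_stuck undersized             # PGs whose acting set has fewer than 'size' shards
    ceph pg ls-by-pool ec-data remapped       # PGs already remapped to a new up set

If a "remap first" strategy were in place, I would expect the undersized list to shrink quickly right after the failure, long before recovery of the old objects completes.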

If this "remap first" strategy is not the current behaviour, would it make sense to consider this as an interesting feature? Is there any reason for not remapping all PGs (if possible) prior to starting recovery? It would eliminate the lack of redundancy for new writes (at least for new objects).

Thanks again and best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Emmerich <paul.emmerich@xxxxxxxx>
Sent: 20 May 2019 21:23
To: Frank Schilder
Cc: florent; ceph-users
Subject: Re:  Default min_size value for EC pools

Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is simply that it means you are accepting writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, I've written your data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems during rebuild (with 1 or 2 disk failures for RAID 5/6, respectively), but that doesn't mean it's a good idea.
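To make that concrete with the 8+2 example from your mail: with min_size=8 and two failure domains down, a new write is acknowledged with only 8 shards on disk, so losing any single further OSD before recovery completes makes that newly written object unreadable.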


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder <frans@xxxxxx> wrote:
This is an issue that is coming up every now and then (for example: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg50415.html) and I would consider it a very serious one (I will give an example below). A statement like "min_size = k is unsafe and should never be set" deserves a bit more explanation, because Ceph is the only storage system I know of for which k+m redundancy does *not* mean "you can lose up to m disks and still have read-write access". If this is really true then, assuming the same redundancy level, losing service (client access) is significantly more likely with Ceph than with other storage systems. And this has an impact on design and storage pricing.

However, some help seems to be on the way, and a feature update that is, in my opinion, utterly important seems almost finished: https://github.com/ceph/ceph/pull/17619 . It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O happens as long as min_size shards are available
- the recommended setting is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). This pull request also offers useful further information.
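For anyone who wants to check where their own pools stand relative to this recommendation, something along these lines should show the relevant values (pool and profile names are placeholders):

    ceph osd pool get ec-data erasure_code_profile   # which EC profile the pool uses
    ceph osd erasure-code-profile get ec-8-2         # shows k, m, failure domain, etc.
    ceph osd pool get ec-data min_size               # compare against k and k+1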

Apparently, there is some kind of rare issue with erasure coding in Ceph that makes it problematic to use min_size=k. I couldn't find anything better than vague explanations. Quote from the thread above: "Recovery on EC pools requires min_size rather than k shards at this time. There were reasons; they weren't great."

This is actually a situation I was in. I once lost 2 failure domains simultaneously on an 8+2 EC pool and was really surprised that recovery stopped after some time, with the worst degraded PGs remaining unfixed. I discovered that min_size was 9 (instead of 8), and "ceph health detail" recommended reducing min_size. Before doing so, I searched the web (I mean, why the default of k+1? Come on, there must be a reason.) and found some vague hints about problems with min_size=k during rebuild. This is a really bad corner to be in. A lot of PGs were already critically degraded, and the only way forward was to make a bad situation worse, because reducing min_size would immediately enable client I/O in addition to recovery I/O.
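For the record, the workaround suggested by the health warning boils down to something like this (pool name is a placeholder), which is exactly the step I was reluctant to take:

    ceph osd pool set ec-data min_size 8   # temporarily allow client I/O (and, currently, recovery) with only k shards
    # ...wait for recovery to complete, then restore the safer setting:
    ceph osd pool set ec-data min_size 9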

It looks like the default of min_size=k+1 will stay, because min_size=k does have some rare issues and these do not seem likely to disappear. (I hope I'm wrong, though.) Hence, if min_size=k remains problematic, the recommendation should be "never use m=1" instead of "never use min_size=k". In other words, instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like to have safe write access with up to n disk losses, then m>=n+1 is needed.
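As an illustration of that recommendation (all names are placeholders), a 4+2 pool with min_size=k+1 would be set up roughly like this:

    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ec-data 128 128 erasure ec-4-2
    ceph osd pool set ec-data min_size 5   # k+1, so writes stop while a PG still has one shard of headroom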

If this issue remains, in my opinion it should be taken up in the best practices section. In particular, the documentation should not use examples with m=1, as this gives the wrong impression. Either min_size=k is safe or it is not. If it is not, it should never be used anywhere in the documentation.

I hope I marked my opinions and hypotheses clearly and that the links are helpful. If anyone could shed some light on why exactly min_size=k+1 is important, I would be grateful.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



