Re: Default min_size value for EC pools

Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is simply that it means you are accepting writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, I've written your data" if the loss of a single further disk leads to data loss.

Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems during rebuild (i.e., with 1 or 2 disk failures, respectively), but that doesn't mean it's a good idea.
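
If you want to check whether a pool of yours is in that state, something like the following shows the relevant values (pool and profile names are just placeholders, the commands are the standard ceph CLI):

    ceph osd pool get ecpool min_size                # current min_size of the pool
    ceph osd pool get ecpool erasure_code_profile    # which EC profile the pool uses
    ceph osd erasure-code-profile get ecprofile      # shows k and m; compare min_size to k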


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder <frans@xxxxxx> wrote:
This is an issue that comes up every now and then (for example: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg50415.html) and I would consider it a very serious one (I will give an example below). A statement like "min_size = k is unsafe and should never be set" deserves a bit more explanation, because ceph is the only storage system I know of for which k+m redundancy does *not* mean "you can lose up to m disks and still have read-write access". If this is really true then, assuming the same redundancy level, losing service (client access) is significantly more likely with ceph than with other storage systems. And this has an impact on design and storage pricing.

However, some help seems to be on the way, and a feature update that is, in my opinion, extremely important seems almost finished: https://github.com/ceph/ceph/pull/17619 . It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- the recommended setting is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). This pull request also offers useful further information.
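
For reference, setting up a pool along those lines looks roughly like this (profile name, pool name and PG counts are placeholders; the commands themselves are the standard ones):

    ceph osd erasure-code-profile set ec82 k=8 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec82
    ceph osd pool get ecpool min_size    # in my case this reported 9, i.e. k+1, by default
    ceph osd pool set ecpool min_size 9  # set it explicitly if it does not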

Apparently, there is some kind of rare issue with erasure coding in ceph that makes it problematic to use min_size=k. I couldn't find anything better than vague explanations. Quote from the thread above: "Recovery on EC pools requires min_size rather than k shards at this time. There were reasons; they weren't great."

This is actually a situation I was in. I once lost 2 failure domains simultaneously on an 8+2 EC pool and was really surprised that recovery stopped after some time with the worst degraded PGs remaining unfixed. I discovered that min_size was 9 (instead of 8) and that "ceph health detail" recommended reducing min_size. Before doing so, I searched the web (I mean, why the default k+1? Come on, there must be a reason.) and found some vague hints about problems with min_size=k during rebuild. This is a really bad corner to be in. A lot of PGs were already critically degraded, and the only way forward was to make a bad situation worse, because reducing min_size would immediately enable client I/O in addition to recovery I/O.
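
For the record, the workaround amounts to something like this (the pool name is a placeholder, with an 8+2 profile as in my case):

    ceph health detail                     # lists the degraded/undersized PGs
    ceph osd pool get ecpool min_size      # was 9 (= k+1) here
    ceph osd pool set ecpool min_size 8    # = k: lets the stuck PGs recover, but also
                                           #      re-enables client I/O on them
    ceph osd pool set ecpool min_size 9    # restore k+1 once recovery has finished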

It looks like the default of min_size=k+1 will stay, because min_size=k does have some rare issues and these do not seem to be going away. (I hope I'm wrong, though.) Hence, if min_size=k remains problematic, the recommendation should be "never use m=1" instead of "never use min_size=k". In other words, instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like to retain safe write access with n disks lost, then m>=n+1.
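
To illustrate the difference (profile names are placeholders):

    # 2+1 with min_size = k+1 = 3: any single failure domain down blocks client I/O
    ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
    # 4+2 with min_size = k+1 = 5: client I/O continues with one failure domain down
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host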

If this issue remains, in my opinion it should be taken up in the best practices section. In particular, the documentation should not use examples with m=1; this gives the wrong impression. Either min_size=k is safe or it is not. If it is not, it should never be used anywhere in the documentation.

I hope I have marked my opinions and hypotheses clearly and that the links are helpful. If anyone could shed some light on why exactly min_size=k+1 is important, I would be grateful.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
