On 20/05/2019 19:37, Frank Schilder wrote:
This is an issue that comes up every now and then (for example: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg50415.html) and I consider it a very serious one (I will give an example below). A statement like "min_size = k is unsafe and should never be set" deserves a bit more explanation, because ceph is the only storage system I know of for which k+m redundancy does *not* mean "you can lose up to m disks and still have read-write access". If this is really true then, assuming the same redundancy level, losing service (client access) is significantly more likely with ceph than with other storage systems. And this has an impact on design and storage pricing.

However, some help seems to be on the way, and a feature update that is, in my opinion, utterly important seems almost finished: https://github.com/ceph/ceph/pull/17619 . It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that makes it problematic to use min_size=k. I couldn't find anything better than vague explanations. Quote from the thread above: "Recovery on EC pools requires min_size rather than k shards at this time. There were reasons; they weren't great."

This is actually a situation I was in. I once lost 2 failure domains simultaneously on an 8+2 EC pool and was really surprised that recovery stopped after some time, with the worst degraded PGs remaining unfixed. I discovered that min_size was 9 (instead of 8), and "ceph health detail" recommended reducing min_size. Before doing so, I searched the web (I mean, why the default k+1? Come on, there must be a reason.) and found some vague hints about problems with min_size=k during rebuild. This is a really bad corner to be in. A lot of PGs were already critically degraded and the only way forward was to make a bad situation worse, because reducing min_size would immediately enable client I/O in addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does have some rare issues and these do not seem to disappear. (I hope I'm wrong though.) Hence, if min_size=k remains problematic, the recommendation should be "never use m=1" instead of "never use min_size=k". In other words, instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like to have secure write access for n disk losses, then one needs m>=n+1.

If this issue remains, in my opinion this should be taken up in the best practices section. In particular, the documentation should not use examples with m=1; this gives the wrong impression. Either min_size=k is safe or it is not. If it is not, it should never be used anywhere in the documentation.

I hope I marked my opinions and hypotheses clearly and that the links are helpful. If anyone could shed some light on why exactly min_size=k+1 is important, I would be grateful.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
I think we should separate the issue of recovery versus that of allowing client writes.
recovery: logically, if you have k chunks, recovery should be active, since you still have no data loss and can re-generate the remaining m chunks. For reasons not clear to many, recovery activation currently requires a minimum of min_size chunks rather than k. The patch you mention fixes this issue and ties the recovery process to k rather than min_size.
client writes safety: this is controlled by min_size, whose default is k+1. The reason is that with m failures you still have no data loss, but you are in a critical window with no redundancy left: any new writes are stored on only k chunks, so one additional failure would lose the data written during this window. You can well argue that if you are unfortunate enough to reach an m+1 failure situation, you are in deep trouble anyway and should worry much more about the bulk of the data already stored than about the new data. The argument I see here (maybe there is something else) is that there is still a significant probability that some of the m failed hosts/disks can be recovered, so even if at some point you had m+1 failures, not all hope is lost for the existing data, whereas data loss is guaranteed for the new data.
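To make the two thresholds concrete, here is a minimal Python sketch of the gating described above. It is only a toy model of the rules as stated in this thread, not Ceph code: the names `PGState`, `pg_state`, `max_writable_losses` and the flag `recovery_needs_min_size` are made up for illustration, and it assumes recovery and client I/O each depend only on the number of surviving shards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PGState:
    data_intact: bool      # at least k shards survive, so data can be rebuilt
    recovery_active: bool  # recovery I/O is allowed to run
    client_io: bool        # client I/O is allowed


def pg_state(k, m, failed, min_size, recovery_needs_min_size=True):
    """Toy model of a k+m EC placement group with `failed` shards lost.

    recovery_needs_min_size=True models the current behaviour described
    in the thread (recovery gated by min_size); False models the patched
    behaviour (recovery gated by k).
    """
    if not k <= min_size <= k + m:
        # Model assumption: only min_size values in [k, k+m] are meaningful.
        raise ValueError("min_size must lie between k and k+m")
    available = k + m - failed
    recovery_threshold = min_size if recovery_needs_min_size else k
    return PGState(
        data_intact=available >= k,
        recovery_active=available >= recovery_threshold,
        client_io=available >= min_size,
    )


def max_writable_losses(k, m, min_size):
    """Number of shard losses after which client I/O is still allowed."""
    return k + m - min_size


# The 8+2 pool with two failure domains lost, default min_size=9:
print(pg_state(8, 2, failed=2, min_size=9))
# Same situation with recovery tied to k instead of min_size:
print(pg_state(8, 2, failed=2, min_size=9, recovery_needs_min_size=False))
# Workaround of lowering min_size to k: client I/O resumes as well.
print(pg_state(8, 2, failed=2, min_size=8))
```

With min_size=k+1 the model gives `max_writable_losses(k, m, k + 1) == m - 1`, which is the "m>=n+1" rule from the original message: a 2+1 profile tolerates no loss with writes enabled, a 4+2 profile tolerates one.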
Because recovery is currently tied to min_size (the first point), if you are in a situation where you have m failures, the recommendation is to temporarily lower min_size to k and, once recovery completes and the cluster is healthy again, to quickly set it back to k+1. This is also a bit questionable, since you can argue that when the cluster is healthy, min_size has no impact whether it is set to k or k+1.
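As a rough sketch of that workaround, the helper below only builds the two `ceph osd pool set <pool> min_size <n>` invocations and does not run anything. The function name `min_size_commands` and the pool name `ecpool` are hypothetical; check the actual pool name and k for your EC profile before running anything like this.

```python
def min_size_commands(pool, k):
    """Return (lower, restore) argument lists for the temporary workaround.

    lower:   set min_size to k so recovery can proceed with only k shards
    restore: set min_size back to k+1 once the cluster is healthy again
    """
    lower = ["ceph", "osd", "pool", "set", pool, "min_size", str(k)]
    restore = ["ceph", "osd", "pool", "set", pool, "min_size", str(k + 1)]
    return lower, restore


# Hypothetical 8+2 pool called "ecpool"
lower, restore = min_size_commands("ecpool", 8)
print(" ".join(lower))
print(" ".join(restore))
```

Keep in mind that the first command also re-enables client I/O on the degraded PGs, which is exactly the concern raised in the original message.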
/Maged