I understand better now, thanks to Frank's & Paul's messages.
Paul, when min_size=k, is it the same problem as with a replicated pool with size=2 & min_size=1?
On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size
is... unfortunate :(
The reason why min_size = k is bad is just that it means
you are accepting writes without guaranteeing durability while
you are in a degraded state.
A durable storage system should never tell a client "okay, I've written your data" if losing a single disk leads to data loss.
Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems during rebuild (with 1 or 2 disk failures for RAID 5/6 respectively), but that doesn't mean it's a good idea.
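To make that durability argument concrete, here is a small Python sketch (an illustration only, not Ceph code; the function and parameter names are mine) of how much margin a freshly acknowledged write has, given k, m, min_size and the number of shards currently up:

# Illustration only: how many further shard losses newly written data survives.
def write_safety_margin(k, m, min_size, shards_up):
    """Return the number of additional shard losses fresh writes can survive,
    or None if the PG is below min_size and refuses client I/O."""
    assert k <= min_size <= k + m and shards_up <= k + m
    if shards_up < min_size:
        return None           # PG inactive: no client writes accepted
    return shards_up - k      # shards that may still fail before data loss

# 8+2 pool, two shards already lost:
print(write_safety_margin(k=8, m=2, min_size=8, shards_up=8))  # 0 -> next failure loses the new data
print(write_safety_margin(k=8, m=2, min_size=9, shards_up=8))  # None -> writes blocked instead

In other words, min_size=k+1 trades availability (writes stop one failure earlier) for a guaranteed margin of at least one redundant shard for everything the cluster acknowledges.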
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, May 20, 2019 at 7:37 PM Frank Schilder <frans@xxxxxx> wrote:
This is an issue that comes up every now and then (for example:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg50415.html)
and I would consider it a very serious one (I will give an example below).
A statement like "min_size = k is unsafe and should never be set" deserves
a bit more explanation, because Ceph is the only storage system I know of
for which k+m redundancy does *not* mean "you can lose up to m disks and
still have read-write access". If this is really true then, assuming the
same redundancy level, losing service (client access) is significantly
more likely with Ceph than with other storage systems. And this has an
impact on design and storage pricing.
However, some help seems to be on the way, and an, in my opinion, extremely
important feature update seems almost finished:
https://github.com/ceph/ceph/pull/17619. It will implement the following:
- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- the recommended setting is min_size=k+1 (this might be wrong)
This is pretty good and much better than the current behaviour
(see below). This pull request also offers useful further
information.
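If I read the pull request correctly, the old and new gating can be summarised
in a tiny Python model (my own simplification with made-up parameter names,
not actual Ceph code):

# Rough model of when a PG allows recovery and client I/O, before and after
# the change in the pull request (my simplification, not Ceph logic).
def pg_io(k, min_size, shards_up, recovery_needs_only_k):
    recovery_ok = shards_up >= (k if recovery_needs_only_k else min_size)
    client_io_ok = shards_up >= min_size
    return recovery_ok, client_io_ok

# 8+2 pool with min_size=9 and only 8 shards left (the situation described below):
print(pg_io(k=8, min_size=9, shards_up=8, recovery_needs_only_k=False))  # (False, False): stuck
print(pg_io(k=8, min_size=9, shards_up=8, recovery_needs_only_k=True))   # (True, False): recovers, writes blocked

The important difference is the first value: with the old behaviour the cluster
can neither repair itself nor serve clients once fewer than min_size shards remain.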
Apparently, there is some kind of rare issue with erasure coding in Ceph that
makes it problematic to use min_size=k. I couldn't find anything better than
vague explanations. Quote from the thread above: "Recovery on EC pools
requires min_size rather than k shards at this time. There were reasons;
they weren't great."
This is actually a situation I was in. I once lost 2 failure domains
simultaneously on an 8+2 EC pool and was really surprised that recovery
stopped after some time with the worst degraded PGs remaining unfixed. I
discovered that min_size was 9 (instead of 8) and "ceph health detail"
recommended reducing min_size. Before doing so, I searched the web (I mean,
why the default k+1? Come on, there must be a reason.) and found some vague
hints about problems with min_size=k during rebuild.
This is a really bad corner to be in. A lot of PGs were already critically
degraded and the only way forward was to make a bad situation worse, because
reducing min_size would immediately enable client I/O in addition to
recovery I/O.
It looks like the default of min_size=k+1 will stay, because min_size=k does
have some rare issues and these seem unlikely to disappear. (I hope I'm wrong
though.) Hence, if min_size=k remains problematic, the recommendation should
be "never use m=1" instead of "never use min_size=k". In other words, instead
of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like
to have secure write access for n disk losses, then m>=n+1.
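A quick sanity check of that rule of thumb in Python (my own arithmetic,
assuming the recommended min_size=k+1):

# With min_size = k+1, a PG stays writable as long as at most
# (k+m) - (k+1) = m-1 shards are lost.
def writable_losses(k, m):
    min_size = k + 1
    return (k + m) - min_size   # = m - 1

for k, m in [(2, 1), (4, 2), (8, 2)]:
    print(f"{k}+{m}: client I/O survives {writable_losses(k, m)} lost shard(s)")
# 2+1: 0 -> any single loss blocks writes; 4+2 and 8+2: 1 loss tolerated

Which is exactly the m>=n+1 condition above: to keep writing through n disk
losses you need m-1>=n.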
If this issue remains, in my opinion this should be taken up in the best
practices section. In particular, the documentation should not use examples
with m=1, as this gives the wrong impression. Either min_size=k is safe or it
is not. If it is not, it should never be used anywhere in the documentation.
I hope I marked my opinions and hypotheses clearly and that the links are
helpful. If anyone could shed some light on why exactly min_size=k+1 is
important, I would be grateful.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14