On 7/25/19 9:56 AM, Xiaoxi Chen wrote:
> The real impact of changing min_size to 1 is not the possibility of
> losing data, but how much data will be lost. In both cases you will
> lose some data; the question is just how much.
>
> Let PG X -> (OSD A, B, C), min_size = 2, size = 3.
> In your description:
>
> T1: OSD A goes down due to the upgrade. The PG is now degraded with
> (B, C); note that it is still active, so there is data written only
> to B and C.
>
> T2: B goes down due to a disk failure. C is now the only OSD holding
> the data written between [T1, T2]. The failure rate of C in this
> situation is independent of whether we continue writing to C.
>
> If C fails at T3:
> without changing min_size, you lose the data from [T1, T2] together
> with the data that was unavailable during [T2, T3];
> with min_size = 1, you lose the data from [T1, T3].
>
> But agreed, it is a trade-off, depending on how confident you are
> that you won't have two drive failures in a row within a 15-minute
> upgrade window...
>

Yes. So with min_size=1 you would need only a single disk failure (B in
your example) to lose data. With min_size=2 you would need both B and C
to fail within that window, or while they are recovering A. That is an
even smaller chance than a single disk failure.
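
As a rough back-of-the-envelope sketch (assuming independent failures
with the same per-OSD failure probability p during the window; the
numbers are illustrative only, not a real reliability model):

# Mirrors the argument above: with min_size=1 a single additional disk
# failure is enough to lose the newest writes, while with min_size=2
# both remaining replicas (B and C) have to fail.

def p_loss(p: float, min_size: int) -> float:
    """Approximate chance of losing data while OSD A is down for maintenance."""
    if min_size == 1:
        return p          # one more failure (the last replica) is enough
    return p * p          # both B and C have to fail

for p in (0.01, 0.001):
    print(f"p={p}: min_size=1 -> {p_loss(p, 1):.6f}, "
          f"min_size=2 -> {p_loss(p, 2):.8f}")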
Wido

> Wido den Hollander <wido@xxxxxxxx> wrote on Thursday, July 25, 2019 at 3:39 PM:
>
>     On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
>     > We have hit this case in production, but my solution is to change
>     > min_size = 1 immediately so that the PG goes back to active right
>     > away.
>     >
>     > It somewhat trades off reliability (durability) against availability
>     > during that 15-minute window, but if you are certain that one of the
>     > two "failures" is due to a recoverable issue, it is worth doing.
>     >
>
>     That's actually dangerous imho.
>
>     Because while you have min_size=1 set, you will be mutating data on
>     that single disk/OSD.
>
>     If the other two OSDs come back, recovery will start. Now IF that
>     single disk/OSD dies while performing the recovery, you have lost
>     data.
>
>     The PG (or PGs) becomes inactive and you either need to perform data
>     recovery on the failed disk or revert back to the last state.
>
>     I can't take that risk in this situation.
>
>     Wido
>
>     > My 0.02
>     >
>     > Wido den Hollander <wido@xxxxxxxx> wrote on Thursday, July 25, 2019 at 3:48 AM:
>     >
>     >     On 7/24/19 9:35 PM, Mark Schouten wrote:
>     >     > I'd say the cure is worse than the issue you're trying to fix,
>     >     > but that's my two cents.
>     >     >
>     >
>     >     I'm not completely happy with it either. Yes, the price goes up
>     >     and latency increases as well.
>     >
>     >     Right now I'm just trying to find a clever solution to this. It's
>     >     a 2k OSD cluster and the likelihood of a host or OSD crashing is
>     >     reasonable while you are performing maintenance on a different
>     >     host.
>     >
>     >     All kinds of things have crossed my mind, and using size=4 is one
>     >     of them.
>     >
>     >     Wido
>     >
>     >     > Mark Schouten
>     >     >
>     >     >> On 24 Jul 2019, at 21:22, Wido den Hollander <wido@xxxxxxxx>
>     >     >> wrote:
>     >     >>
>     >     >> Hi,
>     >     >>
>     >     >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>     >     >>
>     >     >> The reason I'm asking is that a customer of mine asked me for a
>     >     >> solution to prevent a situation which occurred:
>     >     >>
>     >     >> A cluster running with size=3 and replication over different
>     >     >> racks was being upgraded from 13.2.5 to 13.2.6.
>     >     >>
>     >     >> During the upgrade, which involved patching the OS as well,
>     >     >> they rebooted one of the nodes. During that reboot a node in a
>     >     >> different rack suddenly rebooted as well. It was unclear why
>     >     >> this happened, but the node was gone.
>     >     >>
>     >     >> While the upgraded node was rebooting and the other node had
>     >     >> crashed, about 120 PGs were inactive due to min_size=2.
>     >     >>
>     >     >> Waiting for the nodes to come back and for recovery to finish,
>     >     >> it took about 15 minutes before all VMs running inside
>     >     >> OpenStack were back again.
>     >     >>
>     >     >> While you are upgrading or performing any maintenance with
>     >     >> size=3, you can't tolerate the failure of a node, as that will
>     >     >> cause PGs to go inactive.
>     >     >>
>     >     >> This made me think about using size=4 and min_size=2 to
>     >     >> prevent this situation.
>     >     >>
>     >     >> This obviously has implications for write latency and cost,
>     >     >> but it would prevent such a situation.
>     >     >>
>     >     >> Is anybody here running a Ceph cluster with size=4 and
>     >     >> min_size=2 for this reason?
>     >     >>
>     >     >> Thank you,
>     >     >>
>     >     >> Wido
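
To put the size=4 / min_size=2 idea from the original question in
concrete terms, a small sketch (illustrative only; it just encodes the
rule that a replicated PG stays active as long as at least min_size of
its replicas are up):

# Illustrative only: a replicated PG keeps serving I/O as long as at
# least min_size of its `size` replicas are up, so the number of
# simultaneous replica outages it can ride out is size - min_size.

def tolerated_outages(size: int, min_size: int) -> int:
    """Replica outages a PG survives while remaining active."""
    return size - min_size

for size, min_size in ((3, 2), (4, 2)):
    print(f"size={size}, min_size={min_size}: stays active with up to "
          f"{tolerated_outages(size, min_size)} replica(s) down")

# size=3, min_size=2: one node in maintenance already uses up the margin,
# so any additional crash makes PGs go inactive.
# size=4, min_size=2: one planned reboot plus one unexpected crash is
# still survivable, at the cost of extra capacity and write latency.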