We had a similar situation, with one machine resetting when we restarted another one. I'm not 100% sure why it happened, but I would bet it was related to several thousand client connections migrating from the machine we restarted to another one. We have a similar setup to yours, and if you check the number of connections per machine you will see they are really high. When you restart a machine, all those connections have to be migrated to other machines, and I would say that may cause an error at the kernel level.

My suggestion (it is what we do nowadays) would be to slowly stop all the OSDs on the machine you are about to restart before you restart it.
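A minimal sketch of the idea, assuming systemd-managed OSDs (the OSD ids below are hypothetical, "ceph osd tree" shows which ids live on which host, and the pacing is crude, so adapt it to your deployment):

    import subprocess
    import time

    # OSD ids living on the host we are about to reboot (hypothetical
    # values; check "ceph osd tree" for the real ones).
    OSD_IDS = [12, 13, 14, 15]

    def run(*cmd):
        """Run a command and fail loudly if it fails."""
        subprocess.run(cmd, check=True)

    # Tell Ceph not to mark OSDs out and rebalance while they are down.
    run("ceph", "osd", "set", "noout")

    for osd_id in OSD_IDS:
        # Stop one OSD at a time so the client connections it holds
        # migrate gradually instead of all at once when the host reboots.
        run("systemctl", "stop", f"ceph-osd@{osd_id}")
        time.sleep(60)  # crude pacing; give peering and reconnects time to settle

    # ...reboot the host; once it is back and the OSDs have rejoined:
    # run("ceph", "osd", "unset", "noout")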
Cheers!

On 25 Jul 2019, at 11:01, Wido den Hollander <wido@xxxxxxxx> wrote:

>
>> On 7/25/19 9:56 AM, Xiaoxi Chen wrote:
>> The real impact of changing min_size to 1 is not about the possibility of losing data, but about how much data you will lose. In both cases you will lose some data; the question is just how much.
>>
>> Let PG X -> (OSD A, B, C), with min_size = 2 and size = 3. In your description:
>>
>> T1: OSD A goes down due to the upgrade. The PG is now degraded with (B, C); note that the PG is still active, so new data is written only to B and C.
>>
>> T2: B goes down due to a disk failure. C is the only one holding the data written between [T1, T2]. The failure rate of C in this situation is independent of whether we continue writing to C.
>>
>> If C fails at T3:
>> without changing min_size, you lose the data from [T1, T2], and the data from [T2, T3] is unavailable;
>> with min_size = 1, you lose the data from [T1, T3].
>>
>> But agreed, it is a tradeoff, depending on how confident you are that you won't have two drive failures in a row within a 15-minute upgrade window...
>
> Yes. So with min_size=1 you would need only a single disk failure (B in your example) to lose data.
>
> If min_size=2 you would need both B and C to fail within that window, or while they are recovering A.
>
> That is an even smaller chance than a single disk failure.
>
> Wido
>
>> Wido den Hollander <wido@xxxxxxxx> wrote on Thu, 25 Jul 2019 at 15:39:
>>
>>> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
>>> We had hit this case in production, but my solution would be to change min_size to 1 immediately so that the PG goes back to active right away.
>>>
>>> It somewhat trades off reliability (durability) against availability during that 15-minute window, but if you are certain that one of the two "failures" is due to a recoverable issue, it is worth doing.
>>
>> That's actually dangerous imho.
>>
>> Because while min_size=1 is set, you will be mutating data on that single disk/OSD.
>>
>> If the other two OSDs come back, recovery will start. Now, if that single disk/OSD dies while performing the recovery, you have lost data.
>>
>> The PG (or PGs) becomes inactive, and you either need to perform data recovery on the failed disk or revert back to the last state.
>>
>> I can't take that risk in this situation.
>>
>> Wido
>>
>>> My 0.02
>>>
>>> Wido den Hollander <wido@xxxxxxxx> wrote on Thu, 25 Jul 2019 at 03:48:
>>>
>>> On 7/24/19 9:35 PM, Mark Schouten wrote:
>>> > I'd say the cure is worse than the issue you're trying to fix, but that's my two cents.
>>>
>>> I'm not completely happy with it either. Yes, the price goes up and latency increases as well.
>>>
>>> Right now I'm just trying to find a clever solution to this. It's a 2k-OSD cluster, and the likelihood of a host or OSD crashing is reasonably high while you are performing maintenance on a different host.
>>>
>>> All kinds of things have crossed my mind, and using size=4 is one of them.
>>>
>>> Wido
>>>
>>> > Mark Schouten
>>> >
>>> >> On 24 Jul 2019 at 21:22, Wido den Hollander <wido@xxxxxxxx> wrote the following:
>>> >>
>>> >> Hi,
>>> >>
>>> >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>>> >>
>>> >> The reason I'm asking is that a customer of mine asked me for a solution to prevent a situation which occurred:
>>> >>
>>> >> A cluster running with size=3 and replication over different racks was being upgraded from 13.2.5 to 13.2.6.
>>> >>
>>> >> During the upgrade, which involved patching the OS as well, they rebooted one of the nodes. During that reboot, a node in a different rack suddenly rebooted as well. It was unclear why this happened, but the node was gone.
>>> >>
>>> >> While the upgraded node was rebooting and the other node was down, about 120 PGs were inactive due to min_size=2.
>>> >>
>>> >> Between waiting for the nodes to come back and for recovery to finish, it took about 15 minutes before all VMs running inside OpenStack were back again.
>>> >>
>>> >> While you are upgrading or performing any maintenance with size=3, you cannot tolerate the failure of a node, as that will cause PGs to go inactive.
>>> >>
>>> >> This made me think about using size=4 and min_size=2 to prevent this situation.
>>> >>
>>> >> This obviously has implications for write latency and cost, but it would prevent such a situation.
>>> >>
>>> >> Is anybody here running a Ceph cluster with size=4 and min_size=2 for this reason?
>>> >>
>>> >> Thank you,
>>> >>
>>> >> Wido
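P.S. A back-of-the-envelope way to see why the size=4 idea discussed above helps during maintenance; this is a minimal sketch with a made-up per-OSD failure probability (the 0.001 figure is purely illustrative, not a measured rate):

    # size=3, min_size=2: with one replica down for maintenance, a single
    # extra failure among the 2 remaining replicas drops the PG below
    # min_size and makes it inactive.
    p = 0.001  # assumed probability of one OSD failing during the window
    p_inactive_size3 = 1 - (1 - p) ** 2

    # size=4, min_size=2: with one replica down for maintenance, 3 replicas
    # remain, so it takes 2 more failures to drop below min_size.
    p_inactive_size4 = 3 * p ** 2 * (1 - p) + p ** 3

    print(f"size=3: P(PG inactive) ~ {p_inactive_size3:.2e}")  # ~2.00e-03
    print(f"size=4: P(PG inactive) ~ {p_inactive_size4:.2e}")  # ~3.00e-06

With these illustrative numbers, size=4 is almost three orders of magnitude less likely to leave a PG inactive during a maintenance window, which is the upside being weighed against the extra write latency and capacity cost.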