Re: All client writes block when 2 of 3 OSDs down

On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> Greg,
> I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario.
>
> 1. Cluster up and running with 3 mons (quorum of 3), all fine.
>
> 2. One node (and mon) is down, quorum of 2, clients still connecting fine.
>
> 3. Two nodes (and 2 mons) are down; shouldn't that be a quorum of 1 now, with the client still able to connect?

No. The monitors can't tell the difference between dead monitors, and
monitors they can't reach over the network. So they say "there are
three monitors in my map; therefore it requires two to make any
change". That's the case regardless of whether all of them are
running, or only one.
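
As a quick illustration (assuming you still have a working quorum to answer commands at all), the monmap that drives this requirement is visible from the CLI:

$ ceph mon dump        # every monitor in the map, whether it is reachable or not
$ ceph quorum_status   # which subset of them currently forms the quorum

Both of these go through the monitors themselves, so with 2 of 3 mons down they will hang just like "ceph health" does.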

>
> A cluster with a single monitor is able to form a quorum and should work fine. So why not in the case of point 3?
> If this is the way Paxos works, should we say that a cluster with, say, 3 monitors can tolerate only one mon failure?

Yes, that is the case.
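
To spell out the arithmetic (a plain-shell sketch, nothing Ceph-specific): a quorum is floor(n/2) + 1 monitors, so the failures tolerated are n minus that.

$ for n in 1 2 3 4 5 6 7; do q=$((n/2 + 1)); echo "mons=$n quorum=$q tolerates=$((n - q))"; done
mons=1 quorum=1 tolerates=0
mons=2 quorum=2 tolerates=0
mons=3 quorum=2 tolerates=1
mons=4 quorum=3 tolerates=1
mons=5 quorum=3 tolerates=2
mons=6 quorum=4 tolerates=2
mons=7 quorum=4 tolerates=3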

>
> Let me know if I am missing a point here.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
> Sent: Thursday, March 26, 2015 3:41 PM
> To: Somnath Roy
> Cc: Lee Revell; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  All client writes block when 2 of 3 OSDs down
>
> On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>> Got most of it, thanks!
>> But I still don't get why, when the second node is down, the client is not able to connect with a single monitor left in the cluster.
>> 1 monitor can form a quorum and should be sufficient for the cluster to run.
>
> The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them.
>
> If you want to get down into the math, the Paxos papers do all the proofs. Or you can look at the CAP theorem and the tradeoff between consistency and availability. The monitors are a Paxos cluster, and Ceph is a 100% consistent system.
> -Greg
>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
>> Sent: Thursday, March 26, 2015 3:29 PM
>> To: Somnath Roy
>> Cc: Lee Revell; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  All client writes block when 2 of 3 OSDs
>> down
>>
>> On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>> Greg,
>>> A couple of dumb questions, maybe.
>>>
>>> 1. As you can see, the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but 1 can, so why is the client not able to connect with 1 monitor (which is, I guess, what happens after taking 2 nodes down)?
>>
>> A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if the total membership is either 2 or 3.
>> (As long as those two agree on every action, it cannot be lost.)
>>
>> We don't *recommend* configuring systems with an even number of
>> monitors, because it increases the number of total possible failures
>> without increasing the number of failures that can be tolerated. (3
>> monitors can tolerate 1 failure, and so can 4. Same for 5 and 6, 7 and
>> 8, etc.)
>>
>>>
>>> 2. Also, my understanding is that while IO is going on there is *no* monitor interaction on that path, so why would client IO stop because the monitor quorum is not there? If min_size = 1 is properly set, it should be able to serve IO as long as 1 OSD (node) is up, shouldn't it?
>>
>> Well, the remaining OSD won't be able to process IO because it's lost
>> its peers, and it can't reach any monitors to do updates or get new
>> maps. (Monitors which are not in quorum will not allow clients to
>> connect.)
>> The clients will eventually stop issuing IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered.
>>
>> In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you.
>> -Greg
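
If you want to watch that search happen rather than just wait, here is a sketch; the options below are standard ceph CLI config overrides, but treat their exact spelling and availability on your version as an assumption:

$ ceph --connect-timeout 10 health        # give up after 10 seconds instead of blocking indefinitely
$ ceph --debug-monc 10 --debug-ms 1 -s    # more verbose monclient/messenger output while it hunts for a mon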
>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>> Of Gregory Farnum
>>> Sent: Thursday, March 26, 2015 2:40 PM
>>> To: Lee Revell
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re:  All client writes block when 2 of 3 OSDs
>>> down
>>>
>>> On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell <rlrevell@xxxxxxxxx> wrote:
>>>> On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>
>>>>> Has the OSD actually been detected as down yet?
>>>>>
>>>>
>>>> I believe it has; however, I can't directly check because "ceph health"
>>>> starts to hang when I down the second node.
>>>
>>> Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues.
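
One thing that does still work without a quorum is the admin socket on a surviving monitor, because it never goes through the cluster. A sketch, run on the node hosting that mon; the mon name comes from your hostnames and the socket path from the defaults, so adjust as needed:

$ ceph daemon mon.ceph-node-1 mon_status
# or, equivalently, by socket path:
$ ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status

That reports the monitor's own state (probing/electing/leader/peon) even while "ceph health" hangs.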
>>>
>>>>
>>>>>
>>>>> You'll also need to set that min size on your existing pools ("ceph
>>>>> osd pool <pool> set min_size 1" or similar) to change their
>>>>> behavior; the config option only takes effect for newly-created
>>>>> pools. (Thus the
>>>>> "default".)
>>>>
>>>>
>>>> I've done this, however the behavior is the same:
>>>>
>>>> $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done
>>>> set pool 0 min_size to 1
>>>> set pool 1 min_size to 1
>>>> set pool 2 min_size to 1
>>>> set pool 3 min_size to 1
>>>> set pool 4 min_size to 1
>>>> set pool 5 min_size to 1
>>>> set pool 6 min_size to 1
>>>> set pool 7 min_size to 1
>>>>
>>>> $ ceph -w
>>>>     cluster db460aa2-5129-4aaa-8b2e-43eac727124e
>>>>      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
>>>>      osdmap e362: 3 osds: 2 up, 2 in
>>>>       pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
>>>>             25329 MB used, 12649 MB / 40059 MB avail
>>>>                  840 active+clean
>>>>
>>>> 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail
>>>> 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s
>>>> 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s
>>>> 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s
>>>>
>>>> <this is where I kill the second OSD>
>>>>
>>>> 2015-03-26 17:26:26.778461 7f4ebeffd700  0 monclient: hunting for new mon
>>>> 2015-03-26 17:26:30.701099 7f4ec45f5700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault
>>>> 2015-03-26 17:26:42.701154 7f4ec44f4700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault
>>>>
>>>> And all writes block until I bring back an OSD.
>>>>
>>>> Lee
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



