Re: Enclosure power failure pausing client IO till all connected hosts up

Gregory Farnum <greg@xxxxxxxxxxx> · Thu, 9 Jul 2015 12:47:49 +0100



Your first point of troubleshooting is pretty much always to look at
"ceph -s" and see what it says. In this case it's probably telling you
that some PGs are down, and then you can look at why (but perhaps it's
something else).
-Greg
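
[Editor's illustration, not part of Greg's message] The check suggested above can also be scripted. This is a minimal sketch, assuming the JSON from "ceph -s --format json" carries a pgmap.pgs_by_state list of {state_name, count} entries; verify that layout against your Ceph release. The function name and the sample document are hypothetical, for illustration only.

```python
import json


def inactive_pg_states(status_json):
    """Return {state_name: count} for PG states that lack the 'active' flag.

    Assumes the status document has the shape
    {"pgmap": {"pgs_by_state": [{"state_name": ..., "count": ...}, ...]}}.
    """
    status = json.loads(status_json)
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    return {
        entry["state_name"]: entry["count"]
        for entry in states
        # A PG state is a '+'-joined list of flags, e.g. "active+clean".
        if "active" not in entry["state_name"].split("+")
    }


# Hypothetical sample, not output captured from the cluster in this thread.
SAMPLE = json.dumps({
    "pgmap": {
        "pgs_by_state": [
            {"state_name": "active+clean", "count": 96},
            {"state_name": "active+degraded", "count": 20},
            {"state_name": "down+peering", "count": 12},
        ]
    }
})

if __name__ == "__main__":
    print(inactive_pg_states(SAMPLE))
```

PGs that are not active do not serve client IO, so any state this returns (for example down or peering) is the place to start looking.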

On Thu, Jul 9, 2015 at 12:22 PM, Mallikarjun Biradar
<mallikarjuna.biradar@xxxxxxxxx> wrote:
> Yeah. All OSDs on those hosts were marked down, and the monitors were still up.
>
> On Thu, Jul 9, 2015 at 4:51 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> And are the OSDs getting marked down during the outage?
>> Are all the MONs still up?
>>
>> Jan
>>
>>> On 09 Jul 2015, at 13:20, Mallikarjun Biradar <mallikarjuna.biradar@xxxxxxxxx> wrote:
>>>
>>> I have size=2 & min_size=1, and IO is paused till all hosts come back.
>>>
>>> On Thu, Jul 9, 2015 at 4:41 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>> What is the min_size setting for the pool? If you have size=2 and min_size=2, then all your data is safe when one replica is down, but the IO is paused. If you want to continue IO you need to set min_size=1.
>>>> But be aware that a single failure after that causes you to lose all the data, you’d have to revert to the other replica if it comes up and works - no idea how that works in ceph but will likely be a PITA to do.
>>>>
>>>> Jan
>>>>
>>>>> On 09 Jul 2015, at 12:42, Mallikarjun Biradar <mallikarjuna.biradar@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Setup details:
>>>>> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
>>>>> Failure domain is chassis (enclosure) level. Replication count is 2.
>>>>> Each host is allotted 4 drives.
>>>>>
>>>>> I have active client IO running on the cluster (random write profile with
>>>>> 4M block size and a queue depth of 64).
>>>>>
>>>>> One of the enclosures had a power loss, so all OSDs on the hosts
>>>>> connected to that enclosure went down, as expected.
>>>>>
>>>>> But client IO got paused. After some time, the enclosure and the hosts
>>>>> connected to it came back up,
>>>>> and all OSDs on those hosts came up.
>>>>>
>>>>> Until then, the cluster was not serving IO. Once all hosts and OSDs
>>>>> belonging to that enclosure came up, client IO resumed.
>>>>>
>>>>>
>>>>> Can anybody help me understand why the cluster was not serving IO during
>>>>> the enclosure failure? Or is it a bug?
>>>>>
>>>>> -Thanks & regards,
>>>>> Mallikarjun Biradar
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com