Sorry for the delay in replying; I was rerunning this failure a few times so I could summarise the results.

Tony,

Setup details:
Two storage boxes (each with 12 drives), each connected to 4 hosts. Each host owns 3 disks from its storage box, for a total of 24 OSDs. Failure domain is at the chassis level.

OSD tree:

ID   WEIGHT   TYPE NAME                 UP/DOWN  REWEIGHT
-1   164.2    root default
-7    82.08       chassis chassis1
-2    20.52           host host-1
 0     6.84              osd.0          up       1
 1     6.84              osd.1          up       1
 2     6.84              osd.2          up       1
-3    20.52           host host-2
 3     6.84              osd.3          up       1
 4     6.84              osd.4          up       1
 5     6.84              osd.5          up       1
-4    20.52           host host-3
 6     6.84              osd.6          up       1
 7     6.84              osd.7          up       1
 8     6.84              osd.8          up       1
-5    20.52           host host-4
 9     6.84              osd.9          up       1
10     6.84              osd.10         up       1
11     6.84              osd.11         up       1
-8    82.08       chassis chassis2
-6    20.52           host host-5
12     6.84              osd.12         up       1
13     6.84              osd.13         up       1
14     6.84              osd.14         up       1
-9    20.52           host host-6
15     6.84              osd.15         up       1
16     6.84              osd.16         up       1
17     6.84              osd.17         up       1
-10   20.52           host host-7
18     6.84              osd.18         up       1
19     6.84              osd.19         up       1
20     6.84              osd.20         up       1
-11   20.52           host host-8
21     6.84              osd.21         up       1
22     6.84              osd.22         up       1
23     6.84              osd.23         up       1

The cluster had ~30TB of data, and client IO was in progress throughout. After chassis1 underwent a power cycle:
1> All OSDs under chassis2 were intact, up & running.
2> All OSDs under chassis1 were down, as expected. But client IO was paused until all the hosts/OSDs under chassis1 came back up.

This issue was observed twice out of 5 attempts. Size is 2 & min_size is 1.

-Thanks,
Mallikarjun

On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris <nethfel@xxxxxxxxx> wrote:
> Sounds to me like you've put yourself at too much risk - *if* I'm reading
> your message right about your configuration, you have multiple hosts
> accessing OSDs that are stored on a single shared box - so if that single
> shared box (a single point of failure for multiple nodes) goes down, it's
> possible for multiple replicas to disappear at the same time, which could
> halt the operation of your cluster if the masters and the replicas are both
> on OSDs within that single shared storage system...
>
> On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
> <mallikarjuna.biradar@xxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> Setup details:
>> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
>> Failure domain is Chassis (enclosure) level. Replication count is 2.
>> Each host is allotted 4 drives.
>>
>> I have active client IO running on the cluster (random write profile
>> with 4M block size & 64 queue depth).
>>
>> One of the enclosures had a power loss, so all OSDs on the hosts
>> connected to that enclosure went down, as expected.
>>
>> But client IO got paused. After some time the enclosure & the hosts
>> connected to it came back up, and all OSDs on those hosts came up.
>>
>> Until then, the cluster was not serving IO. Once all hosts & OSDs
>> pertaining to that enclosure came up, client IO resumed.
>>
>> Can anybody help me understand why the cluster was not serving IO
>> during the enclosure failure? Or is it a bug?
>>
>> -Thanks & regards,
>> Mallikarjun Biradar
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
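
For reference, a minimal sketch of the pool and CRUSH settings this thread describes (replication size 2, min_size 1, replicas separated at chassis level). The pool name "rbd" and rule name "replicated_chassis" are assumptions for illustration; the thread never names the pool or rule:

    # Pool settings from the thread (pool name "rbd" is an assumed placeholder)
    ceph osd pool set rbd size 2          # 2 replicas per object
    ceph osd pool set rbd min_size 1      # keep serving IO with 1 replica left
    ceph osd pool get rbd min_size        # verify what is actually applied

    # CRUSH rule placing each replica in a different chassis, matching the
    # "failure domain is at chassis level" setup (rule name is assumed).
    # Decompile the current map with:
    #   ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt
    rule replicated_chassis {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }
    # Recompile and inject:
    #   crushtool -c map.txt -o map.new && ceph osd setcrushmap -i map.new

With these settings, losing one whole chassis should leave every PG with one surviving replica, so with min_size 1 the cluster would be expected to keep serving IO. During the pause, `ceph health detail` and `ceph pg dump_stuck inactive` would show which PGs are blocked and in what state.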