Sorry for the delay in replying; I was rerunning this failure a few times so I could summarise the results.

Tony,

Setup details:
Two storage boxes (each with 12 drives), each connected to 4 hosts. Each host owns 3 disks from its storage box, for a total of 24 OSDs. Failure domain is at the chassis level.

OSD tree:

ID   WEIGHT   TYPE NAME                 UP/DOWN  REWEIGHT
-1   164.2    root default
-7    82.08       chassis chassis1
-2    20.52           host host-1
 0     6.84              osd.0          up       1
 1     6.84              osd.1          up       1
 2     6.84              osd.2          up       1
-3    20.52           host host-2
 3     6.84              osd.3          up       1
 4     6.84              osd.4          up       1
 5     6.84              osd.5          up       1
-4    20.52           host host-3
 6     6.84              osd.6          up       1
 7     6.84              osd.7          up       1
 8     6.84              osd.8          up       1
-5    20.52           host host-4
 9     6.84              osd.9          up       1
10     6.84              osd.10         up       1
11     6.84              osd.11         up       1
-8    82.08       chassis chassis2
-6    20.52           host host-5
12     6.84              osd.12         up       1
13     6.84              osd.13         up       1
14     6.84              osd.14         up       1
-9    20.52           host host-6
15     6.84              osd.15         up       1
16     6.84              osd.16         up       1
17     6.84              osd.17         up       1
-10   20.52           host host-7
18     6.84              osd.18         up       1
19     6.84              osd.19         up       1
20     6.84              osd.20         up       1
-11   20.52           host host-8
21     6.84              osd.21         up       1
22     6.84              osd.22         up       1
23     6.84              osd.23         up       1

The cluster had ~30TB of data, and client IO was in progress throughout. After chassis1 underwent a power cycle:
1> All OSDs under chassis2 were intact, up & running.
2> All OSDs under chassis1 were down, as expected. But client IO was paused until all the hosts/OSDs under chassis1 came back up.

This issue was observed twice out of 5 attempts. Size is 2 & min_size is 1.

-Thanks,
Mallikarjun

On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris <nethfel@xxxxxxxxx> wrote:
> Sounds to me like you've put yourself at too much risk - *if* I'm reading
> your message right about your configuration, you have multiple hosts
> accessing OSDs that are stored on a single shared box - so if that single
> shared box (a single point of failure for multiple nodes) goes down, it's
> possible for multiple replicas to disappear at the same time, which could
> halt the operation of your cluster if the masters and the replicas are both
> on OSDs within that single shared storage system...
>
> On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
> <mallikarjuna.biradar@xxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> Setup details:
>> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
>> Failure domain is Chassis (enclosure) level. Replication count is 2.
>> Each host is allotted 4 drives.
>>
>> I have active client IO running on the cluster (random write profile
>> with 4M block size & 64 queue depth).
>>
>> One of the enclosures had a power loss, so all OSDs on the hosts
>> connected to that enclosure went down, as expected.
>>
>> But client IO got paused. After some time the enclosure & the hosts
>> connected to it came back up, and all OSDs on those hosts came up.
>>
>> Until then, the cluster was not serving IO. Once all hosts & OSDs
>> pertaining to that enclosure came up, client IO resumed.
>>
>> Can anybody help me understand why the cluster was not serving IO
>> during the enclosure failure? Or is it a bug?
>>
>> -Thanks & regards,
>> Mallikarjun Biradar
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
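
For reference, a minimal sketch of the pool and CRUSH settings this thread describes (replication size 2, min_size 1, replicas separated at chassis level). The pool name "rbd" and rule name "replicated_chassis" are assumptions for illustration; the thread never names the pool or rule:

    # Pool settings from the thread (pool name "rbd" is an assumed placeholder)
    ceph osd pool set rbd size 2          # 2 replicas per object
    ceph osd pool set rbd min_size 1      # keep serving IO with 1 replica left
    ceph osd pool get rbd min_size        # verify what is actually applied

    # CRUSH rule placing each replica in a different chassis, matching the
    # "failure domain is at chassis level" setup (rule name is assumed).
    # Decompile the current map with:
    #   ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt
    rule replicated_chassis {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type chassis
            step emit
    }
    # Recompile and inject:
    #   crushtool -c map.txt -o map.new && ceph osd setcrushmap -i map.new

With these settings, losing one whole chassis should leave every PG with one surviving replica, so with min_size 1 the cluster would be expected to keep serving IO. During the pause, `ceph health detail` and `ceph pg dump_stuck inactive` would show which PGs are blocked and in what state.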