What is the min_size setting for the pool? If you have size=2 and min_size=2, then all your data is safe when one replica is down, but IO is paused. If you want IO to continue, you need to set min_size=1 (example commands below the quoted mail). But be aware that a single further failure after that can mean losing data; you'd have to recover from the other replica if it comes back up and works - no idea how that works in Ceph, but it will likely be a PITA to do.

Jan

> On 09 Jul 2015, at 12:42, Mallikarjun Biradar <mallikarjuna.biradar@xxxxxxxxx> wrote:
> 
> Hi all,
> 
> Setup details:
> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
> Failure domain is chassis (enclosure) level. Replication count is 2.
> Each host is allotted 4 drives.
> 
> I have active client IO running on the cluster (random write profile
> with 4M block size & 64 queue depth).
> 
> One of the enclosures had a power loss, so all OSDs on the hosts
> connected to this enclosure went down, as expected.
> 
> But client IO paused. After some time the enclosure and the hosts
> connected to it came back up, and all OSDs on those hosts came up.
> 
> Until then, the cluster was not serving IO. Once all hosts and OSDs
> belonging to that enclosure were up, client IO resumed.
> 
> Can anybody help me understand why the cluster stopped serving IO
> during the enclosure failure? Or is it a bug?
> 
> -Thanks & regards,
> Mallikarjun Biradar
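
For reference, inspecting and changing this is done with the ceph CLI; a minimal sketch, where "mypool" is a placeholder for your actual pool name:

    # Show the current replication settings for the pool
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size

    # Allow IO to continue with only one replica left
    # (risky: losing that last replica means losing data)
    ceph osd pool set mypool min_size 1

Once the failed enclosure is back and recovery has finished, it would be sensible to set min_size back to 2.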