Down OSDs blocking read requests.

Iain Buclaw <ibuclaw@xxxxxxxxx> · Fri, 18 Nov 2016 12:53:01 +0100

Hi,

Following up on the suggestion to use any of the following options:

- client_mount_timeout
- rados_mon_op_timeout
- rados_osd_op_timeout

to limit how long clients stay blocked on requests: is there really
no other way around this?

If two OSDs go down that between them hold both copies of an
object, it would be nice to have clients fail *immediately*.  I've
tried reducing the rados_osd_op_timeout setting to 0.5, but when
things go wrong, it still results in the collapse of the cluster,
along with all reads from it.
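
For reference, here is a rough sketch of how a client could apply these
timeouts per connection through the python-rados bindings instead of
ceph.conf.  Only the 0.5 value for rados_osd_op_timeout comes from what
I described above; the other two values, the helper names
(fail_fast_conf, read_object), the conffile path, and the pool and
object names are assumptions made up for illustration.  I'd expect a
timed-out read to surface as rados.TimedOut, but treat that as an
assumption to verify against your librados version.

```python
def fail_fast_conf(osd_timeout=0.5, mon_timeout=5.0, mount_timeout=5.0):
    """Build per-client overrides for the three timeout options.

    Only osd_timeout=0.5 reflects a value actually tried; the other
    defaults are placeholders chosen for illustration.
    librados takes option values as strings.
    """
    return {
        "client_mount_timeout": str(mount_timeout),
        "rados_mon_op_timeout": str(mon_timeout),
        "rados_osd_op_timeout": str(osd_timeout),
    }


def read_object(pool, name, conffile="/etc/ceph/ceph.conf"):
    """Read one object, returning None if the operation times out.

    Hypothetical helper: needs the python-rados bindings and a
    reachable cluster, so the import is kept local to this function.
    """
    import rados

    cluster = rados.Rados(conffile=conffile, conf=fail_fast_conf())
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return ioctx.read(name)
        except rados.TimedOut:
            # Assumed behaviour: the op timeout is reported as ETIMEDOUT.
            return None
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Something like read_object("mypool", "myobject") would then return
either the object data or None, rather than blocking indefinitely.
This only bounds the wait; it does not make the client fail
immediately when both copies are unavailable.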

Reducing rados_osd_op_timeout further, down to 0.05, seems like a sure
way to cause more false positives.  But in reality, if an OSD operation
can't be served within 150ms, then it's missed the train by over an hour.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com