Re: Down OSDs blocking read requests.

John Spray <jspray@xxxxxxxxxx> · Fri, 18 Nov 2016 12:14:11 +0000

On Fri, Nov 18, 2016 at 11:53 AM, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
> Hi,
>
> Following up on the suggestion to use any of the following options
> to reduce the time clients spend blocked on requests:
>
> - client_mount_timeout
> - rados_mon_op_timeout
> - rados_osd_op_timeout
>
> Is there really no other way around this?
>
> If two OSDs go down that between them hold both copies of an
> object, it would be nice to have clients fail *immediately*.  I've
> tried reducing the rados_osd_op_timeout setting to 0.5, but when
> things go wrong, it still results in the collapse of the cluster and
> all reads from it.

Can you be more specific about what is happening when you set
rados_osd_op_timeout?  Are you not seeing timeouts at all, with
operations blocking instead?

If you can provide a short librados program that demonstrates an op
blocking indefinitely even when a timeout is set, that would be
useful.
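
As a rough starting point, a reproducer along these lines might do,
using the Python librados bindings (the "rados" module).  This is only
a sketch: the pool name, object name and conffile path are
placeholders, the helper names are made up for illustration, and it
only measures how long a single read takes and whether it raised.  It
does not assume what your cluster will actually do.

```python
import time


def time_op(op, *args):
    """Run op(*args); return (elapsed_seconds, result_or_exception)."""
    start = time.monotonic()
    try:
        outcome = op(*args)
    except Exception as exc:  # e.g. a librados timeout error
        outcome = exc
    return time.monotonic() - start, outcome


def read_with_timeout(pool, key, osd_timeout="0.5",
                      conffile="/etc/ceph/ceph.conf"):
    """Read one object with rados_osd_op_timeout set, timing the call.

    pool, key and conffile are placeholders for your own values.
    """
    import rados  # python-rados bindings, imported lazily

    # Config values are passed as strings; the timeout is in seconds.
    cluster = rados.Rados(conffile=conffile,
                          conf={"rados_osd_op_timeout": osd_timeout})
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return time_op(ioctx.read, key)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Calling read_with_timeout("mypool", "myobject") while the relevant OSDs
are down and printing the returned pair should show whether the read
fails after roughly the configured timeout or keeps blocking.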

John

>
> Reducing the rados_osd_op_timeout down to 0.05 seems like a sure way
> to cause more false positives.  But in reality, if an OSD operation
> can't be served in 150ms, then it's missed the train by over an hour.
>
> --
> Iain Buclaw
>
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com