Re: Slow responding OSDs are not OUTed and cause RBD client IO hangs

This can be tuned in the iSCSI initiator on VMware - look in the advanced settings on your ESX hosts (at least if you use the software initiator).
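
For the software initiator this usually means the per-adapter iSCSI
parameters. A rough esxcli sketch is below - vmhba33 is a placeholder for
your adapter name, and the parameter names and values should be verified
against your ESXi release:

  # list the adapters and the tunables the initiator exposes
  esxcli iscsi adapter list
  esxcli iscsi adapter param get --adapter=vmhba33

  # example: raise the recovery timeout (seconds) so a short Ceph stall
  # does not immediately trigger session recovery
  esxcli iscsi adapter param set --adapter=vmhba33 --key=RecoveryTimeout --value=30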

Jan


> On 23 Aug 2015, at 21:28, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
> Hi Alex,
> 
> Currently RBD+LIO+ESX is broken.
> 
> The problem is caused by the RBD device not handling device aborts properly
> causing LIO and ESXi to enter a death spiral together.
> 
> If something in the Ceph cluster causes an IO to take longer than 10
> seconds (I think!!!), ESXi submits an iSCSI abort message. Once this happens,
> as you have seen, it never recovers.
> 
> Mike Christie from Red Hat is doing a lot of work on this currently, so
> hopefully in the future there will be a direct RBD interface into LIO and it
> will all work much better.
> 
> Either tgt or SCST seem to be pretty stable in testing.
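
If anyone wants to try the tgt route Nick mentions, a minimal sketch of
exporting an RBD image through tgt could look like the lines below. The IQN,
pool and image names are placeholders, and it assumes tgt was built with RBD
support:

  # create the target and attach the RBD image as LUN 1
  tgtadm --lld iscsi --mode target --op new --tid 1 \
         --targetname iqn.2015-08.com.example:rbd-esx
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
         --bstype rbd --backing-store rbd/esx-lun0

  # allow initiators to log in (restrict the address in production)
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL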
> 
> Nick
> 
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Alex Gorbachev
>> Sent: 23 August 2015 02:17
>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject:  Slow responding OSDs are not OUTed and cause RBD
>> client IO hangs
>> 
>> Hello, this is an issue we have been suffering from and researching along
>> with a good number of other Ceph users, as evidenced by the recent posts.
>> In our specific case, these issues manifest themselves in an RBD -> iSCSI
>> LIO -> ESXi configuration, but the problem is more general.
>> 
>> When there is an issue on OSD nodes (examples: network hangs/blips, disk
>> HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
>> slowly or with significant delays.  ceph osd perf does not show this,
>> neither does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs
>> to a point where the client times out, crashes or displays other unsavory
>> behavior - operationally this crashes production processes.
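
One thing that does expose per-OSD slowness, if you can get onto the OSD
host, is the OSD admin socket. A rough example is below; the command names
are the Hammer-era ones and the socket path is the distribution default, so
adjust both for your setup:

  # run on the host carrying the suspect OSD (osd.4 in the output further down)
  ceph daemon osd.4 dump_ops_in_flight     # ops currently stuck inside the OSD
  ceph daemon osd.4 dump_historic_ops      # recent slow ops with per-stage timings

  # equivalent form addressing the admin socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_historic_ops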
>> 
>> Today in our lab we had a disk controller issue, which brought an OSD node
>> down.  Upon restart, the OSDs started up and rejoined the cluster.
>> However, immediately all IOs started hanging for a long time, and aborts
>> from ESXi -> LIO were not succeeding in canceling these IOs.  The only
>> warning I could see was:
>> 
>> root@lab2-mon1:/var/log/ceph# ceph health detail
>> HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
>> 30 ops are blocked > 2097.15 sec
>> 30 ops are blocked > 2097.15 sec on osd.4
>> 1 osds have slow requests
>> 
>> However, ceph osd perf is not showing high latency on osd 4:
>> 
>> root@lab2-mon1:/var/log/ceph# ceph osd perf
>> osd fs_commit_latency(ms) fs_apply_latency(ms)
>>  0                     0                   13
>>  1                     0                    0
>>  2                     0                    0
>>  3                   172                  208
>>  4                     0                    0
>>  5                     0                    0
>>  6                     0                    1
>>  7                     0                    0
>>  8                   174                  819
>>  9                     6                   10
>> 10                     0                    1
>> 11                     0                    1
>> 12                     3                    5
>> 13                     0                    1
>> 14                     7                   23
>> 15                     0                    1
>> 16                     0                    0
>> 17                     5                    9
>> 18                     0                    1
>> 19                    10                   18
>> 20                     0                    0
>> 21                     0                    0
>> 22                     0                    1
>> 23                     5                   10
>> 
>> SMART state for osd 4 disk is OK.  The OSD is up and in:
>> 
>> root@lab2-mon1:/var/log/ceph# ceph osd tree
>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -8        0 root ssd
>> -7 14.71997 root platter
>> -3  7.12000     host croc3
>> 22  0.89000         osd.22      up  1.00000          1.00000
>> 15  0.89000         osd.15      up  1.00000          1.00000
>> 16  0.89000         osd.16      up  1.00000          1.00000
>> 13  0.89000         osd.13      up  1.00000          1.00000
>> 18  0.89000         osd.18      up  1.00000          1.00000
>>  8  0.89000         osd.8       up  1.00000          1.00000
>> 11  0.89000         osd.11      up  1.00000          1.00000
>> 20  0.89000         osd.20      up  1.00000          1.00000
>> -4  0.47998     host croc2
>> 10  0.06000         osd.10      up  1.00000          1.00000
>> 12  0.06000         osd.12      up  1.00000          1.00000
>> 14  0.06000         osd.14      up  1.00000          1.00000
>> 17  0.06000         osd.17      up  1.00000          1.00000
>> 19  0.06000         osd.19      up  1.00000          1.00000
>> 21  0.06000         osd.21      up  1.00000          1.00000
>>  9  0.06000         osd.9       up  1.00000          1.00000
>> 23  0.06000         osd.23      up  1.00000          1.00000
>> -2  7.12000     host croc1
>>  7  0.89000         osd.7       up  1.00000          1.00000
>>  2  0.89000         osd.2       up  1.00000          1.00000
>>  6  0.89000         osd.6       up  1.00000          1.00000
>>  1  0.89000         osd.1       up  1.00000          1.00000
>>  5  0.89000         osd.5       up  1.00000          1.00000
>>  0  0.89000         osd.0       up  1.00000          1.00000
>>  4  0.89000         osd.4       up  1.00000          1.00000
>>  3  0.89000         osd.3       up  1.00000          1.00000
>> 
>> How can we proactively detect this condition?  Is there anything I can run
>> that will output all slow OSDs?
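
For what it's worth, a crude watchdog that polls ceph health detail and
reports the blocked-request lines (the only place the problem showed up in
the output above) could look like the sketch below. The text format can
change between releases, so treat the grep pattern as an assumption:

  #!/bin/sh
  # crude watchdog sketch: report any OSD the cluster flags with blocked requests
  while true; do
      ceph health detail 2>/dev/null \
        | grep -Eo 'blocked > [0-9.]+ sec on osd\.[0-9]+' \
        | awk '{print $NF}' | sort -u \
        | while read osd; do
              echo "$(date '+%F %T')  slow requests reported on ${osd}"
          done
      sleep 30
  done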
>> 
>> Regards,
>> Alex
> 
> 
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


