Re: Slow responding OSDs are not OUTed and cause RBD client IO hangs


 






> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Alex Gorbachev
> Sent: 24 August 2015 18:06
> To: Jan Schermer <jan@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re:  Slow responding OSDs are not OUTed and cause
> RBD client IO hangs
> 
> Hi Jan,
> 
> On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> > I never actually set up iSCSI with VMware, I just had to research various
> > VMware storage options when we had a SAN problem at a former job... But I
> > can take a look at it again if you want me to.
> 
> Thank you, I don't want to waste your time, as I have asked VMware TAP to
> research that - I will communicate back whatever they respond with.
> 
> >
> > Is it really deadlocked when this issue occurs?
> > What I think is partly responsible for this situation is that the iSCSI LUN
> > queues fill up and that's what actually kills your IO - VMware lowers queue
> > depth to 1 in that situation and it can take a really long time to recover
> > (especially if one of the LUNs on the target constantly has problems, or
> > when heavy IO hammers the adapter) - you should never fill this queue,
> > ever.
> > iSCSI will likely be an innocent victim in the chain, not the cause of the
> > issues.
> 
> Completely agreed, so iSCSI's job then is to properly communicate to the
> initiator that it cannot do what it is asked to do and quit the IO.

It's not a queue-full or queue-throttling issue. ESXi detects a slow IO (I
believe one that takes longer than 10 seconds) and then tries to send an abort
message to the target so it can retry. However, the RBD client doesn't handle
the abort message passed to it from LIO. I'm not quite sure what happens next,
but neither LIO nor ESXi makes the decision to ignore the abort, and so both
enter a standoff with each other.

> 
> >
> > Ceph should gracefully handle all those situations, you just need to set the
> > timeouts right. I have it set so that whatever happens, the OSD can only delay
> > work for 40s and then it is marked down - at that moment all IO starts flowing
> > again.
> 
> What setting in Ceph do you use to do that?  Is that
> mon_osd_down_out_interval?  I think stopping slow OSDs is the answer to
> the root of the problem - so far I only know to do "ceph osd perf"
> and look at latencies.
> 

You can maybe adjust some of the timeouts so that Ceph pauses for less time,
hopefully keeping all IO under 10s, but you increase the risk of OSDs randomly
dropping out, and there are probably still quite a few cases where IO could
take longer than 10s.
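
For reference, these are roughly the knobs involved - just a sketch with
illustrative values (they are close to the stock defaults, so check your own
release before changing anything):

[global]
# peers report an OSD down after this many seconds of missed heartbeats
osd heartbeat grace = 20
# how long a down OSD stays "in" before data starts rebalancing away
mon osd down out interval = 300
# ops outstanding longer than this get logged as slow requests
osd op complaint time = 30

Lowering osd heartbeat grace gets a wedged OSD marked down sooner, but it also
makes OSDs more likely to be flagged during a short network blip.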

> >
> > You should take this to VMware support, they should be able to tell
> > whether the problem is in the iSCSI target (then you can take a look at how
> > that behaves) or in the initiator settings. Though in my experience, after two
> > visits from their "foremost experts" I had to google everything myself because
> > they were clueless - YMMV.
> 
> I am hoping the TAP Elite team can do better...but we'll see...
> 
> >
> > The root cause is, however, slow ops in Ceph, and I have no idea why you'd
> > have them if the OSDs come back up - maybe one of them is really
> > deadlocked or backlogged in some way? I found that when OSDs are "dead
> > but up" they don't respond to "ceph tell osd.xxx ..." - so check whether they
> > all respond in a timely manner; that should help pinpoint the bugger.
> 
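A quick way to run that check across the whole cluster is a loop like the one
below - just a sketch, assuming the ceph CLI on that host has an admin keyring,
and the 10s timeout is an arbitrary choice:

for id in $(ceph osd ls); do
    # a "dead but up" OSD will stall or time out even on a trivial request
    printf 'osd.%s: ' "$id"
    if timeout 10 ceph tell osd.$id version >/dev/null 2>&1; then
        echo ok
    else
        echo "no response"
    fi
done
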
> I think I know in this case - there are some PCIe AER/bus errors and TLP
> Header messages strewn across the console of one OSD machine - ceph
> osd perf is showing latencies above a second per OSD, but only when IO is
> done to those OSDs.  I am thankful this is not production storage, but worried
> about this situation in production - the OSDs are staying up and in, but their
> latencies are slowing cluster-wide IO to a crawl.  I am trying to envision this
> situation in production and how one would find out what is slowing
> everything down without guessing.
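
On finding the culprit without guessing: one blunt option is to push IO at each
OSD in turn and compare the results, e.g. with osd bench - only a sketch, and
note that this does real writes on every OSD, so don't run it against a busy
production cluster:

for id in $(ceph osd ls); do
    echo "--- osd.$id ---"
    # write 64 MB in 4 MB chunks directly to this OSD and report throughput;
    # a struggling disk, HBA or host should stand out immediately
    ceph tell osd.$id bench 67108864 4194304
done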
> 
> Regards,
> Alex
> 
> 
> >
> > Jan
> >
> >
> >> On 24 Aug 2015, at 18:26, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>> This can be tuned in the iSCSI initiator on VMware - look in the advanced
> >>> settings on your ESX hosts (at least if you use the software initiator).
> >>
> >> Thanks, Jan. I asked this question of VMware as well; I think the
> >> problem is specific to a given iSCSI session, so I am wondering if that's
> >> strictly the job of the target?  Do you know of any specific SCSI
> >> settings that mitigate this kind of issue?  Basically, give up on a
> >> session and terminate it and start a new one should an RBD not
> >> respond?
> >>
> >> As I understand it, RBD simply never gives up.  If an OSD does not
> >> respond but is still technically up and in, Ceph will retry IOs
> >> forever.  I think RBD and Ceph need a timeout mechanism for this.
> >>
> >> Best regards,
> >> Alex
> >>
> >>> Jan
> >>>
> >>>
> >>>> On 23 Aug 2015, at 21:28, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> Currently RBD+LIO+ESX is broken.
> >>>>
> >>>> The problem is caused by the RBD device not handling device aborts
> >>>> properly, causing LIO and ESXi to enter a death spiral together.
> >>>>
> >>>> If something in the Ceph cluster causes an IO to take longer than
> >>>> 10 seconds (I think!!!), ESXi submits an iSCSI abort message. Once
> >>>> this happens, as you have seen, it never recovers.
> >>>>
> >>>> Mike Christie from Red Hat is doing a lot of work on this currently,
> >>>> so hopefully in the future there will be a direct RBD interface
> >>>> into LIO and it will all work much better.
> >>>>
> >>>> Either tgt or SCST seems to be pretty stable in testing.
> >>>>
> >>>> Nick
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>> Behalf Of Alex Gorbachev
> >>>>> Sent: 23 August 2015 02:17
> >>>>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> >>>>> Subject:  Slow responding OSDs are not OUTed and cause
> >>>>> RBD client IO hangs
> >>>>>
> >>>>> Hello, this is an issue we have been suffering from and
> >>>>> researching along with a good number of other Ceph users, as
> >>>>> evidenced by the recent posts.
> >>>>> In our specific case, these issues manifest themselves in an RBD ->
> >>>>> iSCSI LIO -> ESXi configuration, but the problem is more general.
> >>>>>
> >>>>> When there is an issue on OSD nodes (examples: network
> >>>>> hangs/blips, disk HBAs failing, driver issues, page cache/XFS
> >>>>> issues), some OSDs respond slowly or with significant delays.
> >>>>> ceph osd perf does not show this, and neither
> >>>>> does ceph osd tree or ceph -s / ceph -w.  Instead, the RBD IO hangs to a
> >>>>> point where the client times out, crashes, or displays other unsavory
> >>>>> behavior - operationally this crashes production processes.
> >>>>>
> >>>>> Today in our lab we had a disk controller issue, which brought an
> >>>>> OSD node down.  Upon restart, the OSDs started up and rejoined
> >>>>> the cluster.
> >>>>> However, immediately all IOs started hanging for a long time, and aborts
> >>>>> from ESXi -> LIO were not succeeding in canceling these IOs.  The only
> >>>>> warning I could see was:
> >>>>>
> >>>>> root@lab2-mon1:/var/log/ceph# ceph health detail
> >>>>> HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
> >>>>> 30 ops are blocked > 2097.15 sec
> >>>>> 30 ops are blocked > 2097.15 sec on osd.4
> >>>>> 1 osds have slow requests
> >>>>>
> >>>>> However, ceph osd perf is not showing high latency on osd 4:
> >>>>>
> >>>>> root@lab2-mon1:/var/log/ceph# ceph osd perf
> >>>>> osd fs_commit_latency(ms) fs_apply_latency(ms)
> >>>>> 0                     0                   13
> >>>>> 1                     0                    0
> >>>>> 2                     0                    0
> >>>>> 3                   172                  208
> >>>>> 4                     0                    0
> >>>>> 5                     0                    0
> >>>>> 6                     0                    1
> >>>>> 7                     0                    0
> >>>>> 8                   174                  819
> >>>>> 9                     6                   10
> >>>>> 10                     0                    1
> >>>>> 11                     0                    1
> >>>>> 12                     3                    5
> >>>>> 13                     0                    1
> >>>>> 14                     7                   23
> >>>>> 15                     0                    1
> >>>>> 16                     0                    0
> >>>>> 17                     5                    9
> >>>>> 18                     0                    1
> >>>>> 19                    10                   18
> >>>>> 20                     0                    0
> >>>>> 21                     0                    0
> >>>>> 22                     0                    1
> >>>>> 23                     5                   10
> >>>>>
> >>>>> SMART state for the osd 4 disk is OK.  The OSD is up and in:
> >>>>>
> >>>>> root@lab2-mon1:/var/log/ceph# ceph osd tree
> >>>>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >>>>> -8        0 root ssd
> >>>>> -7 14.71997 root platter
> >>>>> -3  7.12000     host croc3
> >>>>> 22  0.89000         osd.22      up  1.00000          1.00000
> >>>>> 15  0.89000         osd.15      up  1.00000          1.00000
> >>>>> 16  0.89000         osd.16      up  1.00000          1.00000
> >>>>> 13  0.89000         osd.13      up  1.00000          1.00000
> >>>>> 18  0.89000         osd.18      up  1.00000          1.00000
> >>>>> 8  0.89000         osd.8       up  1.00000          1.00000
> >>>>> 11  0.89000         osd.11      up  1.00000          1.00000
> >>>>> 20  0.89000         osd.20      up  1.00000          1.00000
> >>>>> -4  0.47998     host croc2
> >>>>> 10  0.06000         osd.10      up  1.00000          1.00000
> >>>>> 12  0.06000         osd.12      up  1.00000          1.00000
> >>>>> 14  0.06000         osd.14      up  1.00000          1.00000
> >>>>> 17  0.06000         osd.17      up  1.00000          1.00000
> >>>>> 19  0.06000         osd.19      up  1.00000          1.00000
> >>>>> 21  0.06000         osd.21      up  1.00000          1.00000
> >>>>> 9  0.06000         osd.9       up  1.00000          1.00000
> >>>>> 23  0.06000         osd.23      up  1.00000          1.00000
> >>>>> -2  7.12000     host croc1
> >>>>> 7  0.89000         osd.7       up  1.00000          1.00000
> >>>>> 2  0.89000         osd.2       up  1.00000          1.00000
> >>>>> 6  0.89000         osd.6       up  1.00000          1.00000
> >>>>> 1  0.89000         osd.1       up  1.00000          1.00000
> >>>>> 5  0.89000         osd.5       up  1.00000          1.00000
> >>>>> 0  0.89000         osd.0       up  1.00000          1.00000
> >>>>> 4  0.89000         osd.4       up  1.00000          1.00000
> >>>>> 3  0.89000         osd.3       up  1.00000          1.00000
> >>>>>
> >>>>> How can we proactively detect this condition?  Is there anything I
> >>>>> can run that will output all slow OSDs?
> >>>>>
> >>>>> Regards,
> >>>>> Alex
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


