Re: Update on crash with kernel 3.19

Mike Christie <mchristi@xxxxxxxxxx> · Mon, 01 Jun 2015 21:55:34 -0500

On 06/01/2015 09:18 AM, Alex Gorbachev wrote:
> Hi Mike,
> 
> On Fri, May 29, 2015 at 3:11 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>> Just want to make sure we are on the same page. The logs you have been
>> posting the last week do not show crashes do they (there are no null
>> pointers or kernel lock ups)? They are just showing error handling
>> issues. I just want to make sure you are no longer seeing the crashes
>> with the newer kernels like we saw before.
> 
> That is true, it may be due to the newer kernels.  We also implemented
> a fast monitoring system, which alerts us whenever there is an issue -
> so that may play a factor as we would proactively reboot or fail over
> services.  I did notice that after a few timeouts the system becomes
> sluggish or unresponsive, and to clear that it takes a reboot.  The
> good news is that we are seeing fewer timeouts overall (none last
> week).
> 
>> For the timeouts/abort_tasks, the issue is either that LIO box is locked
>> up in some way that RBD IO is not executing like we saw with the
>> workqueue issues in older kernels, or there is just some issue on the
>> OSD boxes or connection to them that is not allowing IO to execute like
>> normal. This is the normal case and for this, you need to post the logs
>> on the ceph list if there is anything suspicious.
> 
> Is there any way to tell which backstore device is causing the
> timeouts by the LIO messages?  Then we could map it to RBD and then
> OSD and PG.  As it is right now, ceph logs are not showing any errors
> when these events happen.

On the LIO box you could try turning on more debugging, but if you just
look at the vmkernel log on the ESXi box, you already have the info.

On the ESXi box in the logs you see error messages with the device id like:

naa.6001405a37e54d400000000000000000

or something with the host, channel, target, and LUN identifier like

 iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:3 T:22 L:1 : Task mgmt "Abort Task" with itt=0x10d4450 (refITT=0x10d444f) timed ou

You then just map that to the device on the LIO box. So for example, if
you know the LUN (1 in the example above), look at the LIO target config
and see which rbd device you exported as LUN 1.
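
If you want to pull the LUN out of the vmkernel log with a script, a
rough sketch in Python could look like the following. The regex is based
only on the one example line above, and parse_vmk_path is a made-up
helper name, so adjust it to whatever your logs actually print.

```python
import re

# Matches the "vmhbaNN:CH:c T:t L:l" path as it appears in the example
# log line above. Other vmkernel messages may format this differently.
PATH_RE = re.compile(
    r"(?P<adapter>vmhba\d+):CH:(?P<channel>\d+)\s+"
    r"T:(?P<target>\d+)\s+L:(?P<lun>\d+)"
)


def parse_vmk_path(line):
    """Return adapter/channel/target/lun from a log line, or None."""
    match = PATH_RE.search(line)
    if match is None:
        return None
    return {
        "adapter": match.group("adapter"),
        "channel": int(match.group("channel")),
        "target": int(match.group("target")),
        "lun": int(match.group("lun")),
    }


example = 'iscsivmk_TaskMgmtIssue: vmhba33:CH:3 T:22 L:1 : Task mgmt "Abort Task"'
print(parse_vmk_path(example))
```

The "lun" value is what you would then look up in the LIO target config
to find the exported rbd device.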

What is your value for:

osd op complaint time

Lower it to around 10 or 15 seconds, so it is lower than the ESXi timeout
that would result in abort tasks being sent. That way the OSDs should
log the slow requests before ESXi gives up on the command.