Re: Update on crash with kernel 3.19

Hi Mike,

Thank you for your responses.  I have sent last night's logs in a
separate email.  Basically, in the absence of any Ceph-related
warnings, there seems to be a large set of ESXi rescans, followed by a
message such as:

Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login
timeout on Network Portal 10.70.2.211:3260

Then more errors follow.  I have attached the syslog from the LIO
node, as well as logs from one of the six ESXi nodes (the one with the
most storage attached).

Do you think there is an ESXi scan storm or something similar that
breaks LIO?  Or is this still a Ceph-related issue?  I did change the
osd op complaint time as you suggested: it was set to 30 and I lowered
it to 10.
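
For reference, assuming the usual ceph.conf syntax and stock Ceph
tooling (the exact steps on our nodes may have differed slightly), the
change amounts to:

    # in ceph.conf, under [osd] (or [global]):
    osd op complaint time = 10

    # apply to running OSDs without a restart:
    ceph tell osd.* injectargs '--osd_op_complaint_time 10'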

Thank you,
Alex

On Mon, Jun 1, 2015 at 10:55 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
> On 06/01/2015 09:18 AM, Alex Gorbachev wrote:
>> Hi Mike,
>>
>> On Fri, May 29, 2015 at 3:11 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>> Just want to make sure we are on the same page. The logs you have been
>>> posting the last week do not show crashes, do they (there are no null
>>> pointers or kernel lockups)? They are just showing error-handling
>>> issues. I just want to make sure you are no longer seeing the crashes
>>> with the newer kernels like we saw before.
>>
>> That is true; it may be due to the newer kernels.  We also implemented
>> a fast monitoring system, which alerts us whenever there is an issue,
>> so that may be a factor, as we proactively reboot or fail over
>> services.  I did notice that after a few timeouts the system becomes
>> sluggish or unresponsive, and it takes a reboot to clear that.  The
>> good news is that we are seeing fewer timeouts overall (none last
>> week).
>>
>>> For the timeouts/abort_tasks, the issue is either that the LIO box is
>>> locked up in some way such that RBD IO is not executing, like we saw
>>> with the workqueue issues in older kernels, or there is some issue on
>>> the OSD boxes or the connection to them that is not allowing IO to
>>> execute normally. The latter is the more common case, and for it you
>>> need to post the logs on the ceph list if there is anything suspicious.
>>
>> Is there any way to tell from the LIO messages which backstore device
>> is causing the timeouts?  Then we could map it to the RBD image, and
>> from there to the OSD and PG.  As it is right now, the ceph logs are
>> not showing any errors when these events happen.
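
(Assuming a format-2 rbd image and stock Ceph tooling, one way to carry
that mapping through to the PG and OSDs is to read the image's
object-name prefix from rbd info and then ask ceph where one of its
objects lives; the pool, image, and object names below are made up:

    rbd info rbd/esx-lun1          # note the block_name_prefix line
    ceph osd map rbd rbd_data.1234abcd.0000000000000000

ceph osd map prints the PG id and the acting OSD set for that object.)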
>
> On the LIO box you could try turning on more debugging, but if you just
> look at the vmkernel log on the ESXi box, you already have the info.
>
> In the ESXi logs you will see error messages with the device id, like:
>
> naa.6001405a37e54d400000000000000000
>
> or something with the host, channel, target, and LUN identifiers, like:
>
>  iscsi_vmk:
> iscsivmk_TaskMgmtIssue: vmhba33:CH:3 T:22 L:1 : Task mgmt "Abort Task"
> with itt=0x10d4450 (refITT=0x10d444f) timed out
>
> You then just map that to the device on the LIO box. So for
> example, if you know the LUN (1 in the example above), just look at
> the LIO target config and see which rbd device you exported as LUN 1.
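
(Assuming targetcli and the standard LIO configfs layout, that lookup
might go something like this, with the iqn and device names as
placeholders:

    targetcli ls /iscsi              # lists each LUN with its backstore
    ls -l /sys/kernel/config/target/iscsi/<iqn>/tpgt_1/lun/lun_1/
                                     # the symlink points at the backstore
    rbd showmapped                   # maps /dev/rbdX back to pool/image

If the backstore is an iblock device on /dev/rbdX, rbd showmapped gives
the pool and image name directly.)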
>
> What is your value for:
>
> osd op complaint time
>
> Lower it to around 10 or 15 secs, so it is lower than the ESXi timeout
> that would result in abort tasks being sent.
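
(To check the running value first, the OSD admin socket can report it;
assuming the default admin socket setup on an OSD node:

    ceph daemon osd.0 config get osd_op_complaint_time

The same socket accepts "config set osd_op_complaint_time 10" for a
per-daemon change.)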