Re: reproducible rbd-nbd crashes

Marc Schöchlin <ms@xxxxxxxxxx> · Mon, 22 Jul 2019 13:00:47 +0200

Hello Mike,

I have added my comments inline.

On 19.07.19 at 22:20, Mike Christie wrote:
>
>> We have ~500 heavy load rbd-nbd devices in our xen cluster (rbd-nbd 12.2.5, kernel 4.4.0+10, centos clone) and ~20 high load krbd devices (kernel 4.15.0-45, ubuntu 16.04) - we never experienced problems like this.
>> We only experience problems like this with rbd-nbd > 12.2.5 on ubuntu 16.04 (kernel 4.15) or ubuntu 18.04 (kernel 4.15) with erasure encoding or without.
>>
> Are you only using the nbd_set_timeout tool for this newer kernel combo
> to try and workaround the disconnect+io_errors problem in newer kernels,
> or did you use that tool to set a timeout with older kernels? I am just
> trying to clarify the problem, because the kernel changed behavior and I
> am not sure if your issue is the very slow IO or that the kernel now
> escalates its error handler by default.
I only use nbd_set_timeout with the 4.15 kernels on Ubuntu 16.04 and 18.04, because we experienced problems during "fstrim" activities a few weeks ago.
Adding timeouts of 60 seconds seemed to help, but did not solve the problem completely.

The situation described in my request is a different one, but it seems to have the same root cause.

Not using the nbd_set_timeout tool results in the same problems, only more prominently :-)
(tested by unloading the nbd module and re-executing the test)
>
> With older kernels no timeout would be set for each command by default,
> so if you were not running that tool then you would not see the nbd
> disconnect+io_errors+xfs issue. You would just see slow IOs.
>
> With newer kernels, like 4.15, nbd.ko always sets a per command timeout
> even if you do not set it via a nbd ioctl/netlink command. By default
> the timeout is 30 seconds. After the timeout period then the kernel does
> that disconnect+IO_errors error handling which causes xfs to get errors.
>
Did I understand you correctly: setting an unlimited timeout should prevent crashes on kernel 4.15?

For testing purposes I set the timeout to unlimited ("nbd_set_ioctl /dev/nbd0 0", on the already mounted device).
I re-executed the problem procedure and found that the compression procedure does not crash at the same file, but 30 seconds later, with the same crash behavior.
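
For reference, here is a minimal sketch of how a timeout can be set on an nbd device through the NBD_SET_TIMEOUT ioctl. It is not the source of the tool used above. The function name set_nbd_timeout is hypothetical, and the request number assumes the generic Linux _IO(0xab, 9) encoding from <linux/nbd.h> as found on x86 and ARM. The sketch only passes the value through; it makes no claim about how a given kernel version interprets a value of 0.

```python
import fcntl
import os

# NBD ioctls use type 0xab; NBD_SET_TIMEOUT is number 9 in <linux/nbd.h>.
# Assumption: generic _IO() encoding (no direction bits), as on x86/ARM.
NBD_IOCTL_TYPE = 0xAB
NBD_SET_TIMEOUT = (NBD_IOCTL_TYPE << 8) | 9


def set_nbd_timeout(device: str, seconds: int) -> None:
    """Ask the kernel to use a per-command timeout on an nbd device.

    Hypothetical helper for illustration; needs root and an existing
    device node such as /dev/nbd0.
    """
    if seconds < 0:
        raise ValueError("timeout must be >= 0 seconds")
    fd = os.open(device, os.O_RDWR)
    try:
        # The timeout is passed by value as the ioctl argument.
        fcntl.ioctl(fd, NBD_SET_TIMEOUT, seconds)
    finally:
        os.close(fd)


# Example (as root): set_nbd_timeout("/dev/nbd0", 60)
```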

Regards
Marc
