Hello Mike,

I attached inline comments.

On 19.07.19 at 22:20, Mike Christie wrote:
>
>> We have ~500 heavy load rbd-nbd devices in our xen cluster (rbd-nbd 12.2.5, kernel 4.4.0+10, centos clone) and ~20 high load krbd devices (kernel 4.15.0-45, ubuntu 16.04) - we never experienced problems like this.
>> We only experience problems like this with rbd-nbd > 12.2.5 on ubuntu 16.04 (kernel 4.15) or ubuntu 18.04 (kernel 4.15) with erasure encoding or without.
>>
> Are you only using the nbd_set_timeout tool for this newer kernel combo
> to try and workaround the disconnect+io_errors problem in newer kernels,
> or did you use that tool to set a timeout with older kernels? I am just
> trying to clarify the problem, because the kernel changed behavior and I
> am not sure if your issue is the very slow IO or that the kernel now
> escalates its error handler by default.

I only use nbd_set_timeout with the 4.15 kernels on Ubuntu 16.04 and 18.04, because we experienced problems during "fstrim" activities a few weeks ago. Adding timeouts of 60 seconds seemed to help, but did not solve the problem completely.

The problem described in my request is a different situation, but it seems to have the same root cause.

Not using the nbd_set_timeout tool results in the same, but more prominent, problem situations :-)
(tested by unloading the nbd module and re-executing the test)

>
> With older kernels no timeout would be set for each command by default,
> so if you were not running that tool then you would not see the nbd
> disconnect+io_errors+xfs issue. You would just see slow IOs.
>
> With newer kernels, like 4.15, nbd.ko always sets a per command timeout
> even if you do not set it via a nbd ioctl/netlink command. By default
> the timeout is 30 seconds. After the timeout period then the kernel does
> that disconnect+IO_errors error handling which causes xfs to get errors.
>

Did I understand you correctly: setting an unlimited timeout should prevent crashes on kernel 4.15?

For testing purposes I set the timeout to unlimited ("nbd_set_ioctl /dev/nbd0 0", on an already mounted device).
I re-executed the problem procedure and found that the compression procedure does not crash at the same file, but crashes 30 seconds later with the same crash behavior.

Regards
Marc
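
For reference, setting a per-device nbd timeout from userspace can be sketched as follows. This is only a minimal Python sketch and not the nbd_set_timeout tool itself: it assumes such a tool does little more than issue the NBD_SET_TIMEOUT ioctl from linux/nbd.h, and the request number assumes an architecture (for example x86) where _IO(0xab, 9) evaluates to 0xAB09. The helper name set_nbd_timeout is hypothetical, and the sketch makes no claim about how a given kernel interprets a value of 0.

```python
import fcntl
import os

# NBD_SET_TIMEOUT is defined as _IO(0xab, 9) in linux/nbd.h.
# Assumption: on x86 this evaluates to 0xAB09; other architectures
# may encode _IO() differently.
NBD_SET_TIMEOUT = (0xAB << 8) | 9


def set_nbd_timeout(device: str, seconds: int) -> None:
    """Hypothetical helper: hand a timeout in seconds to the nbd driver."""
    fd = os.open(device, os.O_RDWR)
    try:
        # An integer third argument is passed to the kernel as the ioctl value.
        fcntl.ioctl(fd, NBD_SET_TIMEOUT, seconds)
    finally:
        os.close(fd)
```

Calling set_nbd_timeout("/dev/nbd0", 60) as root would request the 60 second timeout mentioned above.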