Hello Jason,

it seems that there is something wrong in the rbd-nbd implementation (I added this information to https://tracker.ceph.com/issues/40822 as well).

The problem does not seem to be related to kernel releases, filesystem types, or the ceph and network setup. Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to show the described problem.

Last night an 18-hour test run with the following procedure was successful:

-----
#!/bin/bash
set -x
while true; do
   date
   # compress/decompress up to 50 files at a time, 10 processes in parallel
   # (head -z keeps the NUL-delimited list from -print0 intact)
   find /srv_ec -type f -name "*.MYD" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gzip -v
   date
   find /srv_ec -type f -name "*.MYD.gz" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gunzip -v
done
-----

Previous tests crashed in a reproducible manner with "-P 1" (only a single gzip/gunzip process at a time) after a few minutes and up to 45 minutes.

Overview of my tests:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
  -> 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs into high system load, system is almost unusable, unable to shut down the system, hard reset of the VM necessary, manual exclusive lock removal is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

All device timeouts were set separately by the nbd_set_ioctl tool, because the luminous rbd-nbd does not provide a way to define timeouts.
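For clarity, the per-device setup looked roughly like this; the pool/image name is a placeholder and the exact nbd_set_ioctl invocation is only sketched (the tool does nothing more than issue the NBD_SET_TIMEOUT ioctl against the mapped device):

-----
#!/bin/bash
# sketch of the per-device test setup -- pool/image name is a placeholder
DEV=$(rbd-nbd map rbd_ec/testimage)   # rbd-nbd prints the device, e.g. /dev/nbd0
nbd_set_ioctl "$DEV" 120              # assumed syntax: <device> <timeout-seconds>
mount "$DEV" /srv_ec                  # file system used by the test script above
-----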
What's next? Is it a good idea to do a binary search between 12.2.12 and 12.2.5?

From my point of view (without in-depth knowledge of rbd-nbd/librbd), my assumption is that this problem might be caused by rbd-nbd code and not by librbd. The probability that a bug like this survives undiscovered in librbd for such a long time seems low to me :-)

Regards
Marc

On 29.07.19 at 22:25, Marc Schöchlin wrote:
> Hello Jason,
>
> I updated the ticket https://tracker.ceph.com/issues/40822
>
> On 24.07.19 at 19:20, Jason Dillaman wrote:
>> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>>> Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me, because the ceph apt source does not contain that version.
>>> Do you know a package source?
>> All the upstream packages should be available here [1], including 12.2.5.
> Ah okay, I will test this tomorrow.
>> Did you pull the OSD blocked ops stats to figure out what is going on
>> with the OSDs?
> Yes, see the referenced data in the ticket https://tracker.ceph.com/issues/40822#note-15
>
> Regards
> Marc
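For reference, the blocked and in-flight ops mentioned above can be dumped per OSD via the admin socket on the respective OSD host; the OSD ids below are only examples:

-----
#!/bin/bash
# dump blocked and in-flight ops for some suspect OSDs (example ids);
# run on the host that holds the respective OSD's admin socket
for id in 0 1 2; do
    echo "=== osd.$id ==="
    ceph daemon osd.$id dump_blocked_ops
    ceph daemon osd.$id dump_ops_in_flight
done
-----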