Re: reproducible rbd-nbd crashes

Hello Jason,

thanks for your response.
See my inline comments.

Am 31.07.19 um 14:43 schrieb Jason Dillaman:
On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin <ms@xxxxxxxxxx> wrote:


The problem does not seem to be related to kernel release, filesystem type, or the ceph and network setup.
Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to have the described problem.
 ...

It's basically just a log message tweak and some changes to how the
process is daemonized. If you could re-test w/ each release after
12.2.5 and pin-point where the issue starts occurring, we would have
something more to investigate.

Are there changes related to https://tracker.ceph.com/issues/23891?


You showed me the very small set of changes in rbd-nbd itself.
What about librbd, librados, ...?

What else can we do to find a detailed reason for the crash?
Do you think it would be useful to activate core-dump creation for that process?
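
Activating core dumps mostly amounts to removing the core size limit in the shell that maps the device and pointing the kernel at a writable dump directory. A minimal sketch — the /var/crash path, the core_pattern format, and the gdb step are assumptions, adjust for your distribution:

```shell
# 1. Remove the per-process core size limit in the shell that will
#    (re-)map the device, so rbd-nbd inherits it.
ulimit -c unlimited
ulimit -c   # should now print "unlimited"

# 2. As root, tell the kernel where to write cores
#    (%e = executable, %p = pid, %t = timestamp) -- shown, not executed here:
#    echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern

# 3. Re-map the image from this same shell and re-run the workload:
#    rbd-nbd map pool/image
# After the next crash, open the core with matching debug symbols installed:
#    gdb /usr/bin/rbd-nbd /var/crash/core.rbd-nbd.<pid>.<time>
```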

What's next? Is it a good idea to do a binary search between 12.2.5 and 12.2.12?
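
With six point releases between the last known-good and a known-bad build, a bisection needs only about three test runs instead of six. A minimal sketch of the bisection order, assuming the failure persists once introduced; the fake `is_bad` predicate (which pretends the regression landed in 12.2.10, matching the observation above) stands in for the real stress test:

```shell
#!/bin/bash
# Luminous point releases between known-good 12.2.5 and known-bad 12.2.12.
releases=(12.2.6 12.2.7 12.2.8 12.2.9 12.2.10 12.2.11)

# Hypothetical predicate: replace with "install release $1, run the xfs/ext4
# stress workload, return 0 if the rbd-nbd map disappears".
is_bad() {
    minor=${1##*.}              # last version component, e.g. "10"
    [ "$minor" -ge 10 ]
}

# Classic bisection: narrow [lo, hi] to the first release where is_bad holds.
first_bad() {
    lo=0; hi=$((${#releases[@]} - 1))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if is_bad "${releases[mid]}"; then
            hi=$mid             # failure present: regression is at mid or earlier
        else
            lo=$((mid + 1))     # failure absent: regression is after mid
        fi
    done
    echo "${releases[lo]}"
}

first_bad   # prints the first release that reproduces the crash
```

With the fake predicate above this prints 12.2.10 after testing only 12.2.8, 12.2.10, and 12.2.9.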

Due to the absence of a coworker I had almost no capacity to run deeper tests on this problem.
But I can say that I reproduced the problem with release 12.2.12 as well.

The new (updated) list:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
  -> 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system runs in a high system load, system is almost unusable, unable to shutdown the system, hard reset of vm necessary, manual exclusive lock removal is necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created

Regards
Marc

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
