Hello Mike,

see my inline comments.

On 14.08.19 at 02:09, Mike Christie wrote:
>> Previous tests crashed in a reproducible manner with "-P 1" (single io
>> gzip/gunzip) after a few minutes up to 45 minutes.
>>
>> Overview of my tests:
>>
>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
>>   -> 18 hour testrun was successful, no dmesg output
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>>   -> failed after < 10 minutes
>>   -> system runs under very high load, system is almost unusable, unable to shut down the system, hard reset of the VM necessary, manual exclusive lock removal is necessary before remapping the device
>>
>> There is something new compared to yesterday: three days ago I downgraded a
>> production system to client version 12.2.5.
>>
>> - FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s device timeout
>>   -> crashed in production while snapshot trimming is running on that pool
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created without reboot
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>>   -> parallel krbd device usage with 99% io usage worked without a problem while running the test
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>>   -> failed after < 10 minutes
>>   -> system runs under very high load, system is almost unusable, unable to shut down the system, hard reset of the VM necessary, manual exclusive lock removal is necessary before remapping the device
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>> - FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, map/mount can be re-created
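As a reference for the "120s device timeout" used in the overview above, here is
a minimal sketch of how such a test mapping is typically created and mounted.
The pool/image names and the mount point are placeholders, and it assumes the
rbd-nbd build in use supports the --timeout option (the "no timeout" runs
simply omit it):

    rbd-nbd map --timeout 120 rbd_pool/test_image   # prints the device on success, e.g. /dev/nbd0
    mkfs.xfs /dev/nbd0                               # or mkfs.ext4, matching the test matrix above
    mkdir -p /mnt/rbd-nbd-test
    mount /dev/nbd0 /mnt/rbd-nbd-test                # the gzip/gunzip workload then runs here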
> How many CPUs and how much memory does the VM have?

Characteristics of the crashed VM: see attached file.

> I'm not sure which test it covers above, but for
> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like the
> command that probably triggered the timeout got stuck in safe_write or
> write_fd, because we see:
>
> // Command completed, and right after this log message we try to write the
> // reply and data to the nbd.ko module.
> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got: [4500000000000000 READ 24043755000~20000 0]
>
> // We got stuck, 2 minutes go by and the timeout fires. That kills the
> // socket, so we get an error here, and after that rbd-nbd is going to exit.
> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000 READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
>
> We could hit this in a couple of ways:
>
> 1. The block layer sends a command that is larger than the socket's send
> buffer limits. These are the values you sometimes set in sysctl.conf like:
>
>     net.core.rmem_max
>     net.core.wmem_max
>     net.core.rmem_default
>     net.core.wmem_default
>     net.core.optmem_max

Memory was definitely not low; we only had 10% memory usage at the time of the crash.

> There does not seem to be any checks/code to make sure there is some
> alignment with these limits. I will send a patch, but that will not help you
> right now. The max io size for nbd is 128k, so make sure your net values are
> large enough. Increase the values in sysctl.conf and retry if they were too
> small.
>
> Not sure what I was thinking. Just checked the logs and we have done IO of
> the same size that got stuck and it was fine, so the socket sizes should be
> ok. We still need to add code to make sure IO sizes and the af_unix socket
> size limits match up.
>
> 2. If memory is low on the system, we could be stuck trying to allocate
> memory in the kernel in that code path too. rbd-nbd just uses more memory
> per device, so it could be why we do not see a problem with krbd.
>
> 3. I wonder if we are hitting a bug with PF_MEMALLOC that Ilya hit with
> krbd. He removed that code from krbd. I will ping him on that.
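Regarding the socket buffer limits in point 1, purely for reference: a rough
sketch of how one might inspect the current limits and the largest IO the block
layer will send to the nbd device, and raise the limits via a sysctl drop-in.
The file name and numeric values are illustrative only, and per the note above
the existing sizes were probably already sufficient here:

    # inspect current socket buffer limits and the max IO size for the mapped device
    sysctl net.core.rmem_max net.core.wmem_max net.core.rmem_default net.core.wmem_default net.core.optmem_max
    cat /sys/block/nbd0/queue/max_sectors_kb

    # illustrative values only - raise the limits and reload
    cat > /etc/sysctl.d/90-nbd-sockets.conf <<'EOF'
    net.core.rmem_max = 8388608
    net.core.wmem_max = 8388608
    net.core.rmem_default = 1048576
    net.core.wmem_default = 1048576
    net.core.optmem_max = 131072
    EOF
    sysctl --system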
Interesting. I have activated core dumps for those processes - probably we can
find something interesting there...
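For completeness, one possible way to enable core dumps for a long-running
rbd-nbd process; the core_pattern path is just an example:

    mkdir -p /var/crash
    sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'
    ulimit -c unlimited        # in the shell (or service unit) that starts rbd-nbd
    # then (re)map the device from that environment so the limit applies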
Regards

Attachment:
sysctl_settings.txt.gz
Description: application/gzip